Compare commits

125 Commits

Author SHA1 Message Date
f22d83df14 feat(#47 phase 0): centralize OpenAI error envelope + add Retry-After
Some checks failed
The rejection contract (#63) requires every "no" path to speak the
OpenAI envelope with standard codes and, for retryable conditions, a
Retry-After header. Two gaps remained despite #63 being closed:
Retry-After was implemented nowhere, and the envelope was hand-built
inline in four places (gateway handlers/proxy/router, neuron api) with
no shared source of truth — exactly the inconsistency #63 set out to
prevent, and a foundation every Stage 1-2 rejection (401/429/503) needs.

- cortex-core: new `error_envelope::OpenAiError` — an axum-agnostic
  builder carrying status, type, code, message, param, optional
  retry_after, and diagnostic extras. Named constructors encode the #63
  codes (invalid_api_key, rate_limit_exceeded, insufficient_quota,
  context_length_exceeded, service_unavailable) and which carry
  Retry-After. cortex-core stays a pure types crate; each HTTP crate
  owns a thin `envelope_response` adapter that sets the header.
- cortex-gateway: route error_response, ProxyError, and RouteError
  through the shared builder; RouteError::retry_after_secs wires
  Retry-After on the transient NoHealthyNodes (5s) / ModelRecovering
  (2s) variants.
- neuron: route inference_error_response through the shared builder;
  InsufficientVram (transient 503) now advertises Retry-After: 5.

Behaviour for existing paths is unchanged (same status/type/code/extras);
only the new Retry-After headers are added. Tests cover the builder wire
shape and Retry-After presence/absence on both sides.
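
A minimal sketch of what a framework-agnostic envelope value with named constructors could look like. Only the struct name, the codes, and the status/type pairings echo this log; the field names, the `headers` helper, and the constructor signatures are assumptions made for illustration, not the crate's actual API.

```rust
/// Hypothetical error value that knows nothing about any HTTP framework.
#[derive(Debug, Clone, PartialEq)]
pub struct OpenAiError {
    pub status: u16,
    /// The envelope's "type" field ("type" is a Rust keyword).
    pub kind: &'static str,
    pub code: Option<&'static str>,
    pub message: String,
    /// Seconds to advertise in Retry-After; None for non-retryable errors.
    pub retry_after_secs: Option<u64>,
}

impl OpenAiError {
    /// Transient condition: carries Retry-After.
    pub fn service_unavailable(message: impl Into<String>, retry_after_secs: u64) -> Self {
        Self {
            status: 503,
            kind: "api_error",
            code: Some("service_unavailable"),
            message: message.into(),
            retry_after_secs: Some(retry_after_secs),
        }
    }

    /// Permanent condition: no Retry-After.
    pub fn model_not_found(message: impl Into<String>) -> Self {
        Self {
            status: 404,
            kind: "invalid_request_error",
            code: Some("model_not_found"),
            message: message.into(),
            retry_after_secs: None,
        }
    }

    /// Header pairs a thin per-crate adapter would copy onto the response.
    pub fn headers(&self) -> Vec<(&'static str, String)> {
        self.retry_after_secs
            .map(|secs| vec![("Retry-After", secs.to_string())])
            .unwrap_or_default()
    }
}
```

Keeping the header decision on the value itself means each HTTP crate's adapter only has to copy `headers()` onto its own response type, so the retryable/non-retryable split is decided in one place.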

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 18:46:56 +03:00
4b28a64b34 feat(#67 phase 5b): enforce the derived input as the prompt cap
All checks were successful
The request path now rejects prompts above the model's self-derived input
budget, not the static NEURON_MAX_PROMPT_TOKENS — so a VRAM-tight host
(where the VRAM ceiling binds below the static cap) rejects an
over-budget prompt up front instead of accepting it and OOMing
mid-prefill.

- derived_input_cap: AtomicUsize on LoadedModel + TpLoadedModel; refreshed
  by LoadedHandle::derived_limit (runs on every /models poll). 0 = not
  derived yet.
- effective_prompt_cap(): cached derived input when >0, else the static
  max_prompt_tokens() (cold-start / no-profile fallback).
- validate_request takes the cap as a param; all 4 call sites
  (chat_completion, inference_stream, inference_tp_stream, TP
  chat_completion) pass the in-scope model's effective_prompt_cap().
- doc/context-limits.md: enforcement note updated from "remaining" to
  landed.

Reads the cap lock-free from the sync validate path (no per-request VRAM
query); the cap tracks live state via the poll-driven derivation. With
this, advertise and enforce agree and both track the resident model.
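
The cached-cap fallback described above can be sketched as follows. The struct `LoadedModelCaps`, `refresh_derived`, and `validate_prompt_len` are hypothetical stand-ins; only the "0 means not derived yet" convention and the fallback to the static cap come from the commit message.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for the cap-related fields of a loaded model.
pub struct LoadedModelCaps {
    /// Derived input budget in tokens; 0 means "not derived yet".
    derived_input_cap: AtomicUsize,
    /// Static fallback (the NEURON_MAX_PROMPT_TOKENS value).
    static_max_prompt_tokens: usize,
}

impl LoadedModelCaps {
    pub fn new(static_max_prompt_tokens: usize) -> Self {
        Self {
            derived_input_cap: AtomicUsize::new(0),
            static_max_prompt_tokens,
        }
    }

    /// Called from the poll-driven derivation with the latest input budget.
    pub fn refresh_derived(&self, input_tokens: usize) {
        self.derived_input_cap.store(input_tokens, Ordering::Relaxed);
    }

    /// Lock-free read used by the synchronous validate path.
    pub fn effective_prompt_cap(&self) -> usize {
        match self.derived_input_cap.load(Ordering::Relaxed) {
            0 => self.static_max_prompt_tokens, // cold start or no profile
            derived => derived,
        }
    }
}

/// Rejects a prompt whose token count exceeds the cap passed in.
pub fn validate_prompt_len(prompt_len: usize, cap: usize) -> Result<(), String> {
    if prompt_len > cap {
        Err(format!("maximum context length is {cap} tokens"))
    } else {
        Ok(())
    }
}
```

Relaxed ordering is enough in this sketch because the cap is a single independent value: a request that reads a slightly stale cap is still validated against a budget that was correct one poll earlier.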

fmt/clippy/test green; CUDA paths type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:26:37 +03:00
dd65eedb24 feat(#67 phase 5a): NEURON_MAX_PROMPT_TOKENS becomes a clamp-only backstop; docs
All checks were successful
Demotes the static per-host prompt cap from authority to an optional
upper-bound clamp on the self-derived limit, and rewrites the
context-limits doc around the computed model.

- max_prompt_tokens_clamp(): reads NEURON_MAX_PROMPT_TOKENS directly so
  "explicitly set" is distinct from the 16384 default; returns None when
  unset (no clamp). Applied as derive_limit's hard_ceiling in
  LoadedHandle::derived_limit, so the advertised context is clamped only
  when an operator set a backstop — the derivation is otherwise
  authoritative and binds below it in practice.
- doc/context-limits.md: intro + "After #62" rewritten as "After #67 —
  the neuron computes its own limit" (formula, live signals, config
  block, opencode note, NEURON_MAX_PROMPT_TOKENS demotion).
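
A sketch of the "explicitly set versus default" distinction, under stated assumptions: `parse_clamp` and `apply_clamp` are hypothetical helpers, and treating an unparsable value as unset is a guess rather than confirmed behaviour.

```rust
use std::env;

/// Parses the raw variable value. None means "no clamp".
/// Assumption: an unparsable value is treated the same as unset.
pub fn parse_clamp(raw: Option<&str>) -> Option<usize> {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
}

/// Reads the variable directly, so "operator set it" stays distinct from
/// any default applied elsewhere in the configuration layer.
pub fn max_prompt_tokens_clamp() -> Option<usize> {
    parse_clamp(env::var("NEURON_MAX_PROMPT_TOKENS").ok().as_deref())
}

/// Applies the optional backstop as an upper bound on the derived value.
pub fn apply_clamp(derived: usize, clamp: Option<usize>) -> usize {
    match clamp {
        Some(ceiling) => derived.min(ceiling),
        None => derived,
    }
}
```

With no variable set, `apply_clamp(derived, None)` returns the derived value unchanged, which matches the "derivation is otherwise authoritative" wording above.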

Remaining (phase 5b, follow-up): enforce the *derived* input as the
prompt cap (reject above computed input, not the static
NEURON_MAX_PROMPT_TOKENS) so VRAM-tight hosts can't accept an
OOM-inducing prompt. Needs a per-model cached cap read from the sync
validate path; scoped separately. Until then the static cap remains the
enforced backstop (advertised <= enforced holds when the env is set).

fmt/clippy/test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:14:34 +03:00
8b2e01a072 feat(#67 phase 4): advertise neuron-computed limit on /models; drop catalogue override
Some checks failed
The neuron now self-derives and advertises limit{context,input,output}
per loaded model; cortex forwards it and stops consulting the
operator-declared catalogue limit (which can't track hot-swapped models
or live capacity). Operator-set `cost` still flows from the catalogue.

neuron:
- CandleHarness gains context_limit_cfg (from [harness.candle.context_limit]).
- LoadedHandle::derived_limit(): profile + live tightest-card free VRAM
  (single: query_vram; TP: query_vram_tightest_free_mb) + prefill-rate
  EMA (bootstrap until first sample) → derive_limit. None for arches
  without a context profile. No operator clamp here (advertise the honest
  derived value; the clamp is an enforcement-side backstop).
- list_models() fills ModelInfo.limit from derived_limit (was None).
- derive_limit treats free_tightest_mb == 0 (unknown/CPU sentinel) as
  "no VRAM ceiling" instead of collapsing to zero.

cortex:
- ModelEntry gains `limit`, copied from ModelInfo.limit by the poller.
- /v1/models: catalogue `limit` no longer flows (Pass 1 sets None);
  Pass 2 adopts the neuron's limit, taking the tightest across neurons
  via tightest_limit(). cost unchanged.
- model_limits.rs rewritten: catalogue limit (999999) is ignored; the
  neuron's ModelEntry.limit is advertised; cost still from catalogue.
- All ModelEntry literals updated with the new field.
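
One way the "tightest across neurons" merge could work, as a sketch. The `Limit` struct and the function signature are assumptions, and so is the reading of "tightest" as "smallest context"; the real `tightest_limit()` may compare fields differently.

```rust
/// Hypothetical shape of the advertised limit{context,input,output}.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Limit {
    pub context: u32,
    pub input: u32,
    pub output: u32,
}

/// Picks the limit with the smallest context among neurons that reported
/// one. Neurons without a derived limit (None) are skipped; the result is
/// None only when no neuron reported a limit at all.
pub fn tightest_limit(limits: impl IntoIterator<Item = Option<Limit>>) -> Option<Limit> {
    limits.into_iter().flatten().min_by_key(|l| l.context)
}
```

Advertising the smallest limit is the conservative choice: a request sized against it fits whichever neuron the gateway ends up routing to.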

fmt/clippy/test green; CUDA paths type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:10:20 +03:00
464b6b0db9 feat(neuron): self-measured prefill tok/s EMA on streaming paths (#67 phase 3)
All checks were successful
Refs #67. Feeds the throughput ceiling a live, per-model prefill rate
instead of only the configured bootstrap estimate, so the advertised
limit tracks real prefill speed and rises automatically as prefix
caching (#11) reduces effective prefill cost.

- context_limit::PrefillRateEma: lock-free f64-bits EMA (alpha 0.3),
  ignores degenerate samples, None before the first sample. Unit-tested.
- prefill_rate field on LoadedModel + TpLoadedModel.
- Recorded as total-prompt-tokens / prefill-elapsed in the two streaming
  serving paths (TP: inference_tp_stream via tp_for_task; single-GPU:
  stream_inference_via_worker via a new &prefill_rate param threaded from
  loaded_for_task). Measuring total prompt (not just the divergent
  suffix) means a prefix-cache hit shrinks elapsed while the prompt stays
  large, so the effective rate — and the ceiling — rises toward the VRAM
  ceiling, exactly the #11 payoff.
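
The lock-free EMA described above can be sketched by storing the `f64` bit pattern in an `AtomicU64`. The type name and alpha come from the commit message; the sentinel choice, the method names, and the exact definition of a "degenerate" sample are assumptions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const ALPHA: f64 = 0.3;
/// Sentinel for "no sample yet": an all-ones NaN bit pattern that a
/// finite rate can never produce.
const UNSET: u64 = u64::MAX;

pub struct PrefillRateEma {
    bits: AtomicU64,
}

impl PrefillRateEma {
    pub fn new() -> Self {
        Self { bits: AtomicU64::new(UNSET) }
    }

    /// Current rate in tokens per second, or None before the first sample.
    pub fn get(&self) -> Option<f64> {
        match self.bits.load(Ordering::Relaxed) {
            UNSET => None,
            b => Some(f64::from_bits(b)),
        }
    }

    /// Folds in one prefill measurement: total prompt tokens over elapsed time.
    pub fn record(&self, prompt_tokens: usize, elapsed_secs: f64) {
        // Ignore degenerate samples (assumed: empty prompt, non-positive or
        // non-finite elapsed time, or a rate that overflows).
        if prompt_tokens == 0 || !elapsed_secs.is_finite() || elapsed_secs <= 0.0 {
            return;
        }
        let sample = prompt_tokens as f64 / elapsed_secs;
        if !sample.is_finite() {
            return;
        }
        // Compare-and-swap loop: retried if another thread updated in between.
        let _ = self.bits.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |old| {
            let next = if old == UNSET {
                sample
            } else {
                ALPHA * sample + (1.0 - ALPHA) * f64::from_bits(old)
            };
            Some(next.to_bits())
        });
    }
}
```

Because the sample divides the whole prompt by elapsed time, a prefix-cache hit (same prompt size, shorter elapsed) shows up as a higher rate, which is the effect the commit message describes.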

Per the agreed scope, non-streaming + CPU paths fall back to the
bootstrap estimate (opencode streams; those paths rarely carry the
fleet). fmt/clippy/test green; CUDA paths type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:02:02 +03:00
f2e05d96ec feat(neuron): capture ContextProfile at load + per-rank VRAM fan-out (#67 phase 2)
All checks were successful
Refs #67. Captures the per-model context physics at load and adds the
live free-VRAM signal the derivation needs — the tightest card across TP
ranks, not just the leader.

- ContextProfile captured at load:
  - single-GPU dense CUDA path (world_size 1) via
    context_limit::profile_from_qwen3_5_config(config_path, ..);
  - TP path (world_size = tp_size) at TpLoadedModel construction.
  GGUF/CPU/non-qwen3_5 → None (fall back to the static prompt cap).
  New `context_profile` field on LoadedModel + TpLoadedModel.
- profile_from_qwen3_5_config(): reads config.json (mirrors
  VisionMeta::from_config_path), counts full_attention layers
  (layer_types authoritative, full_attention_interval fallback), builds
  the per-card KV cost via the shared helper.
- Folded the inline per-rank KV-bytes math in tp_qwen3.rs (both
  cuda/non-cuda log_construction_complete) and tp_qwen3_5.rs onto
  context_limit::kv_bytes_per_token + KV_CACHE_DTYPE_BYTES.
- Per-rank VRAM fan-out (tightest card):
  - WorkerRequest::QueryVram + WorkerResponse::VramInfo { free_mb, total_mb };
  - worker.rs handle_query_vram (cuda: mem_get_info; non-cuda: error);
  - WorkerPool::query_vram_tightest_free_mb fans out to every rank
    (leader via its device worker, subprocess ranks via RPC) → min free;
  - TpLoadedModel::query_vram_tightest_free_mb convenience wrapper.
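
A sketch of the per-card KV cost that the profile capture relies on, using the standard KV-cache size formula. The function signatures, the 2-byte dtype, and the treatment of a missing `full_attention_interval` are assumptions; only the names `kv_bytes_per_token`, `KV_CACHE_DTYPE_BYTES`, `layer_types`, and `full_attention_interval` appear in the commit message.

```rust
/// Assumption: a 16-bit KV cache (2 bytes per element).
pub const KV_CACHE_DTYPE_BYTES: u64 = 2;

/// Counts layers that keep a KV cache. `layer_types` is authoritative when
/// present; otherwise every `full_attention_interval`-th layer is assumed
/// to be full attention. With neither hint, all layers are counted.
pub fn count_full_attention_layers(
    layer_types: Option<&[&str]>,
    num_layers: u64,
    full_attention_interval: Option<u64>,
) -> u64 {
    match (layer_types, full_attention_interval) {
        (Some(types), _) => types.iter().filter(|t| **t == "full_attention").count() as u64,
        (None, Some(interval)) if interval > 0 => num_layers / interval,
        _ => num_layers,
    }
}

/// Bytes of KV cache one token costs on each card: K and V tensors for
/// every full-attention layer, sharded evenly across tensor-parallel ranks.
pub fn kv_bytes_per_token(
    full_attention_layers: u64,
    num_kv_heads: u64,
    head_dim: u64,
    world_size: u64,
) -> u64 {
    let total = 2 * full_attention_layers * num_kv_heads * head_dim * KV_CACHE_DTYPE_BYTES;
    total / world_size.max(1)
}
```

For example, 16 full-attention layers with 4 KV heads of dimension 256 cost 65,536 bytes per token in total, or 32,768 bytes per card when sharded across two ranks.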

No advertise/enforce yet (phases 4/5). fmt/clippy/test green; CUDA paths
type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 13:18:27 +03:00
4f05a87449 feat(neuron): self-derived context-limit core — physics + policy (#67 phase 1)
All checks were successful
Refs #67. The correct limit{context,input,output} for a deployment is a
computed function of model architecture + live free VRAM + a
coherence/throughput trade-off, not an operator-declared static fact that
goes stale on model swap. This lands the arch-agnostic derivation core;
later phases capture per-model physics at load, measure throughput, and
advertise/enforce the computed limit.

- crates/neuron/src/harness/context_limit.rs (new):
  - kv_bytes_per_token(): shared per-card KV cost (counts only
    full-attention layers; sharded by TP world size). The TP load paths'
    inline math folds onto this in phase 2.
  - ContextProfile: per-model physics snapshot (max_position_embeddings,
    kv_bytes_per_token_per_card, world_size).
  - derive_limit(): context = min(max_pos, vram_ceiling,
    throughput_ceiling) clamped by an optional backstop; input = context −
    output; rounded to 1024. 6 unit tests.
- config.rs: [harness.candle.context_limit] block (mirrors prefix_cache):
  target_prefill_latency_secs, bootstrap_prefill_tok_per_sec,
  activation_headroom_mb, min_free_floor_mb, output_reserve_tokens.
- neuron.example.toml: documented the new block.
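
The derivation above can be sketched as follows. The struct layouts and parameter list are hypothetical, and several details are assumptions rather than confirmed behaviour: the throughput ceiling is taken as rate times target latency, rounding is downward to a multiple of 1024, and `min_free_floor_mb` is omitted.

```rust
pub struct ContextProfile {
    pub max_position_embeddings: u64,
    pub kv_bytes_per_token_per_card: u64,
}

pub struct LimitConfig {
    pub target_prefill_latency_secs: f64,
    pub activation_headroom_mb: u64,
    pub output_reserve_tokens: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Limit {
    pub context: u64,
    pub input: u64,
    pub output: u64,
}

pub fn derive_limit(
    profile: &ContextProfile,
    cfg: &LimitConfig,
    free_tightest_mb: u64,
    prefill_tok_per_sec: f64,
    hard_ceiling: Option<u64>,
) -> Limit {
    // VRAM ceiling: tokens whose KV cache fits in free memory minus headroom.
    let usable_bytes = free_tightest_mb.saturating_sub(cfg.activation_headroom_mb) * 1024 * 1024;
    let vram_ceiling = usable_bytes / profile.kv_bytes_per_token_per_card.max(1);
    // Throughput ceiling: tokens that can be prefilled within the target latency.
    let throughput_ceiling = (prefill_tok_per_sec * cfg.target_prefill_latency_secs) as u64;

    let mut context = profile
        .max_position_embeddings
        .min(vram_ceiling)
        .min(throughput_ceiling);
    if let Some(ceiling) = hard_ceiling {
        context = context.min(ceiling); // optional operator backstop
    }
    context = context / 1024 * 1024; // round down to a multiple of 1024

    let output = cfg.output_reserve_tokens.min(context);
    Limit { context, input: context - output, output }
}
```

With 10,240 MB free, 2,048 MB headroom, and 32,768 KV bytes per token, the VRAM ceiling is 262,144 tokens; at 2,000 tok/s and a 30 s target the throughput ceiling of 60,000 binds instead, giving a context of 59,392 after rounding and an input of 55,296 once 4,096 output tokens are reserved.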

No runtime behaviour change yet. fmt/clippy/test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 13:00:52 +03:00
2f67d17ec7 feat(neuron): emit reasoning_tokens usage details on streaming
All checks were successful
Closes #64.

opencode meters reasoning tokens separately via the OpenAI-standard
detail objects, which neuron's usage structs didn't expose. Add them
additively so older clients ignore them.

- cortex-core: Usage gains completion_tokens_details/prompt_tokens_details;
  ResponsesUsage gains output_tokens_details/input_tokens_details. Optional
  + skip_serializing_if, so the wire shape is unchanged for non-reasoning
  models. cached_tokens fields are defined but always None until prompt
  caching lands (#11).
- candle.rs: count tokens generated while in_reasoning across all three
  streaming paths (TP, worker, CPU); carry the count on InferenceEvent::Finish.
- chat projector: populate completion_tokens_details.reasoning_tokens.
- responses projector: wire up base usage emission on the streaming path
  (it emitted none before) and add output_tokens_details.reasoning_tokens.
- non-streaming paths leave details None (they don't track in_reasoning).

reasoning_tokens is a sub-count of completion/output tokens (OpenAI
semantics) — not added into total_tokens.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 12:04:05 +03:00
11b2e6f78c fix(cortex): default models_config to the packaged absolute path
All checks were successful
cortex resolved the catalogue path "models.toml" relative to the service's
working directory, so the systemd-launched binary never found
/etc/cortex/models.toml and ran with an EMPTY catalogue in production —
limits, cost, pinning, aliases and feasibility were all silent no-ops,
with models surfacing only via the neuron poller. Tests never caught it
because they pass models_config explicitly; only the defaulted,
packaged path was broken.

Default to the absolute /etc/cortex/models.toml (where cortex.spec installs
it) and document the override in cortex.example.toml. Restores the #62
limit/cost advertisement (the catalogue is now actually read) along with
pinning/aliases/feasibility.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 10:04:29 +03:00
8a636c687f feat(cortex): per-model limit + cost on /v1/models; remove max_model_len
All checks were successful
Resolves #62. opencode's helexa provider discovers a model's serving
budget from /v1/models and uses it to size context, trigger compaction,
and show spend with no hand-configuration. Each model entry now carries:

  - limit { context, input?, output }  — operator-declared in models.toml
  - cost  { input, output, cache_read?, cache_write? }  — USD per 1M tokens
  - tool_call / reasoning  — runtime-detected by the candle harness and
    OR-ed in from each serving neuron

Composition: the catalogue profile supplies limit/cost (Pass 1); the
poller carries the neuron's detected tool_call/reasoning into ModelEntry,
which the gateway unions onto the entry (Pass 2); aliases propagate every
field (Pass 4). Wire types extend ModelInfo / ModelProfile /
CortexModelEntry additively (serde default + skip_serializing_if), so
older neurons and clients are unaffected. helexa-bench's ModelInfo
constructor and the gateway test fixtures are updated for the new fields.
Adds tests/model_limits.rs asserting /v1/models surfaces limit + cost
(catalogue) and tool_call + reasoning (runtime), and that max_model_len
is gone.

Removes max_model_len. It was write-only with no consumer — opencode's
source references it nowhere and it is not an OpenAI /v1/models field —
and doubly misleading: vLLM's max_model_len means total sequence length,
but cortex populated it from NEURON_MAX_PROMPT_TOKENS, a prompt-only cap.
The limit{} contract replaces it. The neuron's max_prompt_tokens remains
the enforced prompt cap (neuron-side); cortex just stops re-advertising a
derived, mis-named copy. Closes #66 — its stale-max_model_len premise is
moot once the field is gone.

limit/cost are operator-declared (catalogue) per #62's design; auto-
deriving the advertised budget from each neuron's reported cap is a
tracked follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 09:26:55 +03:00
6088830e7d feat(deploy): manage NEURON_MAX_PROMPT_TOKENS per host via model.conf drop-in
All checks were successful
Roll the per-model context cap into deploy.yml so it is deterministic per
host and rolled out (with a restart) alongside the rest of the service
config, rather than hand-edited in local.conf. The deploy now writes
/etc/systemd/system/neuron.service.d/model.conf from a new per-host
`max_prompt_tokens` matrix field, and restarts a neuron when the package
OR the drop-in changes — so a cap change applies even with no new RPM.

beast (Qwen3.6-27B, hybrid linear, 2x 32GB) -> 131072 (~128k); benjy and
quadbrat (dense, VRAM-bound) stay at 16384 but become deploy-managed.

Adds the scoped sudoers grant for the root-owned drop-in install, and
doc/context-limits.md documenting the knob relationships and KV/VRAM math
(refs #62 for the eventual /models-advertised source of truth, #65 for
the length-aware text VRAM guard that gates pushing beyond 128k).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 18:48:19 +03:00
04f798ec23 feat(cortex-gateway): enhance error responses with structured data
All checks were successful
fixes #63
Standardize error messages by adding type, code, and param fields to
align with OpenAI API format. Updates include:
- Structured error envelopes with broad type categorization
  (invalid_request_error/api_error)
- Specific machine-readable codes (model_not_found/service_unavailable)
- Null param field as required by OpenAI specification
- Consistent error response formatting across handlers, proxy, and
  routing layers

New tests verify correct error envelope structure for various failure
scenarios.

Co-Authored-By: Helexa (Qwen3.6-27B, 48k context) <noreply@helexa.ai>
2026-06-16 17:51:04 +03:00
6f3e9276cd docs: add AGENTS.md with project architecture, build commands, and conventions
All checks were successful
2026-06-16 14:15:32 +03:00
8f9e956d17 fix(neuron): emit OpenAI-standard nested error envelopes (#60)
All checks were successful
InferenceError responses were a flat `{"error": "..."}` string. OpenAI
clients (opencode, the openai SDK) reach into `error.type`/`error.code`
to drive behaviour — most importantly `code == "context_length_exceeded"`
triggers auto-compaction + retry instead of a hard failure. A flat string
is invisible to that logic.

Rewrite `inference_error_response` to emit the nested envelope
`{"error": {"message","type","code","param", ...diagnostics}}` and map:

- ModelNotLoaded   → 404 invalid_request_error / model_not_found
- PromptTooLong    → 400 invalid_request_error / context_length_exceeded
  (message: "maximum context length is N tokens", + prompt_len/max)
- InsufficientVram → 503 api_error / insufficient_vram
- VisionUnsupported→ 400 invalid_request_error / vision_unsupported
- TemplateRenderFailed → 422 invalid_request_error / template_render_failed
- Other            → 500 api_error / null code

Diagnostic extras ride inside the error object so the envelope shape is
stable. Both inline match blocks in the chat-completions handler
(streaming + non-streaming) now defer to the shared helper, which the
responses handler already used — one source of truth.
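
The mapping listed above can be expressed as a single match, sketched here. The variant names, statuses, types, and codes are taken from the list; the enum's field layout and the `classify`/`message` helper names are assumptions.

```rust
/// Hypothetical mirror of the error variants named in the mapping.
#[derive(Debug)]
pub enum InferenceError {
    ModelNotLoaded,
    PromptTooLong { prompt_len: usize, max: usize },
    InsufficientVram,
    VisionUnsupported,
    TemplateRenderFailed,
    Other(String),
}

/// Returns (HTTP status, envelope type, envelope code) for an error.
pub fn classify(err: &InferenceError) -> (u16, &'static str, Option<&'static str>) {
    use InferenceError::*;
    match err {
        ModelNotLoaded => (404, "invalid_request_error", Some("model_not_found")),
        PromptTooLong { .. } => (400, "invalid_request_error", Some("context_length_exceeded")),
        InsufficientVram => (503, "api_error", Some("insufficient_vram")),
        VisionUnsupported => (400, "invalid_request_error", Some("vision_unsupported")),
        TemplateRenderFailed => (422, "invalid_request_error", Some("template_render_failed")),
        Other(_) => (500, "api_error", None), // serialized as a null code
    }
}

/// Human-readable message; only PromptTooLong has wording fixed by the log.
pub fn message(err: &InferenceError) -> String {
    match err {
        InferenceError::PromptTooLong { max, .. } => {
            format!("maximum context length is {max} tokens")
        }
        other => format!("{other:?}"),
    }
}
```

Centralizing the triple in one function is what lets the streaming and non-streaming handlers agree: both call the same helper instead of repeating the match inline.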

Adds 4 unit tests covering the envelope shape and codes. Also fixes a
pre-existing clippy lint (cloned_ref_to_slice_refs) in qwen3_5 snapshot
test surfaced by a newer clippy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 20:42:14 +03:00
cb758d4706 feat(neuron): emit usage on the streaming path so clients can track context
All checks were successful
The deeper reason opencode showed "Context: 0 tokens / 0% used" and flew
into a 400: streaming responses carried NO `usage`. Clients track context
(and trigger compaction) from the `usage` field; the legacy candle
streaming path set `usage: None` on every chunk, so a streaming client
had no token count at all — `max_model_len` alone is a denominator with
no numerator.

InferenceEvent::Finish now carries prompt_tokens + completion_tokens
(the streaming loops already have both: prompt_tokens.len() and the
generated all_tokens.len()). The openai_chat projector emits an
OpenAI-style trailing usage chunk (empty `choices`, populated `usage`)
after the finish chunk. cortex's Anthropic stream translator already
reads chunk.usage, so this fixes context tracking on BOTH the OpenAI
(opencode) and Anthropic (Claude Code) paths.
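
A minimal sketch of the trailing usage chunk described above, assuming
hand-built JSON for brevity. `Usage`, `trailing_usage_chunk`, and the
`id`/`model` parameters are illustrative names, not the projector's
actual API:

```rust
/// Token counts carried by the final streaming event. Field names follow
/// the OpenAI `usage` object.
pub struct Usage {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
}

/// Build the trailing SSE event: empty `choices`, populated `usage`.
/// Assumption: `id` and `model` need no JSON escaping in this sketch.
pub fn trailing_usage_chunk(id: &str, model: &str, usage: &Usage) -> String {
    let total = usage.prompt_tokens + usage.completion_tokens;
    format!(
        "data: {{\"id\":\"{id}\",\"object\":\"chat.completion.chunk\",\"model\":\"{model}\",\"choices\":[],\"usage\":{{\"prompt_tokens\":{p},\"completion_tokens\":{c},\"total_tokens\":{total}}}}}\n\n",
        p = usage.prompt_tokens,
        c = usage.completion_tokens,
    )
}
```

A client that tracks context would read `usage` from this last event and
ignore its empty `choices`.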

Also harden a sibling of the max_model_len plumbing: cortex re-polls
/discovery while a neuron's max_prompt_tokens is still 0 (unknown). A
rolling-deploy race where cortex caches discovery before the neuron has
the field now self-heals instead of pinning max_model_len to None until
a manual cortex restart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 19:43:59 +03:00
a2d2dbd006 feat: advertise max_model_len on /v1/models so clients can compact
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m15s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Build neuron-ada (push) Successful in 2m2s
build-prerelease / Build helexa-bench binary (push) Successful in 2m0s
build-prerelease / Build cortex binary (push) Successful in 2m26s
build-prerelease / Build neuron-ampere (push) Successful in 2m55s
build-prerelease / Test (push) Successful in 4m28s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m22s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m37s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 54s
opencode (and any OpenAI/Anthropic client) couldn't size or compact its
context against helexa because /v1/models never advertised a context
window — opencode showed "0 tokens / 0% used" and flew straight into a
400 PromptTooLong once a conversation + a fetched 64KB log overflowed the
49152-token cap. Compaction is the client's job, but the client needs to
know the limit to do it.

neuron now reports its effective prompt cap (NEURON_MAX_PROMPT_TOKENS)
in GET /discovery (`max_prompt_tokens`). cortex surfaces it on
/v1/models as `max_model_len` (vLLM / OpenAI-compatible convention) per
model — the smallest cap among the neurons that can serve it
(feasible_on ∪ locations), so the advertised limit holds wherever the
request routes. A neuron reporting 0 predates the field and is treated
as unknown (skipped); models with no reporting neuron omit the field.
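
The selection rule above can be sketched as follows. The function name
and the flat slice of per-neuron caps are assumptions for illustration;
the real code works over cortex's discovery cache:

```rust
/// Pick the advertised `max_model_len` for one model from the
/// `max_prompt_tokens` values of every neuron that can serve it.
/// A cap of 0 means "neuron predates the field" and is skipped; if no
/// neuron reports a cap, the result is `None` and the field is omitted.
pub fn advertised_max_model_len(neuron_caps: &[u32]) -> Option<u32> {
    neuron_caps.iter().copied().filter(|&cap| cap > 0).min()
}
```

Taking the minimum keeps the advertised limit valid on whichever neuron
the request is routed to.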

helexa still rejects over-limit prompts with a clean 400 — this just
gives clients the number to compact *before* hitting it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 19:11:13 +03:00
544214d0f8 fix(neuron): normalize OpenAI string tool-call arguments before rendering
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m16s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 1m39s
build-prerelease / Build neuron-ada (push) Successful in 2m5s
build-prerelease / Build neuron-ampere (push) Successful in 2m48s
build-prerelease / Test (push) Successful in 4m35s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m38s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
opencode (OpenAI path, /v1/chat/completions passthrough) hit the same
chat_template:120 failure Claude Code did — "cannot convert value into
pairs" — because the OpenAI wire format carries
tool_calls[].function.arguments as a JSON *string*, while Qwen3.6's
template iterates it as a dict (`arguments | items`). The Anthropic-side
fix (8880b2f) only covered cortex's translation; the OpenAI path reaches
neuron unchanged.

render_chat_template now normalizes string-form tool-call arguments to
objects across all messages before building the Jinja context, so OpenAI
and Anthropic clients both render. Object args (Anthropic path) pass
through untouched; a string that doesn't parse is left as-is and the
render fails loudly (422 TemplateRenderFailed, a94dd55) rather than
silently dropping tools.

The loud-fail change paid off immediately here: opencode got a clean
422 with the exact `chat_template:120` cause instead of a degraded
session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 18:13:36 +03:00
a94dd55ab8 feat(neuron): fail loud (422) when a tools-bearing request can't render
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m18s
build-prerelease / Test (push) Successful in 4m12s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Build neuron-ada (push) Successful in 2m10s
build-prerelease / Build neuron-ampere (push) Successful in 2m49s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 58s
Three of this session's bugs (system-message position, tool_call argument
shape, and the original tool rendering) all hid behind the same silent
behaviour: chat_template render fails → neuron falls back to
format_qwen3_prompt, which drops every tool → the request still returns
200 with degraded, tool-less output. Each cost real debugging time
because the failure was invisible on the wire.

build_prompt_for_request now returns Result. On a render failure it
checks whether the request carried tools: if so it returns the new
InferenceError::TemplateRenderFailed (mapped to 422 with a
template_render_failed code and the underlying Jinja error), instead of
silently degrading. A render failure with no tools still falls back
quietly — there's nothing to lose, and `format_qwen3_prompt` is a
reasonable text-only prompt. The four prompt-build call sites propagate
with `?`.
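
A simplified sketch of that decision, assuming the render result and the
fallback are passed in. `build_prompt` and `status_code` are hypothetical
stand-ins; only `InferenceError::TemplateRenderFailed` and the 422 mapping
come from the message above:

```rust
#[derive(Debug, PartialEq)]
pub enum InferenceError {
    /// Carries the underlying template error text.
    TemplateRenderFailed(String),
}

/// Fail loudly when a tools-bearing request cannot render; otherwise
/// fall back quietly to a text-only prompt.
pub fn build_prompt(
    rendered: Result<String, String>,
    has_tools: bool,
    fallback: impl FnOnce() -> String,
) -> Result<String, InferenceError> {
    match rendered {
        Ok(prompt) => Ok(prompt),
        // Tools would be silently dropped by the fallback: surface it.
        Err(cause) if has_tools => Err(InferenceError::TemplateRenderFailed(cause)),
        // No tools to lose, so the text-only fallback is acceptable.
        Err(_) => Ok(fallback()),
    }
}

/// HTTP status for the error, as described in the commit message.
pub fn status_code(err: &InferenceError) -> u16 {
    match err {
        InferenceError::TemplateRenderFailed(_) => 422,
    }
}
```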

Now the next client/template incompatibility surfaces as a loud 422 the
operator sees immediately, not a mysteriously-degraded session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 17:48:31 +03:00
8880b2f8a6 fix(cortex): emit tool_call arguments as an object so Qwen3.6 can chain tools
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Build helexa-bench binary (push) Successful in 2m14s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m27s
build-prerelease / Build cortex binary (push) Successful in 2m37s
build-prerelease / Test (push) Successful in 4m32s
build-prerelease / Build neuron-blackwell (push) Successful in 1m41s
build-prerelease / Build neuron-ada (push) Successful in 2m5s
build-prerelease / Build neuron-ampere (push) Successful in 2m50s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m23s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
Verified live via the rendered-prompt trace: once a tool call is in the
conversation history, the Qwen3.6 chat template fails to render —

  render chat_template: invalid operation: cannot convert value into
  pairs (in chat_template:120)

because line 120 iterates `tool_call.arguments | items` (treats arguments
as a dict), while cortex emitted the OpenAI-standard JSON *string*. On
that render error neuron silently falls back to a tool-less prompt, so
the model loses every tool the moment it makes one call — it can make the
first tool call, read the result, then can only narrate ("now let me
check the runs") and stop, because the next turn has no tools. That's the
"drops the ball a little later" symptom: the CC trace shows the get_me
turn rendering 42653 tokens (tools present) and every subsequent
tool-history turn falling back to ~6k tokens (tools gone).

anthropic_to_openai now passes `function.arguments` as the parsed object
rather than stringifying it. Tests updated to expect the object form.

This is the same silent-fallback failure class as the system-message
merge (295b10c) — which is why making neuron's template-render fallback
LOUD (4xx on a tools-bearing request instead of a degraded 200) is now
clearly worth doing: it would have surfaced both in seconds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 16:43:17 +03:00
4e8f4e0d04 fix(neuron): don't generate <think> reasoning when the client drops it
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 1m50s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 2m36s
build-prerelease / Build neuron-ada (push) Successful in 2m37s
build-prerelease / Test (push) Successful in 4m15s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m37s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 50s
Verified live: Qwen/Qwen3.6-27B with a simple prompt and max_tokens=400
generated 400 tokens, finish_reason=length, and 0 visible characters —
the model spent the ENTIRE budget on <think> reasoning, which we then
drop for OpenAI/Anthropic clients (include_thinking=false), starving the
visible answer. This is why Claude Code "dropped the ball": empty or
truncated responses. A/B confirms the cause — same prompt with
chat_template_kwargs.enable_thinking=false yields a full 545-char answer.

The earlier prompt_opens_reasoning fix stopped the reasoning *leaking* as
text but left it consuming the token budget. Couple the two: when the
caller isn't going to see the reasoning (include_thinking=false, the
default), default chat_template_kwargs.enable_thinking to false so the
model doesn't generate it. An explicit client enable_thinking wins;
thinking-aware clients (helexa-acp, x-include-thinking: true) keep
reasoning on. Tests cover the default (false), surfacing (true), explicit
override, and preservation of other kwargs.
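
The defaulting rule can be sketched as a pure function. The name and
signature are assumptions; `None` here stands for "leave the kwarg unset
and let the template decide":

```rust
/// Decide the value to place in `chat_template_kwargs.enable_thinking`.
/// - An explicit client value always wins.
/// - Otherwise, if the caller will not see reasoning, default to `false`.
/// - Otherwise leave it unset (`None`) so reasoning stays on.
pub fn default_enable_thinking(explicit: Option<bool>, include_thinking: bool) -> Option<bool> {
    match explicit {
        Some(value) => Some(value),
        None if !include_thinking => Some(false),
        None => None,
    }
}
```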

Note: this covers only the /v1/chat/completions path (what Claude Code
uses via cortex /v1/messages); /v1/responses could get the same
defaulting as a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 15:00:50 +03:00
295b10c103 fix(cortex): merge all system content into one leading system message
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m49s
build-prerelease / Build helexa-bench binary (push) Successful in 2m14s
build-prerelease / Build cortex binary (push) Successful in 2m54s
build-prerelease / Test (push) Successful in 5m21s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ada (push) Successful in 2m3s
build-prerelease / Build neuron-ampere (push) Successful in 2m52s
build-prerelease / Package cortex RPM (push) Successful in 1m34s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m38s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 54s
Verified live via neuron trace: Claude Code's real requests carry a
top-level `system` AND a `role:"system"` turn inside `messages`. cortex
passed the latter through at a non-first position, and Qwen3.6's chat
template hard-rejects it:

  WARN chat_template render failed; falling back to format_qwen3_prompt
  error=... invalid operation: System message must be at the beginning.

On that render error neuron silently falls back to a template that
renders NO tools, so the model got zero tool-format guidance and
improvised an unparseable `<tool><name>…` syntax — tool calling broke
entirely for real CC traffic, even though synthetic single-system
probes (and the earlier translation/parse fixes) worked.

anthropic_to_openai now accumulates the top-level `system` plus every
`role:"system"` conversation turn and emits a single system message at
index 0, with the non-system turns following in order. Reproduced the
trigger (system-role message at index>0 → fallback) and the fix
(merged → template renders tools). Test covers the merge + ordering.
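
A sketch of the merge, assuming plain-text message content and a blank
line as the separator between system parts (the actual joiner is not
stated above). `Message` and `merge_system` are illustrative names:

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Collect the top-level `system` plus every `role:"system"` turn into a
/// single system message at index 0; other turns keep their order.
pub fn merge_system(top_level: Option<&str>, turns: &[Message]) -> Vec<Message> {
    let mut system_parts: Vec<&str> = Vec::new();
    if let Some(text) = top_level {
        system_parts.push(text);
    }
    let mut rest = Vec::new();
    for turn in turns {
        if turn.role == "system" {
            system_parts.push(&turn.content);
        } else {
            rest.push(turn.clone());
        }
    }
    let mut out = Vec::with_capacity(rest.len() + 1);
    if !system_parts.is_empty() {
        out.push(Message {
            role: "system".to_string(),
            // Assumed separator; any unambiguous joiner would do.
            content: system_parts.join("\n\n"),
        });
    }
    out.extend(rest);
    out
}
```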

Secondary hardening worth a follow-up: neuron's silent template
fallback drops tools without surfacing it to the client — a render
failure on a tools-bearing request should arguably 4xx rather than
degrade invisibly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 14:09:08 +03:00
1c485aedce feat(neuron): trace the fully rendered chat-template prompt
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 27s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 1m31s
build-prerelease / Build neuron-ampere (push) Successful in 2m13s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m45s
build-prerelease / Build neuron-ada (push) Successful in 3m31s
build-prerelease / Test (push) Successful in 4m39s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m49s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
Debugging tool-call format drift (Qwen3.6-27B emitting wrapper-less
<tool><name>…> under Claude Code's real system prompt + 120-tool list,
which neuron's <tool_call> detector can't parse) needs ground truth on
what the model actually sees. neuron logged nothing about the rendered
prompt. Add a trace! in build_prompt_for_request emitting the full
rendered prompt + char count + tool count, so we can see whether the
chat template's <tool_call> format instruction survives a large system
prompt and how the tools render. Gated at trace (the prompt can be tens
of KB): RUST_LOG=neuron::harness::candle=trace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:38:51 +03:00
b3dc835375 ci: bound job runtime + stop dropping sccache on rustc signal-death
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m23s
build-prerelease / Build cortex binary (push) Successful in 2m29s
build-prerelease / Build helexa-bench binary (push) Successful in 2m34s
build-prerelease / Test (push) Successful in 4m33s
build-prerelease / Build neuron-blackwell (push) Successful in 1m31s
build-prerelease / Build neuron-ada (push) Successful in 2m13s
build-prerelease / Build neuron-ampere (push) Successful in 2m50s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m38s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m42s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 55s
A neuron-blackwell build hung ~90 min (siblings finished in 2) and there
was no job timeout to kill it, so it sat burning a runner. Root cause of
the hang: the inline retry loop treated every failure identically and, on
its final attempt, rebuilt with sccache disabled. When the real failure
is a rustc SIGSEGV or an OOM-kill, an uncached rebuild does *more* work
under the same memory pressure — turning one transient compiler crash
into a wedged job.

Two fixes:

1. timeout-minutes on every job in build-prerelease.yml and ci.yml
   (builds 25, neuron CUDA build/cuda-check 35, packaging 20, COPR 60,
   fast jobs 10-15). A hang now dies in minutes, not hours.

2. New script/ci-cargo-escalate.sh replaces the five (prerelease) + three
   (ci) inline escalation loops. It classifies the failure:
     - signal death (exit >=128, or cargo reporting `signal: N`/SIGSEGV/
       SIGKILL) → compiler crash, NOT an sccache fault: keep the cache,
       one warm retry, then fail fast. Never escalate to uncached.
     - sccache fault (recognisable sccache error) → restart the server,
       retry, then one final uncached attempt.
     - deterministic compile/test error → fail fast (no wasteful retry).
   It also folds in the CUDA-image sccache probe the neuron/cuda-check
   jobs did inline. Classification verified locally against success,
   plain failure, exit-139, and the cargo-wrapped `signal: 11` form.
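
The classification in fix 2 can be restated as a small decision function.
This is a Rust sketch of the logic only, not the shell script itself;
`SCCACHE_MARKER` is a hypothetical stand-in for whatever error text the
script actually matches:

```rust
#[derive(Debug, PartialEq)]
pub enum Failure {
    /// rustc killed by a signal (SIGSEGV, OOM-kill): keep the cache.
    SignalDeath,
    /// The cache layer itself misbehaved.
    SccacheFault,
    /// Ordinary compile or test error: retrying cannot help.
    Deterministic,
}

/// Hypothetical marker; the real script's pattern is not shown above.
const SCCACHE_MARKER: &str = "sccache: error";

/// Classify a finished cargo invocation. Returns `None` on success.
pub fn classify(exit_code: i32, output: &str) -> Option<Failure> {
    if exit_code == 0 {
        return None;
    }
    let signalled = exit_code >= 128
        || output.contains("signal: ")
        || output.contains("SIGSEGV")
        || output.contains("SIGKILL");
    if signalled {
        return Some(Failure::SignalDeath);
    }
    if output.contains(SCCACHE_MARKER) {
        return Some(Failure::SccacheFault);
    }
    Some(Failure::Deterministic)
}

/// Only an sccache fault may escalate to a final uncached attempt.
pub fn allows_uncached_attempt(failure: &Failure) -> bool {
    matches!(failure, Failure::SccacheFault)
}
```

Checking for signal death first is the point of the change: a crash under
memory pressure never reaches the uncached path.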

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:02:50 +03:00
746d84c0fb fix(neuron): seed in_reasoning from the prompt so Qwen3.6 thinking isn't leaked
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 2m3s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m18s
build-prerelease / Build neuron-ada (push) Successful in 2m15s
build-prerelease / Test (push) Successful in 4m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Qwen3.6's chat template injects the opening <think> into the generation
prompt, so generation begins mid-thought and the open marker is never
sampled. The streaming loops flipped in_reasoning to true only on a
*generated* open token, so they stayed in text mode and streamed the
model's reasoning out as visible text — verified live: a tool request
returned a 255-char text block of chain-of-thought ("The user wants to
know the weather… I will construct the function call now.") ahead of the
tool_use block, with the trailing </think> stripped (close token
recognised) but no opening <think>.

Each streaming loop now seeds in_reasoning by replaying the prompt's
reasoning markers (new `prompt_opens_reasoning`): if the prompt ends
inside an open <think>, the loop starts in reasoning mode, the thinking
routes to ReasoningDelta (dropped by the chat projector's default
include_thinking=false, which is what cortex uses), and the model's
</think> flips back to visible text for the answer/tool call. Template-
agnostic and self-correcting: a prompt that doesn't open reasoning (no
think injection, enable_thinking off, non-reasoning model) starts false,
preserving current behaviour. Thinking is hidden, not disabled, so answer
quality is unaffected.

Applied to all three streaming loops (inference_tp_stream,
stream_inference_via_worker, run_inference_streaming). Test covers
open/close replay, multi-turn closed state, reopen-at-tail, and the
no-pair pass-through.
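
A minimal sketch of the marker replay, assuming literal `<think>` and
`</think>` markers in the rendered prompt text. The real
`prompt_opens_reasoning` may work on tokens and take the marker pair as
input; this signature is an assumption:

```rust
const OPEN: &str = "<think>";
const CLOSE: &str = "</think>";

/// Replay the reasoning markers in `prompt` and report whether it ends
/// inside an open `<think>` block.
pub fn prompt_opens_reasoning(prompt: &str) -> bool {
    let mut in_reasoning = false;
    let mut rest = prompt;
    loop {
        // While outside reasoning look for an opener, inside for a closer.
        let marker = if in_reasoning { CLOSE } else { OPEN };
        match rest.find(marker) {
            Some(index) => {
                in_reasoning = !in_reasoning;
                rest = &rest[index + marker.len()..];
            }
            None => return in_reasoning,
        }
    }
}
```

A streaming loop would then initialise its `in_reasoning` flag from this
result instead of from `false`.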

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 11:03:26 +03:00
f15b9e2848 fix(neuron): parse Qwen-XML tool calls + emit tool_use stop_reason
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m16s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 2m2s
build-prerelease / Build neuron-ada (push) Successful in 2m7s
build-prerelease / Build neuron-ampere (push) Successful in 2m16s
build-prerelease / Test (push) Successful in 4m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m34s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m37s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
Verified live (commit d662fa2 logs): cortex now delivers OpenAI-shaped
tools to neuron correctly, but Qwen3.6-27B emits tool calls in the
Qwen-XML form inside the <tool_call> markers —

    <tool_call>
    <function=get_weather>
    <parameter=city>
    Brno
    </parameter>
    </function>
    </tool_call>

— while parse_tool_call_body only did serde_json::from_str expecting
{"name":…,"arguments":…}. It returned None, the dispatch re-emitted the
raw block as a text delta, and clients saw the markup as prose. cortex
logged upstream_tool_calls=false finish_reason="stop".

parse_tool_call_body is now format-tolerant: JSON first (Qwen3-Instruct
/ Hermes), then a Qwen-XML parser (Qwen3-Coder / Qwen3.6). Each
<parameter> value is coerced to its declared JSON type using a new
ToolSchemas map built from the request's tools (string stays string,
integer/number/boolean/object/array coerced, mistyped values fall back
to string so an argument is never dropped). build_tool_schemas is
threaded into all three streaming loops (inference_tp_stream,
stream_inference_via_worker, run_inference_streaming).
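
A sketch of the Qwen-XML branch only, using plain string scanning. It
returns raw string values and leaves out the JSON-first attempt and the
schema-driven coercion; `parse_qwen_xml` is an illustrative name, not the
actual helper:

```rust
const FUNCTION_OPEN: &str = "<function=";
const PARAM_OPEN: &str = "<parameter=";
const PARAM_CLOSE: &str = "</parameter>";

/// Parse the body between `<tool_call>` markers in the Qwen-XML form.
/// Returns the function name and its parameters as (key, raw value)
/// pairs, or `None` if the body is not in this form.
pub fn parse_qwen_xml(body: &str) -> Option<(String, Vec<(String, String)>)> {
    let name_start = body.find(FUNCTION_OPEN)? + FUNCTION_OPEN.len();
    let name_end = name_start + body[name_start..].find('>')?;
    let name = body[name_start..name_end].trim().to_string();
    if name.is_empty() {
        return None;
    }
    let mut params = Vec::new();
    let mut rest = &body[name_end + 1..];
    while let Some(open) = rest.find(PARAM_OPEN) {
        let key_start = open + PARAM_OPEN.len();
        let key_end = key_start + rest[key_start..].find('>')?;
        let value_start = key_end + 1;
        let value_end = value_start + rest[value_start..].find(PARAM_CLOSE)?;
        params.push((
            rest[key_start..key_end].trim().to_string(),
            // Values sit on their own lines, so trim the newlines.
            rest[value_start..value_end].trim().to_string(),
        ));
        rest = &rest[value_end + PARAM_CLOSE.len()..];
    }
    Some((name, params))
}
```

Coercion to the declared JSON type would run over each raw value after
this step, falling back to the string when the value does not fit.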

Each loop also tracks emitted_tool_call and promotes the terminal
finish_reason from Stop to ToolCalls when a call parsed, so the OpenAI
chunk carries finish_reason:"tool_calls" and cortex maps it to Anthropic
stop_reason:"tool_use" — without which an Anthropic agent (Claude Code)
sees a tool_use block but stop_reason:end_turn and may not run the tool.
FinishReason::ToolCalls drops its dead_code allow.
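
The promotion and the downstream mapping can be sketched as below. The
`Length` variant and the `max_tokens` mapping are assumptions added for
completeness; only Stop, ToolCalls, `tool_use`, and `end_turn` appear in
the message above:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FinishReason {
    Stop,
    Length,
    ToolCalls,
}

/// Promote a plain stop to `ToolCalls` when the loop parsed a tool call.
/// Other reasons (e.g. hitting the token limit) are left untouched.
pub fn terminal_finish_reason(base: FinishReason, emitted_tool_call: bool) -> FinishReason {
    match (base, emitted_tool_call) {
        (FinishReason::Stop, true) => FinishReason::ToolCalls,
        (other, _) => other,
    }
}

/// Map the OpenAI-side reason to an Anthropic `stop_reason` string.
pub fn anthropic_stop_reason(reason: FinishReason) -> &'static str {
    match reason {
        FinishReason::Stop => "end_turn",
        FinishReason::Length => "max_tokens",
        FinishReason::ToolCalls => "tool_use",
    }
}
```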

Tests: JSON form still parses; Qwen-XML multi-param parse with
schema-driven string/integer/boolean coercion; no-schema type sniffing;
type-mismatch string fallback; unparseable body returns None.

Known gap (separate): the non-streaming run_inference paths have no
tool-call handling at all; Claude Code streams, so the streaming loops
are the ones that matter here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 10:39:38 +03:00
d662fa20ef fix(cortex): translate Anthropic tools to OpenAI shape + wire-debug logging
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m20s
build-prerelease / Build helexa-bench binary (push) Successful in 2m6s
build-prerelease / Build cortex binary (push) Successful in 2m20s
build-prerelease / Test (push) Successful in 4m12s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Build neuron-ada (push) Successful in 2m5s
build-prerelease / Build neuron-ampere (push) Successful in 4m44s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m42s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m48s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 53s
Claude Code (ANTHROPIC_BASE_URL -> cortex) hits POST /v1/messages, but
anthropic_to_openai forwarded the request's `tools` array verbatim via
the flattened `extra`. neuron feeds that straight into the HF chat
template, which iterates the OpenAI shape (tool.function.name/.parameters).
Anthropic-shaped tools ({name, description, input_schema}) rendered as
broken/empty definitions, the model improvised an unparseable
<tool_use_name>...</tool_use_name> tool-call format, neuron's
<tool_call>{json}</tool_call> detector missed it, and the markup fell
through as plain assistant text — so CC never received a structured
tool_use and the agent loop died.

Request-side translation now reshapes:
- tool definitions: {name, description, input_schema}
  -> {type:"function", function:{name, description, parameters}}
- tool_choice: auto->"auto", any->"required", none->"none",
  tool->{type:"function",function:{name}}
- assistant tool_use blocks -> OpenAI assistant.tool_calls
  (arguments JSON-stringified) — fixes multi-turn
- user tool_result blocks -> standalone role:"tool" messages keyed by
  tool_call_id
- system content blocks flatten to text instead of being JSON-serialised
  into the prompt; best-effort image-block -> image_url part
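
The tool_choice row of the list above, sketched with plain enums instead
of JSON values. Both enum types are illustrative; the real translation
operates on the request's JSON:

```rust
/// Anthropic-side `tool_choice` variants named in the list above.
pub enum AnthropicToolChoice {
    Auto,
    Any,
    None,
    Tool { name: String },
}

/// OpenAI-side result: either a bare mode string or a named function,
/// i.e. {type:"function",function:{name}} on the wire.
#[derive(Debug, PartialEq)]
pub enum OpenAiToolChoice {
    Mode(&'static str),
    Function { name: String },
}

pub fn translate_tool_choice(choice: AnthropicToolChoice) -> OpenAiToolChoice {
    match choice {
        AnthropicToolChoice::Auto => OpenAiToolChoice::Mode("auto"),
        // Anthropic "any" means "must call some tool".
        AnthropicToolChoice::Any => OpenAiToolChoice::Mode("required"),
        AnthropicToolChoice::None => OpenAiToolChoice::Mode("none"),
        AnthropicToolChoice::Tool { name } => OpenAiToolChoice::Function { name },
    }
}
```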

Wire-debug instrumentation (tracing levels only; cortex/neuron ship at
info, operator infra runs at debug):
- every handler emits a debug! "inbound request" line tagging the wire
  surface (anthropic | openai-chat | openai-responses | openai-completions)
  plus model/stream/tools and, for Anthropic, tool_history/system
- response side reports upstream_tool_calls + finish_reason, streaming
  and non-streaming
- full inbound + translated-upstream bodies at trace! (UTF-8-safe, capped)

Tests: 8 request-side unit tests + an end-to-end gateway test asserting
the upstream neuron receives OpenAI-shaped tools and a
user->assistant(+tool_calls)->tool->user history.

Also tighten script/infra-log-verbosity.sh: independent cortex/neuron
RUST_LOG args, cortex-only by default (neuron restart behind
--with-neuron so we don't needlessly cold-reload models), mkdir -p the
drop-in dir, symmetric RUST_LOG cleanup, and set -euo pipefail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 09:58:25 +03:00
d04f4ad704 feat(bench): show GPUs as the resource name instead of hostnames
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Successful in 2m34s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m54s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m15s
build-prerelease / Test (push) Successful in 5m11s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 56s
Public visitors don't know the hostnames, so surface each host's GPU(s)
as the resource name across the UI.

- store: gpu_label() turns the stored gpus_json into a compact label
  ("2× RTX 5090", "RTX 4090"); add `gpu` to ReportRow + RunRow and
  `host_gpus`/`model_gpus` maps to /api/dimensions (from each one's
  latest run). render_json gains gpu too.
- UI: Overview + Runs show a "GPU" column (gpu, fallback host); Runs'
  filter is now GPU-labelled (still filters by host underneath); Trends
  shows a "Measured on <gpu>" line for the selected model.
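
A sketch of the labelling, assuming the stored gpus_json has already been
decoded into a list of GPU names. The " + " joiner for mixed hosts is an
assumption, and the signature is illustrative rather than the store's
actual `gpu_label`:

```rust
/// Turn a host's GPU names into a compact label such as "2× RTX 5090".
/// Identical names are grouped and counted, in first-seen order.
pub fn gpu_label(names: &[&str]) -> String {
    let mut counts: Vec<(&str, usize)> = Vec::new();
    for &name in names {
        match counts.iter_mut().find(|entry| entry.0 == name) {
            Some(entry) => entry.1 += 1,
            None => counts.push((name, 1)),
        }
    }
    counts
        .iter()
        .map(|(name, count)| {
            if *count > 1 {
                format!("{count}× {name}")
            } else {
                name.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" + ")
}
```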

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:29:13 +03:00
e3879f093a feat(bench-ui): drop host selector from Trends; resolve host server-side
Some checks are pending
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m38s
build-prerelease / Test (push) Successful in 4m47s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Successful in 2m2s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m22s
Public visitors don't know the hostnames or per-host hardware, so the
host picker on Trends was confusing. Select by model + scenario only;
/api/series now takes host as optional and resolves it to the host
serving that (model, scenario) — coherent since each model maps to one
host today. Runs (drill-down) keeps its host filter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:19:09 +03:00
e4b9b88de0 feat(bench-ui): mark the baseline↔live regime boundary on Trends
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Add a dashed vertical ReferenceLine at the first live build (labelled
"bench.py → helexa-bench") so the intentional gap between the gateway
baseline and the direct-to-neuron series reads as a deliberate
measurement-regime change, not missing data. The two series stay
unconnected by design (different regimes, not directly comparable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:13:34 +03:00
21db334e37 feat(bench-ui): overlay pre-helexa-bench baseline on Trends
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Option C: a curated static baseline (bench/src/baseline.ts), transcribed
from doc/benchmarks.md (8f6f1d3 + a1952a4 post-#11), overlaid on the
Trends charts as a dashed, clearly-labelled historical series ahead of
the bench era. Host inferred from model via the doc's fleet table;
ordered by snapshot time so it anchors the timeline.

Kept deliberately separate from the live series (no DB/API change) — the
baseline is a different regime (bench.py through the cortex gateway,
medians only) so it's never merged into the direct-to-neuron line; a
caption spells out the distinction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:02:43 +03:00
7dd1ddcfba fix(infra-setup): stat LE live dir via sudo; rsync provisioner secret for bench.internal issuance
Some checks failed
build-prerelease / Resolve version stamps + change detection (push) Failing after 11m1s
build-prerelease / Lint (fmt + clippy) (push) Has been cancelled
build-prerelease / Test (push) Has been cancelled
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build helexa-bench binary (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-bench RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
- cert_present() must `sudo test -d /etc/letsencrypt/live/...` (root-only
  0700); without sudo it falsely reported "no cert" and downgraded the
  bench.helexa.ai vhost to the http-only bootstrap (dropping its 443
  server). Now correctly keeps the full TLS vhost.
- bench.internal initial cert: rsync the operator's JWK 'lair' provisioner
  password to the host transiently (root, 0600), issue via
  `step ca certificate`, then remove it (trap + belt-and-suspenders rm).

Verified: bench.helexa.ai (LE) and bench.internal (lair CA) both serve the
SPA + /api→bob; step@bench.timer renews; secret removed from host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:40:38 +03:00
4ee7da4f97 feat(bench-ui): internal vhost bench.internal + step@ cert renewal
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Inside the WireGuard mesh, bench.helexa.ai dead-ends at the OPNsense LAN
interface (only WAN :443 is port-forwarded), so add an internal path:

- asset/nginx/bench.internal.conf — server_name bench.internal, internal
  "lair" CA cert, same SPA + /api→bob proxy. Mirrors the *.internal vhost
  convention on oolon.kosherinata.internal.
- asset/systemd/step@.{service,timer} — replicate oolon's smallstep cert
  renewal (`step ca renew` via mTLS, every 15 min, then reload nginx).
- infra-setup.sh: install the step@ units + /etc/nginx/tls/{cert,key},
  install the vhost + enable step@bench.timer once the cert exists; prints
  the one-time issuance command otherwise.

Initial cert issuance (JWK provisioner) and bench.internal DNS are
operator steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:34:38 +03:00
db3cb95cbf fix(infra-setup): provision bench.helexa.ai cert via Cloudflare DNS-01 (ecdsa)
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
The webroot/http-01 approach needed nginx serving :80, but the gateway's
nginx was dormant. Switch to the host's established convention —
certbot --dns-cloudflare --key-type ecdsa with /root/.certbot-internal —
which needs neither nginx nor :80, so the cert provisions independently
of the vhost being served. Also restorecon the webroot (SELinux
enforcing → nginx 403 without httpd_sys_content_t), and only ever
install the full TLS vhost once the cert exists (http-only bootstrap
otherwise) so `nginx -t` always passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:54:24 +03:00
37c19aa985 feat(bench-ui): public hosting at https://bench.helexa.ai via gateway nginx
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Build neuron-blackwell (push) Successful in 1m32s
build-prerelease / Build neuron-ada (push) Successful in 2m15s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m29s
build-prerelease / Build helexa-bench binary (push) Successful in 2m25s
build-prerelease / Build cortex binary (push) Successful in 2m39s
build-prerelease / Build neuron-ampere (push) Successful in 2m48s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m30s
build-prerelease / Test (push) Successful in 4m38s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m37s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m39s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 51s
nginx on the gateway serves the bench SPA and reverse-proxies /api to the
bob bench API over WireGuard — public, auth-less, same-origin (no CORS),
internal API stays private.

- asset/nginx/bench.helexa.ai.conf (full TLS vhost: SPA + /api proxy) and
  a bootstrap http-only vhost for the initial ACME challenge.
- infra-setup.sh: one-time gateway setup — webroot, Let's Encrypt cert
  (certbot webroot, idempotent), install + enable the vhost.
- deploy.yml: deploy-bench-ui builds the SPA (setup-node) and rsyncs
  dist/ to /var/www/bench.helexa.ai every deploy; built same-origin so
  no VITE_API_BASE.
- cortex-host.conf: scoped gitea_ci rsync grant for the webroot.
- bench/README: production hosting notes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:40:29 +03:00
f50f5531cf feat(bench): read-only JSON API on bob + bench/ React visualisation app
Some checks are pending
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m21s
build-prerelease / Build cortex binary (push) Successful in 2m27s
build-prerelease / Build helexa-bench binary (push) Successful in 2m44s
build-prerelease / Test (push) Successful in 4m32s
build-prerelease / Build neuron-ampere (push) Successful in 2m7s
build-prerelease / Build neuron-ada (push) Successful in 2m28s
build-prerelease / Build neuron-blackwell (push) Successful in 2m59s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m19s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m42s
Part A — helexa-bench read API:
- [api] config (enabled, listen :13132); WAL on the store so API reads
  never block the sweep writer.
- store read methods: summary, series (chronological per-build medians),
  runs (filtered), dimensions, run_count.
- api.rs: axum /api/health|dimensions|summary|series|runs, permissive
  CORS (UI is a separate origin). The `run` daemon binds the API
  alongside the sweep; new `serve` subcommand serves API-only.
- listener plumbing (bench gains a port): data/helexa-bench-firewalld.xml,
  spec install, deploy-bench /api/health probe + firewalld step, sudoers
  firewall-cmd grants, [api] in example + bob.toml.
- 5 API tests + serve smoke.

Part B — bench/ Vite + React-SWC-TS app (router, react-bootstrap,
recharts): Overview (summary table), Trends (decode tok/s & TTFT across
build SHAs), Runs (filterable explorer). Typed API client with
VITE_API_BASE + dev proxy to bob. npm build/typecheck clean. Hosted
separately from the API (per design); .gitignore excludes node_modules/dist.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:26:55 +03:00
5999c8a5a3 Merge branch 'feat/deploy-bench-on-bob' into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 36s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
ci(deploy): deploy helexa-bench to bob + enable all fleet services on boot

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:17:11 +03:00
66833890c0 ci(deploy): deploy helexa-bench to bob + enable all fleet services on boot
All checks were successful
CI / CUDA type-check (push) Successful in 2m9s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m12s
CI / Test (push) Successful in 4m8s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Adds a deploy-bench job to deploy.yml that rolls helexa-bench onto bob
(the bench host, also running Agent Zero), following the deploy-cortex
pattern: manifest-gated skip-when-current, light "service stays active"
validation (outbound-only, no listener/model to probe), journal capture.
Runs alongside the cortex→neurons chain (no deploy-ordering dependency —
the sweep loop is version-aware).

Boot persistence: all systemd deployments now `systemctl enable --now`
instead of bare `start`, so cortex / neuron / helexa-bench come back
after a host reboot. Covers deploy.yml (all three services) and
deploy-dev.yml (neuron fast path); sudoers gain the matching
`enable --now <svc>` grant.

infra-setup.sh handles bob: provisions gitea_ci, installs the
bench-host sudoers, enables the lair-cafe-unstable repo (bob is a client
host without it), pre-creates /etc/helexa-bench, and syncs
asset/helexa-bench/bob.toml. New assets: bench-host.conf sudoers and
bob.toml (three neuron targets).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:10:07 +03:00
7bb20241a6 Merge branch 'feat/version-metadata-and-bench' into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Build neuron-ada (push) Successful in 2m13s
build-prerelease / Build neuron-ampere (push) Successful in 2m15s
build-prerelease / Build neuron-blackwell (push) Successful in 2m30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m34s
build-prerelease / Build cortex binary (push) Successful in 2m38s
build-prerelease / Build helexa-bench binary (push) Successful in 3m40s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m53s
build-prerelease / Test (push) Successful in 4m35s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m14s
build-prerelease / Package cortex RPM (push) Successful in 1m16s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m42s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 50s
feat(bench): version-aware benchmark harness + neuron build metadata

Adds GET /version build metadata to neuron and the helexa-bench crate — a continuous, version-aware harness that records fleet benchmarks into SQLite keyed by neuron build SHA, replacing manual bench.py runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 15:33:33 +03:00
42da25a37c feat(bench): version-aware benchmark harness + neuron build metadata
All checks were successful
CI / CUDA type-check (push) Successful in 1m36s
CI / Format (push) Successful in 31s
CI / Clippy (push) Successful in 2m47s
CI / Test (push) Successful in 4m33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Adds automated, longitudinal performance tracking across neuron builds,
replacing manual script/bench.py runs and hand edits to benchmarks.md.

neuron build metadata + GET /version:
- cortex-core: shared BuildInfo type (build_info.rs).
- neuron build.rs captures git SHA (preferring injected HELEXA_BUILD_SHA,
  else git, else "unknown"), dirty flag, build timestamp, rustc version,
  profile, target, enabled cargo features, and best-effort candle-core
  version from Cargo.lock.
- New GET /version endpoint (version.rs) + clap --version long form.
- SHA injected in CI (build-neuron step) and helexa-neuron.spec
  (%{?helexa_commit}) so tarball RPMs report the real SHA. /version is
  now the canonical "which build is live" probe.
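
The SHA precedence above (injected value, else git, else "unknown") can be
sketched as a pure function. This is an illustration only: `resolve_build_sha`
is a hypothetical name, and treating an empty injected value as missing is an
assumption, not something the commit states.

```rust
/// Hypothetical helper mirroring the described precedence:
/// injected SHA first, then the SHA reported by git, then "unknown".
/// Assumption: an empty or whitespace-only value counts as missing.
fn resolve_build_sha(injected: Option<&str>, from_git: Option<&str>) -> String {
    injected
        .filter(|s| !s.trim().is_empty())
        .or(from_git.filter(|s| !s.trim().is_empty()))
        .map(|s| s.trim().to_string())
        .unwrap_or_else(|| "unknown".to_string())
}
```

In a build script the two inputs would presumably come from the
`HELEXA_BUILD_SHA` environment variable and from invoking git; keeping the
precedence in a pure function makes it easy to unit-test.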

helexa-bench crate:
- Continuous daemon: hits each neuron directly on :13131, exercises each
  warm (status==loaded) model, records every run into a SQLite
  system-of-record stamped with the neuron's full BuildInfo.
- Version-aware: skips any (target, build SHA, model, scenario) cell
  already at samples_per_version, so a steady fleet costs only cheap
  /version + /models polls until a new SHA ships.
- Extensible Scenario trait; phase-1 chat-latency family ported verbatim
  from bench.py (synthetic 128/4096-tok prompts, /no_think, streamed
  TTFT + decode-window tok/s). `report` regenerates the benchmarks table.
- kind="openai" comparison targets scaffolded, not yet wired.
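
The version-aware skip described in the list above can be sketched as follows.
This is not the crate's code: `SampleCounts` is a hypothetical in-memory
stand-in for the SQLite store, and the `cell` helper and example values are
invented for illustration.

```rust
use std::collections::HashMap;

/// One benchmark cell, as described: (target, build SHA, model, scenario).
type Cell = (String, String, String, String);

/// Hypothetical in-memory stand-in for the SQLite system-of-record.
#[derive(Default)]
struct SampleCounts {
    counts: HashMap<Cell, u32>,
}

impl SampleCounts {
    /// Record one completed run for the cell.
    fn record(&mut self, cell: &Cell) {
        *self.counts.entry(cell.clone()).or_insert(0) += 1;
    }

    /// A cell still needs work until it holds `samples_per_version` runs.
    fn needs_run(&self, cell: &Cell, samples_per_version: u32) -> bool {
        self.counts.get(cell).copied().unwrap_or(0) < samples_per_version
    }
}

/// Convenience constructor for the illustration.
fn cell(target: &str, sha: &str, model: &str, scenario: &str) -> Cell {
    (
        target.to_string(),
        sha.to_string(),
        model.to_string(),
        scenario.to_string(),
    )
}
```

Because the build SHA is part of the key, a new neuron build produces fresh
cells automatically, while a steady fleet finds every cell already full and
skips the expensive runs.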

Packaging: data/helexa-bench.service (+ sysusers), prebuilt-binary RPM
spec (outbound-only, no firewalld), and build/package/publish wiring in
build-prerelease.yml with change detection.

Tests: cortex-core BuildInfo round-trip, neuron GET /version integration,
helexa-bench unit (prompt/SSE/config/store) + end-to-end sweep
(record -> skip -> resume on new SHA). Docs updated (benchmarks.md,
CLAUDE.md addendum).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 15:26:02 +03:00
30d50d6215 Merge pull request 'fix(ci): drop the unused flash-attn feature from neuron builds (#42)' (#46) from fix/42-drop-flash-attn into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Build neuron-blackwell (push) Successful in 1m28s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m43s
build-prerelease / Build cortex binary (push) Successful in 2m45s
build-prerelease / Test (push) Successful in 4m32s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 2m5s
build-prerelease / Build neuron-ada (push) Successful in 3m29s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 53s
2026-06-13 07:15:43 +00:00
9a312098dd fix(ci): drop the unused flash-attn feature from neuron builds (#42)
All checks were successful
CI / Format (push) Successful in 37s
CI / CUDA type-check (push) Successful in 2m9s
CI / Clippy (push) Successful in 2m17s
CI / Test (push) Successful in 4m56s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Format (pull_request) Successful in 34s
CI / CUDA type-check (pull_request) Successful in 1m37s
CI / Clippy (pull_request) Successful in 2m31s
CI / Test (pull_request) Successful in 4m45s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The neuron fleet builds with `cuda cudnn flash-attn`, but nothing in
neuron uses flash-attn: the qwen3_5 (27B) arch is hand-rolled, the
candle-transformers qwen3 model has no flash path, llama is built with
use_flash_attn=false, and `grep -r flash crates/neuron/src` is empty. The
feature only pulls in candle-flash-attn's sm_80/sm_86 CUDA kernel
sweep — which is exactly where ptxas SIGSEGVs/hangs in #42 (3 hits in
one day, the last a ~4-hour hang that stalled the whole deploy behind
the ampere job).

Dropping the feature removes the #42 failure surface at the root (not
a mitigation) and cuts the longest, most fragile part of each flavour
build. No runtime change — nothing called those kernels. Removed from
all three flavour builds in build-prerelease.yml and from deploy-dev.yml;
ci.yml's cuda-check already used `--features cuda` only.

Closes #42

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 09:43:14 +03:00
98e9749f22 Merge pull request 'feat(neuron): speculative decoding — acceptance core + config (#25, phase 1)' (#45) from feat/25-speculative-decoding into main
Some checks failed
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m11s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 5m22s
build-prerelease / Build neuron-blackwell (push) Successful in 10m12s
build-prerelease / Build neuron-ada (push) Successful in 14m21s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-13 06:39:49 +00:00
ec764a2cac feat(neuron): speculative decoding — acceptance core + config (#25, phase 1)
All checks were successful
CI / Format (push) Successful in 36s
CI / Format (pull_request) Successful in 31s
CI / CUDA type-check (push) Successful in 1m50s
CI / CUDA type-check (pull_request) Successful in 1m44s
CI / Clippy (push) Successful in 2m38s
CI / Test (push) Successful in 4m21s
CI / Clippy (pull_request) Successful in 2m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m37s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
First phase of speculative decoding: the pure, state-free acceptance
logic and per-target config, unit-tested in isolation before the
draft/verify loop and GDN-state rollback wire it into the generation
path.

greedy_accept walks the drafter's K proposed tokens against the
target's greedy token at each of the K+1 positions, accepting the
longest matching prefix and always committing one bonus token on top
(the target's correction at the first mismatch, or a free extra token
when the whole draft matched). So a round commits 1..=K+1 tokens —
never zero, guaranteeing forward progress even with a useless drafter.
Greedy is exact for temperature-0 (the fleet probe + #22 bench
regime); stochastic acceptance is a later phase.
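
A minimal sketch of the acceptance rule just described, assuming tokens are
plain `u32` ids. The name `greedy_accept_sketch` and its signature are
assumptions for illustration; the actual `greedy_accept` in neuron may differ.

```rust
/// Hypothetical sketch of greedy acceptance; the signature is an assumption.
/// `draft` holds the drafter's K proposed tokens; `target` holds the target's
/// greedy token at each of the K+1 positions.
/// Returns the tokens committed this round (always 1..=K+1 of them).
fn greedy_accept_sketch(draft: &[u32], target: &[u32]) -> Vec<u32> {
    assert_eq!(
        target.len(),
        draft.len() + 1,
        "target must cover K+1 positions"
    );
    // Length of the longest prefix on which drafter and target agree.
    let matched = draft
        .iter()
        .zip(target.iter())
        .take_while(|(d, t)| d == t)
        .count();
    // Commit the matched prefix plus one bonus token from the target:
    // its correction at the first mismatch, or the extra token after a
    // full match.
    target[..=matched].to_vec()
}
```

With a draft of `[1, 2, 3]` and target tokens `[1, 9, 7, 8]`, the sketch
commits `[1, 9]`: one accepted token plus the target's correction. A draft
that matches nowhere still commits the single target token at position 0,
which is the forward-progress guarantee.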

SpeculativeConfig carries the drafter id (must share the target's
tokenizer — Qwen3.5-0.8B for the Qwen3.6-27B target, both qwen3_5,
byte-identical tokenizer, confirmed on beast) and the draft length K.

6 unit tests: full accept, partial accept, zero accept (progress
guarantee), last-position mismatch, single-token draft, config
gating. Not yet wired into the decode path — phase 2 (single-GPU
draft/verify) follows. Design + phasing on the issue.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 08:30:21 +03:00
4c1bdba31d Merge pull request 'feat(neuron): chunk the single-GPU vision prefill (parity with TP) (#18)' (#44) from feat/18-single-gpu-vision-chunked into main
Some checks failed
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m11s
build-prerelease / Test (push) Successful in 4m7s
build-prerelease / Build neuron-blackwell (push) Successful in 9m55s
build-prerelease / Build neuron-ada (push) Successful in 13m10s
build-prerelease / Build neuron-ampere (push) Has started running
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-13 05:23:07 +00:00
988ef5afc2 feat(neuron): chunk the single-GPU vision prefill (parity with TP) (#18)
All checks were successful
CI / Format (push) Successful in 31s
CI / Format (pull_request) Successful in 41s
CI / CUDA type-check (pull_request) Successful in 1m27s
CI / CUDA type-check (push) Successful in 2m11s
CI / Clippy (push) Successful in 2m38s
CI / Clippy (pull_request) Successful in 2m31s
CI / Test (pull_request) Successful in 4m13s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Test (push) Successful in 4m39s
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
The single-GPU vision path was still single-shot: a long vision-bearing
prompt to a single-GPU-loaded qwen3_5 had the OOM exposure the TP path
shed in fa01350 (it was only guard-rejected, never served).

Mirror TpQwen3_5ForCausalLM::prefill_with_images_chunked onto the
single-GPU Qwen3_5ForCausalLM: encode the image(s) once, walk the
pre-expanded prompt in prefill_chunk_tokens() windows splicing the
per-chunk <|image_pad|> rows, accumulate KV + GDN state across chunks
via the growing offset, keep the last chunk's logits. Interleaved
M-RoPE positions are computed once over the whole prompt and sliced
per chunk (an image compresses the position space, so per-chunk offset
arithmetic would be wrong) — so Qwen3_5Model::forward_inner gains an
explicit position_ids path alongside the internal-from-grids
(single-shot) and plain (text/decode) paths, plus a forward_with_positions
entry point. The device-worker ForwardLogitsWithImages handler now
calls the chunked method; chunk size comes from prefill_chunk_tokens()
on the worker thread, so the Job/handle surface and the callers are
unchanged.
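
The windowing and position slicing described above can be sketched without
any model code. Everything here is illustrative: `chunk_windows` and
`positions_per_chunk` are hypothetical helpers, and positions are flattened to
one `u32` per token, whereas the real interleaved M-RoPE positions are
multi-dimensional.

```rust
use std::ops::Range;

/// Split a prompt of `prompt_len` tokens into consecutive windows of at most
/// `chunk_size` tokens. The last window may be shorter.
fn chunk_windows(prompt_len: usize, chunk_size: usize) -> Vec<Range<usize>> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    (0..prompt_len)
        .step_by(chunk_size)
        .map(|start| start..(start + chunk_size).min(prompt_len))
        .collect()
}

/// Slice position ids that were computed once over the whole prompt.
/// Each chunk reuses the precomputed values instead of deriving positions
/// from its own token offset.
fn positions_per_chunk(position_ids: &[u32], chunk_size: usize) -> Vec<&[u32]> {
    chunk_windows(position_ids.len(), chunk_size)
        .into_iter()
        .map(|window| &position_ids[window])
        .collect()
}
```

Slicing matters because position and token index diverge once an image is
present: in a toy sequence with positions `[0, 1, 2, 2, 2, 3]`, the second
chunk of size 4 starts at token 4 but at position 2, so adding the chunk's
token offset would yield the wrong positions.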

The shared validate_vision_prefill VRAM/KV backstop stays (TP keeps it
too) — chunking bounds activation memory, not the accumulating KV
cache, so the guard still does useful work.

Verified on real weights (Qwen3.5-0.8B): extended the #15 vision
reference test to also run the chunked path with chunk_size=64 over the
217-token prompt (4 chunks; the ~196-token image-pad run spans them).
Chunked vs single-shot logits: cosine 1.000000, max_abs 0.0001;
argmax matches the HF reference. The test covers all three
forward_inner branches (text plain / single-shot vision / chunked
vision) on a real single-GPU qwen3_5 load.
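
For reference, the three comparison metrics quoted above (cosine, max_abs,
argmax) can be computed as in this sketch. The function names are
hypothetical and this is not the test's actual code; it only shows what each
number measures.

```rust
/// Cosine similarity between two logit vectors, accumulated in f64.
/// Returns NaN if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// Largest element-wise absolute difference between the two vectors.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0, f32::max)
}

/// Index of the largest logit, or None for an empty slice.
fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|(_, x), (_, y)| x.total_cmp(y))
        .map(|(i, _)| i)
}
```

Cosine similarity checks that the two logit vectors point the same way,
max_abs bounds the worst single-element drift, and matching argmax confirms
the greedy next token is unchanged.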

Closes #18

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 08:17:11 +03:00
a1450789d2 Merge pull request 'docs(learnings): source-control P1 + P2 sprint learnings' (#43) from docs/learnings-p1-p2 into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 35s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
2026-06-13 04:21:11 +00:00
2eaa776d85 docs(learnings): source-control P1 + P2 sprint learnings
All checks were successful
CI / Format (push) Successful in 35s
CI / Format (pull_request) Successful in 35s
CI / CUDA type-check (push) Successful in 1m32s
CI / CUDA type-check (pull_request) Successful in 1m37s
CI / Clippy (push) Successful in 2m30s
CI / Clippy (pull_request) Successful in 2m31s
CI / Test (push) Successful in 4m24s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m27s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
doc/plan/* is gitignored, so the P1 learnings briefing could never be
committed. Move it to doc/learnings/p1.md (verbatim) and add
doc/learnings/p2.md capturing the P2 sprint (#11/#23/#1/#15).

The P2 doc's headline: CI green != correct. Four correctness bugs
passed every CI gate and surfaced only on the live fleet (post-gen
snapshots never re-match reasoning models; full-prompt snapshots
break on BPE retokenization; the chunked delta-rule's nilpotent-
squaring shortcut NaNs on correlated keys; the 0.8B masked two of
these by luck). Plus the device-worker/TP state patterns, the
deploy-dev + systemd-drop-in A/B loop, the per-package change-
detection fleet-split failure mode (#42), and the f32-fixture
numerical-validation rig (#15).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 07:13:36 +03:00
7918995e5a chore(ci): retrigger build-prerelease — ampere ptxas segfault (flash-attn sm_86, runner-side) on 538cc87
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m12s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 3m57s
build-prerelease / Build neuron-blackwell (push) Successful in 9m5s
build-prerelease / Build neuron-ampere (push) Successful in 15m3s
build-prerelease / Build neuron-ada (push) Successful in 19m4s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m13s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m50s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m12s
2026-06-13 00:12:24 +03:00
538cc87572 Merge pull request 'feat(neuron): numerical validation against the transformers reference (#15)' (#41) from feat/15-numerical-reference into main
Some checks failed
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m13s
build-prerelease / Build cortex binary (push) Successful in 2m43s
build-prerelease / Test (push) Successful in 4m25s
build-prerelease / Package cortex RPM (push) Successful in 1m39s
build-prerelease / Build neuron-ampere (push) Failing after 8m54s
build-prerelease / Build neuron-blackwell (push) Successful in 10m11s
build-prerelease / Build neuron-ada (push) Successful in 14m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 56s
2026-06-12 20:43:37 +00:00
1c4b53cbf1 feat(neuron): numerical validation against the transformers reference (#15)
All checks were successful
CI / Format (push) Successful in 44s
CI / Format (pull_request) Successful in 37s
CI / CUDA type-check (push) Successful in 1m39s
CI / CUDA type-check (pull_request) Successful in 2m6s
CI / Clippy (push) Successful in 2m23s
CI / Clippy (pull_request) Successful in 2m24s
CI / Test (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m1s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
script/dump_reference.py captures fixtures from the HF qwen3_5
implementation (token ids + reference tensors, f32 by default so the
comparison pins math rather than dtype noise);
tests/numerical_reference.rs replays them through our arch and
asserts argmax equality, cosine similarity, and max-abs ceilings. The
tests self-skip without NEURON_REF_MODEL_PATH so CI stays green
without weights.
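
For illustration, the three checks named above (argmax equality, cosine similarity, max-abs ceiling) can be sketched over plain slices. This is a minimal sketch, not the code in tests/numerical_reference.rs: the function names and the idea of passing thresholds as parameters are assumptions, and inputs are assumed finite.

```rust
/// Index of the largest element (first one wins on ties); None if empty.
/// Assumes finite inputs.
fn argmax(xs: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &x) in xs.iter().enumerate() {
        match best {
            Some((_, b)) if x <= b => {}
            _ => best = Some((i, x)),
        }
    }
    best.map(|(i, _)| i)
}

/// Cosine similarity, accumulated in f64 so the metric itself adds no f32 noise.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Largest element-wise absolute difference.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

/// All three checks at once; `min_cosine` and `max_abs` are hypothetical thresholds.
fn matches_reference(ours: &[f32], reference: &[f32], min_cosine: f64, max_abs: f32) -> bool {
    argmax(ours) == argmax(reference)
        && cosine(ours, reference) >= min_cosine
        && max_abs_diff(ours, reference) <= max_abs
}
```

Checking argmax separately from cosine matters because two logit vectors can be nearly parallel and still disagree on the top token.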

Measured on beast (f32-vs-f32): text logits max_abs 0.000 / cosine
1.000000 (the >64-token prompt routes through the chunked GDN
prefill, so the production prefill math is what's validated); vision
tower cosine 0.999998, end-to-end vision logits cosine 1.000000 with
identical argmax. Mutation sensitivity: NEURON_VISION_LEGACY_POS=1
collapses tower cosine to 0.75 and fails loudly.

One production fidelity fix the harness surfaced: the pos-embed
bilinear blend now accumulates in f32 and casts once at the end,
matching the reference (we previously rounded the weights to bf16
before blending).

Fixtures: 0.8B text + vision (f32), 27B text (bf16 — an f32 27B
forward needs ~108 GB; the automated comparison runs against the
0.8B, which executes the same arch modules). Regeneration documented
in tests/fixtures/numerical/README.md.

Closes #15

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:35:57 +03:00
49a8dbcd28 Merge pull request 'perf(neuron): parallel in-situ quantization + cold-load phase timing (#1)' (#40) from perf/1-parallel-isq into main
Some checks are pending
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m11s
build-prerelease / Build cortex binary (push) Successful in 2m21s
build-prerelease / Test (push) Successful in 3m56s
build-prerelease / Build neuron-blackwell (push) Successful in 9m38s
build-prerelease / Build neuron-ada (push) Successful in 14m21s
build-prerelease / Build neuron-ampere (push) Successful in 19m0s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m24s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m12s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 4m29s
2026-06-12 20:12:44 +00:00
90e971dcf5 perf(neuron): parallel in-situ quantization + cold-load phase timing (#1)
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 35s
CI / CUDA type-check (push) Successful in 1m50s
CI / CUDA type-check (pull_request) Successful in 2m7s
CI / Clippy (pull_request) Successful in 2m18s
CI / Clippy (push) Successful in 2m46s
CI / Test (push) Successful in 5m33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 5m33s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
QTensor::quantize runs its per-block math strictly sequentially on
one core (CUDA storage round-trips through the same CPU path), which
made Q6K ISQ the dominant phase of the 27B TP cold load. Blocks are
independent, so quantize_parallel re-implements the same encoding
through candle's public per-block API (k_quants::GgmlType::from_float)
with rayon fanning blocks across the CPU pool — byte-identical output,
pinned by parity tests against QTensor::quantize for Q6K/Q5K/Q4K/Q8_0.
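
The block-independence argument can be sketched without candle or rayon. The example below is a hedged illustration only: it swaps rayon for std::thread::scope, and `encode_block` is a toy absmax/i8 encoder, not the GGML k-quant format. All names and the 32-element block size are assumptions made for the sketch. The point it shows is that each worker writes a disjoint output range, so the result is byte-identical to the sequential loop.

```rust
use std::thread;

const BLOCK: usize = 32; // elements per block (toy value)
const ENCODED: usize = 4 + BLOCK; // f32 scale + one i8 per element

/// Toy per-block encoder: absmax scale followed by rounded i8 values.
fn encode_block(src: &[f32], dst: &mut [u8]) {
    let amax = src.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = amax / 127.0;
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    dst[..4].copy_from_slice(&scale.to_le_bytes());
    for (d, &x) in dst[4..].iter_mut().zip(src) {
        *d = (x * inv).round() as i8 as u8;
    }
}

/// Reference path: one block after another on the calling thread.
fn quantize_sequential(data: &[f32]) -> Vec<u8> {
    assert_eq!(data.len() % BLOCK, 0);
    let mut out = vec![0u8; data.len() / BLOCK * ENCODED];
    for (s, d) in data.chunks_exact(BLOCK).zip(out.chunks_exact_mut(ENCODED)) {
        encode_block(s, d);
    }
    out
}

/// Parallel path: contiguous runs of blocks are handed to scoped threads.
/// Workers only touch host memory; allocation and return stay on the caller.
fn quantize_blocks_parallel(data: &[f32], workers: usize) -> Vec<u8> {
    assert_eq!(data.len() % BLOCK, 0);
    let blocks = data.len() / BLOCK;
    let mut out = vec![0u8; blocks * ENCODED];
    let w = workers.max(1);
    let per = ((blocks + w - 1) / w).max(1); // blocks per worker, rounded up
    thread::scope(|scope| {
        let inputs = data.chunks(per * BLOCK);
        let outputs = out.chunks_mut(per * ENCODED);
        for (src, dst) in inputs.zip(outputs) {
            scope.spawn(move || {
                for (s, d) in src.chunks_exact(BLOCK).zip(dst.chunks_exact_mut(ENCODED)) {
                    encode_block(s, d);
                }
            });
        }
    });
    out
}
```

A parity test in this style compares the two byte vectors directly for several worker counts, which is the same shape of check as the parity tests described above.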

Threading discipline holds: the device-to-host read and the
QStorage::from_data upload stay on the calling thread (device worker /
subprocess main); rayon workers touch host memory only.

Also adds the per-phase timing the issue asked for first: per-layer
debug + layer-loop total + lm_head info lines, so the next cold load
shows where the time actually goes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:47:57 +03:00
92273eb936 chore(ci): retrigger build-prerelease — ampere/blackwell packaging skipped after transient build failure on 128b381
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m10s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 3m54s
build-prerelease / Build neuron-blackwell (push) Successful in 9m36s
build-prerelease / Build neuron-ada (push) Successful in 14m6s
build-prerelease / Build neuron-ampere (push) Successful in 19m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m8s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m8s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
2026-06-12 22:38:31 +03:00
128b3818cb Merge pull request 'perf(neuron): chunked delta-rule prefill for Gated DeltaNet (#23)' (#39) from perf/23-chunked-gdn-prefill into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m13s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 3m57s
build-prerelease / Build neuron-blackwell (push) Successful in 10m3s
build-prerelease / Build neuron-ada (push) Successful in 14m11s
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 14m3s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m10s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 58s
2026-06-12 18:44:22 +00:00
812d191e50 fix(neuron): UT transform by forward substitution, not nilpotent squaring
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 53s
CI / CUDA type-check (push) Successful in 1m52s
CI / CUDA type-check (pull_request) Successful in 2m12s
CI / Clippy (push) Successful in 2m18s
CI / Clippy (pull_request) Successful in 2m36s
CI / Test (push) Successful in 4m18s
CI / Test (pull_request) Successful in 4m22s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Live A/B on beast produced NaN logits ("!!!" replies) on real prompts:
the nilpotent-squaring form of (I - T)^-1 computes raw powers of T,
whose entries grow combinatorially (path counts ~ C(62,31)) before
nilpotency collapses them — fine on uncorrelated test data, f32
precision death on real prompts whose repetitive text makes keys
highly correlated. The reference's forward-substitution loop never
forms raw powers; its intermediates are the convergent M entries.

Port the reference loop faithfully (rows accumulate into a fresh
tensor). New adversarial parity test with near-identical keys and
beta ~= 1 diverges to 8e30 under the squaring form and passes under
forward substitution.
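
A small sketch of the two forms, for readers who want to see the structural difference. It uses dense f64 matrices instead of candle tensors, and every name in it is hypothetical. On well-conditioned input both functions return the same (I - T)^-1; the difference is that the squaring form materialises raw powers of T in `power`, while forward substitution only ever combines rows of the result.

```rust
type Mat = Vec<Vec<f64>>;

fn identity(n: usize) -> Mat {
    (0..n)
        .map(|i| (0..n).map(|j| if i == j { 1.0 } else { 0.0 }).collect())
        .collect()
}

fn matmul(a: &Mat, b: &Mat) -> Mat {
    let n = a.len();
    let mut c = vec![vec![0.0; n]; n];
    for i in 0..n {
        for k in 0..n {
            for j in 0..n {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

/// (I - T)^-1 for strictly lower triangular T, by forward substitution:
/// row i of the result is e_i + sum_{j<i} T[i][j] * (row j of the result).
fn inverse_forward_substitution(t: &Mat) -> Mat {
    let n = t.len();
    let mut m = identity(n);
    for i in 0..n {
        for j in 0..i {
            let tij = t[i][j];
            // Row j is already final and is zero right of the diagonal.
            for k in 0..=j {
                let add = tij * m[j][k];
                m[i][k] += add;
            }
        }
    }
    m
}

/// The same inverse as prod_j (I + T^(2^j)), valid because T^n = 0.
fn inverse_nilpotent_squaring(t: &Mat) -> Mat {
    let n = t.len();
    let mut acc = identity(n); // invariant: acc = sum of T^k for k < covered
    let mut power = t.clone(); // raw power T^(2^j)
    let mut covered = 1;
    while covered < n {
        let mut factor = power.clone();
        for i in 0..n {
            factor[i][i] += 1.0;
        }
        acc = matmul(&acc, &factor);
        power = matmul(&power, &power);
        covered *= 2;
    }
    acc
}
```

The adversarial parity test described above exercises exactly the case this sketch does not: inputs where the entries of `power` grow large before nilpotency zeroes them.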

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:18:32 +03:00
2a9def6d2d perf(neuron): chunked delta-rule prefill for Gated DeltaNet (#23)
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 24s
CI / CUDA type-check (push) Successful in 1m38s
CI / CUDA type-check (pull_request) Successful in 2m10s
CI / Clippy (push) Successful in 2m34s
CI / Test (push) Successful in 4m20s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m29s
CI / Test (pull_request) Successful in 4m21s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Prefill (seq_len >= 64) now runs the chunk-parallel gated delta rule
ported from the HF reference torch_chunk_gated_delta_rule
(chunk_size=64): identical math reorganised into per-chunk batched
matmuls (cuBLAS/tensor cores on CUDA, gemm on CPU) instead of the
O(L)-sequential per-token recurrence. Decode steps and short prompts
keep the recurrent paths (CUDA kernel / Rust loop) unchanged.

One deliberate deviation from the reference: its in-place row-by-row
UT-transform computes (I - T)^-1 - I by forward substitution; T is
strictly lower triangular and therefore nilpotent at chunk size 64,
so the same inverse is a product of six factors built by repeated squaring,
prod_{j=0..5}(I + T^(2^j)) — batched matmuls instead of 63 sequential
row updates, which suits candle's immutable tensors. Chunk-local math
runs rank-3 over a flattened B*H*N batch dim (candle matmul supports
at most two batch dims).
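
The identity behind that product, written out as a short derivation (standard algebra for a nilpotent matrix, stated here for chunk size 64):

```latex
\[
(I - T)\sum_{k=0}^{63} T^{k} = I - T^{64} = I
\quad\Longrightarrow\quad
(I - T)^{-1} = \sum_{k=0}^{63} T^{k},
\qquad\text{since } T^{64} = 0 .
\]
\[
(I + T)(I + T^{2}) = I + T + T^{2} + T^{3},
\qquad
\prod_{j=0}^{5}\bigl(I + T^{2^{j}}\bigr) = \sum_{k=0}^{2^{6}-1} T^{k} = (I - T)^{-1}.
\]
```

Each additional factor doubles the number of terms in the partial sum, so six factors cover all 64 powers.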

Initial-state continuation is supported, so chunked prefill composes
with #11's restored prefix snapshots. Both single-GPU and TP paths
pick this up through the shared run_delta_rule dispatch.
NEURON_GDN_CHUNKED=0 forces the recurrent paths for A/B measurement.

Parity tests pin chunked against recurrent (2e-4 abs) across padding
(L=130), exact multiples with non-zero initial state (L=128 after a
50-token prefix), and a single exact chunk.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:51:51 +03:00
ddb331e1a3 Merge pull request 'docs(bench): record post-#11 fleet numbers' (#38) from docs/benchmarks-post-11 into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m47s
build-prerelease / Test (push) Successful in 4m25s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
2026-06-12 17:14:00 +00:00
df0bf4c518 docs(bench): record post-#11 fleet numbers
All checks were successful
CI / Format (push) Successful in 37s
CI / Format (pull_request) Successful in 37s
CI / CUDA type-check (push) Successful in 1m31s
CI / CUDA type-check (pull_request) Successful in 2m7s
CI / Clippy (push) Successful in 2m24s
CI / Test (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m24s
CI / Test (pull_request) Successful in 3m57s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Appends the 2026-06-12 post-prefix-cache run: 27B @4k warm TTFT
7.07 s -> 1.43 s, no-cache control models unchanged, with a
methodology note that repeated-prompt cells now measure warm TTFT on
qwen3_5-arch models.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:06:53 +03:00
a1952a4522 Merge pull request 'fix(neuron): snapshot at the last special-token boundary (#11)' (#37) from fix/11-snapshot-cut-retokenization into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m37s
build-prerelease / Test (push) Successful in 4m21s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 10m22s
build-prerelease / Build neuron-ampere (push) Successful in 13m8s
build-prerelease / Build neuron-ada (push) Successful in 21m31s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
2026-06-12 16:24:15 +00:00
4f266dbd82 fix(neuron): snapshot at the last special-token boundary (#11)
All checks were successful
CI / Format (push) Successful in 42s
CI / Format (pull_request) Successful in 34s
CI / CUDA type-check (push) Successful in 1m31s
CI / Clippy (push) Successful in 2m19s
CI / CUDA type-check (pull_request) Successful in 2m10s
CI / Test (push) Successful in 4m13s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m9s
CI / Test (pull_request) Successful in 4m5s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Second finding from live 27B validation: prompt-covering snapshots
still never matched. The rendered prompt ends with
`<|im_start|>assistant\n`, and when the next turn re-tokenizes that
text followed by the assistant's reply, BPE merges the trailing
newline with the reply's first characters — the final token(s) of the
cached sequence differ from the next prompt's, so the exact-prefix
match never fires. (A reply starting with an atomic special token
like <think> masks this, which is why the 0.8B check passed.)

Snapshot one past the last <|im_start|> instead: special tokens are
hard segmentation points, so ids up to and including it are provably
identical across renders. Prefill pauses at that boundary to capture
the snapshot, then finishes the ~2-token `assistant\n` tail. Applied
to all six request paths; unit tests for the cut helper.
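
A minimal sketch of what such a cut helper could look like, assuming token ids are u32 and the id of <|im_start|> is passed in by the caller. The function names are hypothetical and the id used in any example is a placeholder, not the tokenizer's real value.

```rust
/// Number of leading tokens to snapshot: everything up to and including the
/// last occurrence of `boundary`. None if the boundary token never appears.
fn snapshot_cut(tokens: &[u32], boundary: u32) -> Option<usize> {
    tokens.iter().rposition(|&t| t == boundary).map(|i| i + 1)
}

/// Splits a prompt into (head to prefill before the snapshot, tail to prefill after).
/// With no boundary token the head is empty, so nothing would be snapshotted.
fn split_prefill(tokens: &[u32], boundary: u32) -> (&[u32], &[u32]) {
    tokens.split_at(snapshot_cut(tokens, boundary).unwrap_or(0))
}
```

For a rendered chat prompt the tail is the short `assistant\n` suffix, so pausing at the cut costs one extra forward call over a handful of tokens.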

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:16:45 +03:00
43a6d96d5f Merge pull request 'fix(neuron): snapshot prefix cache at the prefill boundary (#11)' (#36) from fix/11-prefix-snapshot-at-prefill into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 36s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m16s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 4m2s
build-prerelease / Build neuron-ampere (push) Successful in 13m22s
build-prerelease / Build neuron-blackwell (push) Successful in 13m31s
build-prerelease / Build neuron-ada (push) Successful in 14m25s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m12s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m50s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m6s
2026-06-12 15:34:59 +00:00
3fd1989b2b fix(neuron): snapshot prefix cache at the prefill boundary (#11)
All checks were successful
CI / Format (push) Successful in 41s
CI / Format (pull_request) Successful in 42s
CI / CUDA type-check (push) Successful in 1m39s
CI / CUDA type-check (pull_request) Successful in 2m6s
CI / Clippy (push) Successful in 3m10s
CI / Clippy (pull_request) Successful in 3m3s
CI / Test (pull_request) Successful in 4m2s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
CI / Test (push) Successful in 5m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Live validation on beast's Qwen3.6-27B showed reused=0 on every turn:
the post-generation snapshot includes reasoning tokens (<think>...)
that get stripped when the client echoes the assistant message back,
so the cached sequence is never a token-prefix of the next prompt.
quadbrat's 0.8B only matched because its think block round-tripped as
literal text.

Snapshot after prefill instead (covering exactly the prompt tokens) —
that is the state the next turn provably extends under a stable chat
template, regardless of how reasoning or tool-call content is
transformed on echo. Taken after the first healthy sample so
NaN-poisoned prefills never cache their state; this also retires the
forwarded-token bookkeeping and the consumer-hangup store sites.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 18:29:00 +03:00
f7952547e7 Merge pull request 'feat(neuron): prefix KV caching for the TP path (#11)' (#35) from feat/11-prefix-kv-cache-tp into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m12s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 3m58s
build-prerelease / Build neuron-blackwell (push) Successful in 9m5s
build-prerelease / Build neuron-ada (push) Successful in 14m22s
build-prerelease / Build neuron-ampere (push) Successful in 19m0s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m51s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
2026-06-12 14:49:19 +00:00
7e66f77851 fix(neuron): CUDA type-check fixes for TP prefix cache
All checks were successful
CI / Format (push) Successful in 38s
CI / Format (pull_request) Successful in 39s
CI / CUDA type-check (pull_request) Successful in 1m26s
CI / CUDA type-check (push) Successful in 1m34s
CI / Clippy (push) Successful in 3m14s
CI / Clippy (pull_request) Successful in 3m18s
CI / Test (push) Successful in 5m15s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 3m56s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Two errors only the cuda config surfaces: the TpSnapshotKv dispatch
arms mixed candle and anyhow error types, and restore_or_clear_tp held
the registry MutexGuard across the cleanup await inside a let-chain
(making the TP request futures non-Send). Bind the removed ref before
awaiting, same discipline as the other lock sites.
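
The locking discipline can be sketched with std only. Everything here is a stand-in: `Registry`, `cleanup`, and `remove_and_cleanup` are hypothetical names, and the tiny `block_on` exists only so the example runs without an async runtime. The pattern it shows is removing the entry inside a scope that ends before the first await, so the MutexGuard is never part of the future's state and the future stays Send.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

type Registry = Arc<Mutex<HashMap<u64, String>>>;

/// Stand-in for the async cleanup call (hypothetical).
async fn cleanup(entry: String) -> usize {
    entry.len()
}

/// Remove under the lock, release the guard, then await.
async fn remove_and_cleanup(registry: Registry, id: u64) -> Option<usize> {
    // The guard lives only inside this block, which ends before any await.
    // Writing `if let Some(e) = registry.lock().unwrap().remove(&id) { ...await }`
    // would instead keep the guard alive across the await.
    let removed = {
        let mut guard = registry.lock().unwrap();
        guard.remove(&id)
    };
    match removed {
        Some(entry) => Some(cleanup(entry).await),
        None => None,
    }
}

/// Compile-time check that a value is Send.
fn assert_send<T: Send>(_: &T) {}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Busy-polling executor, adequate only for futures that complete promptly.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

Calling `assert_send` on the future turns the non-Send regression into a compile error on every configuration, not only the one that spawns the task.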

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:39:32 +03:00
e629e1872c feat(neuron): prefix KV caching for the TP path (#11)
Some checks failed
CI / Format (push) Successful in 37s
CI / Format (pull_request) Successful in 31s
CI / CUDA type-check (push) Failing after 1m55s
CI / CUDA type-check (pull_request) Failing after 1m47s
CI / Clippy (push) Successful in 2m11s
CI / Test (push) Successful in 4m15s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m23s
CI / Test (pull_request) Successful in 4m0s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Extends the prefix cache to tensor-parallel models — Qwen3.6-27B on
beast, where the TTFT win is largest. Closes #11.

Every rank holds its shard's snapshot under one pool-minted id: the
leader's lives in the device worker beside the TP slab
(Job::TpSnapshotKv / TpRestoreKv / TpDropKvSnapshot), each subprocess
rank stores its own in-process via new WorkerRequest variants
(SnapshotKvCache / RestoreKvCache / DropKvSnapshot). Shard state has
the same shape as single-GPU (attention ConcatKvCache + GDN
conv/recurrent state + rope_delta), so the snapshot types are reused;
all ranks sit at the same token boundary because step fan-out is
synchronous.

Consistency on partial failure: a failed restore falls back to
clear-all-ranks + full prefill (and drops the entry); a failed
snapshot drops the id on every rank so nothing half-stored leaks.
DropTp / UnloadModel invalidate a model's snapshots with it, covering
auto-recovery. Vision requests bypass as on single-GPU. Budget
accounting uses leader bytes x world_size (shards are symmetric).

Wired into both TP request paths (non-streaming inner + streaming
orchestration task); chunked_prefill_tp gains the restored-offset
start.

Closes #11

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:34:49 +03:00
bb558451db Merge pull request 'feat(neuron): prefix KV caching across requests — single-GPU + CPU paths (#11)' (#34) from feat/11-prefix-kv-cache into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m15s
build-prerelease / Test (push) Successful in 4m0s
build-prerelease / Build neuron-blackwell (push) Successful in 9m44s
build-prerelease / Build neuron-ampere (push) Successful in 12m47s
build-prerelease / Build neuron-ada (push) Successful in 19m6s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 4m2s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m10s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 4m8s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
2026-06-12 14:20:24 +00:00
c5378d532d feat(neuron): prefix KV caching across requests — single-GPU + CPU paths (#11)
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 34s
CI / Clippy (push) Successful in 2m29s
CI / CUDA type-check (pull_request) Successful in 1m31s
CI / CUDA type-check (push) Successful in 1m37s
CI / Clippy (pull_request) Successful in 2m32s
CI / Test (push) Successful in 4m24s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m23s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Stop discarding cache state between requests. When an incoming
prompt's token sequence starts with the exact tokens of a stored
snapshot, restore it and prefill only the divergent suffix.
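
As a sketch of the matching rule, assuming snapshots are identified by an opaque id plus their token sequence: pick the longest stored sequence that is an exact prefix of the incoming prompt, and prefill only what follows it. The `Snapshot` struct and both function names are hypothetical.

```rust
/// Async-side view of a stored snapshot (hypothetical shape).
struct Snapshot {
    id: u64,
    tokens: Vec<u32>,
}

/// Longest stored snapshot whose tokens are an exact prefix of `prompt`.
fn best_prefix<'a>(snapshots: &'a [Snapshot], prompt: &[u32]) -> Option<&'a Snapshot> {
    snapshots
        .iter()
        .filter(|s| !s.tokens.is_empty() && prompt.starts_with(&s.tokens))
        .max_by_key(|s| s.tokens.len())
}

/// Returns (id of the snapshot to restore, suffix that still needs prefill).
/// The suffix can be empty when a snapshot covers the whole prompt; how that
/// case is handled is not described here and is left to the caller.
fn plan_prefill<'p>(snapshots: &[Snapshot], prompt: &'p [u32]) -> (Option<u64>, &'p [u32]) {
    match best_prefix(snapshots, prompt) {
        Some(s) => (Some(s.id), &prompt[s.tokens.len()..]),
        None => (None, prompt),
    }
}
```

There is no partial match: a snapshot that diverges from the prompt one token before its end is unusable, which is the exact-prefix-only constraint in this sketch.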

For the hybrid qwen3_5 arch a snapshot is attention ConcatKvCache k/v
+ GatedDeltaNet conv/recurrent state + the rope_delta counter, all at
one token boundary; the recurrent state cannot rewind, so matching is
exact-prefix only. GDN states are deep-copied both directions (the
CUDA delta-rule kernels mutate the state buffer in place); attention
k/v snapshots share storage safely (append-by-cat never mutates).

Snapshots live in the device worker's state next to the model slab
(Job::SnapshotKv / RestoreKv / DropKvSnapshot); the async side holds
only an opaque id + token sequence + byte size. DropArch drops a
model's snapshots with it, so unload and auto-recovery invalidate for
free. CPU loads hold snapshots inline on the legacy path.

Per-model LRU registry (harness/prefix_cache.rs) bounded by
[harness.candle.prefix_cache] budget_mb / max_entries, enabled by
default; inserting a snapshot drops entries it strictly extends.
Vision requests and candle-transformers archs bypass the cache
entirely (clear-every-request, unchanged).
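
A hedged sketch of the registry's insert path, reading "drops entries it strictly extends" as: stored entries that are strict prefixes of the new snapshot are removed. The types, field names, and the choice of a VecDeque ordered by recency are assumptions for illustration; this is not the code in harness/prefix_cache.rs, and it omits the move-to-back on a cache hit that a real LRU needs.

```rust
use std::collections::VecDeque;

struct Entry {
    id: u64,
    tokens: Vec<u32>,
    bytes: usize,
}

struct PrefixRegistry {
    entries: VecDeque<Entry>, // front = least recently used
    budget_bytes: usize,
    max_entries: usize,
}

impl PrefixRegistry {
    fn new(budget_bytes: usize, max_entries: usize) -> Self {
        Self { entries: VecDeque::new(), budget_bytes, max_entries }
    }

    fn total_bytes(&self) -> usize {
        self.entries.iter().map(|e| e.bytes).sum()
    }

    /// Inserts `new` and returns the ids of every entry dropped as a result,
    /// so the caller can free the matching device-side snapshots.
    fn insert(&mut self, new: Entry) -> Vec<u64> {
        let mut dropped = Vec::new();
        // Entries that are strict prefixes of the new one are superseded by it.
        self.entries.retain(|e| {
            let superseded =
                e.tokens.len() < new.tokens.len() && new.tokens.starts_with(&e.tokens);
            if superseded {
                dropped.push(e.id);
            }
            !superseded
        });
        self.entries.push_back(new);
        // Evict least recently used entries until both limits hold. An entry
        // larger than the whole budget ends up evicting itself.
        while self.entries.len() > self.max_entries || self.total_bytes() > self.budget_bytes {
            match self.entries.pop_front() {
                Some(e) => dropped.push(e.id),
                None => break,
            }
        }
        dropped
    }
}
```

Returning the dropped ids keeps the registry free of device handles, matching the split where the async side holds only an id, a token sequence, and a byte size.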

Covers the single-GPU worker path (streaming + non-streaming) and the
CPU-local path. The TP path (Qwen3.6-27B on beast) is a follow-up PR
that closes #11 with before/after bench numbers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:14:07 +03:00
9f383e7bc7 Merge pull request 'feat(gateway): Anthropic streaming SSE translation (#24)' (#33) from feat/gateway-24-anthropic-sse into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m13s
build-prerelease / Test (push) Successful in 3m59s
build-prerelease / Build cortex binary (push) Successful in 2m15s
build-prerelease / Build neuron-blackwell (push) Successful in 9m59s
build-prerelease / Build neuron-ada (push) Successful in 14m24s
build-prerelease / Build neuron-ampere (push) Successful in 19m3s
build-prerelease / Package cortex RPM (push) Successful in 1m24s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m59s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s
2026-06-12 12:57:09 +00:00
569c528c4b feat(gateway): Anthropic streaming SSE translation (#24)
All checks were successful
CI / Format (push) Successful in 36s
CI / CUDA type-check (push) Successful in 2m25s
CI / Clippy (push) Successful in 2m25s
CI / Format (pull_request) Successful in 41s
CI / CUDA type-check (pull_request) Successful in 2m9s
CI / Clippy (pull_request) Successful in 2m45s
CI / Test (push) Successful in 5m3s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m29s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The /v1/messages handler translated request envelopes but proxied raw
OpenAI SSE frames back to streaming Anthropic clients — the gap
between the README's "point your tooling at it once" contract and
what Claude Code actually received.

cortex-core gains AnthropicStreamTranslator, a pure per-stream state
machine: OpenAI chunks in, ordered (event, payload) pairs out —
message_start → content_block_start/delta/stop (text and tool_use
blocks, indexed; tool_calls map to input_json_delta) → message_delta
(stop_reason mapped via the now-shared map_stop_reason, which also
teaches the non-streaming path tool_calls→tool_use) → message_stop.
Without an upstream usage frame the output count falls back to the
delta count (engine-exact for neuron's one-chunk-per-token streams,
#31); with one, input/output tokens ride message_delta.
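The event ordering described above can be sketched roughly as follows. This is a simplified, text-only illustration in Python rather than the Rust of cortex-core: the class name `TextOnlyTranslator`, the trimmed payload shapes, and every `STOP_REASONS` entry other than tool_calls→tool_use are assumptions, and tool_use blocks are left out entirely.

```python
# Assumed finish_reason mapping; the commit message only states
# tool_calls -> tool_use explicitly.
STOP_REASONS = {"stop": "end_turn", "length": "max_tokens", "tool_calls": "tool_use"}


class TextOnlyTranslator:
    """Hypothetical text-only stand-in for AnthropicStreamTranslator."""

    def __init__(self):
        self.started = False      # message_start already emitted
        self.block_open = False   # text block 0 is open
        self.finished = False     # message_stop already emitted
        self.stop_reason = "end_turn"
        self.deltas = 0           # fallback output-token count
        self.usage = None         # upstream usage object, if one arrives

    def push(self, chunk):
        """Feed one parsed OpenAI chunk; return ordered (event, payload) pairs."""
        out = []
        if self.finished:
            return out
        if not self.started:
            self.started = True
            out.append(("message_start", {"type": "message_start"}))
        if chunk.get("usage"):
            self.usage = chunk["usage"]
        for choice in chunk.get("choices") or []:
            text = (choice.get("delta") or {}).get("content")
            if text:
                if not self.block_open:
                    self.block_open = True
                    out.append(("content_block_start",
                                {"type": "content_block_start", "index": 0}))
                self.deltas += 1
                out.append(("content_block_delta", {
                    "type": "content_block_delta", "index": 0,
                    "delta": {"type": "text_delta", "text": text},
                }))
            if choice.get("finish_reason"):
                self.stop_reason = STOP_REASONS.get(choice["finish_reason"], "end_turn")
        return out

    def finish(self):
        """Close the event sequence; a second call emits nothing."""
        if self.finished:
            return []
        out = self.push({})   # guarantees message_start even on an empty stream
        self.finished = True
        if self.block_open:
            out.append(("content_block_stop", {"type": "content_block_stop", "index": 0}))
        # Engine-reported count when a usage frame arrived, delta count otherwise.
        usage = {"output_tokens": (self.usage or {}).get("completion_tokens", self.deltas)}
        if self.usage and "prompt_tokens" in self.usage:
            usage["input_tokens"] = self.usage["prompt_tokens"]
        out.append(("message_delta", {
            "type": "message_delta",
            "delta": {"stop_reason": self.stop_reason},
            "usage": usage,
        }))
        out.append(("message_stop", {"type": "message_stop"}))
        return out
```

Calling `finish()` on upstream EOF, whether or not a `[DONE]` frame arrived, is what lets a truncated stream still end with message_stop in this sketch.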

cortex-gateway gains anthropic_sse: the wire pump that splits the
upstream byte stream into SSE events, parses data: payloads
(leniently — engines omit fields on special frames), feeds the
translator, and frames results as `event:`/`data:` pairs through a
bounded channel (slow client back-pressures the upstream read).
Upstream truncation without [DONE] still closes the Anthropic event
sequence. Nothing is buffered beyond the current event's bytes.
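A minimal sketch of the splitting and framing steps, assuming events are separated by a blank line and that `[DONE]` is the upstream end sentinel. `SseSplitter` and `frame` are hypothetical names, and the bounded channel that provides back-pressure is not shown.

```python
import json


class SseSplitter:
    """Incremental SSE splitter; holds only the bytes of the unfinished event."""

    def __init__(self):
        self.buf = b""
        self.done = False   # set once the upstream [DONE] sentinel is seen

    def feed(self, data):
        """Accept one upstream read; return the JSON payloads it completed."""
        # Re-normalising the whole buffer also catches a CRLF split across reads.
        self.buf = (self.buf + data).replace(b"\r\n", b"\n")
        payloads = []
        while b"\n\n" in self.buf:
            raw, self.buf = self.buf.split(b"\n\n", 1)
            payload = self._parse(raw)
            if payload is not None:
                payloads.append(payload)
        return payloads

    def _parse(self, raw):
        lines = [l[5:].lstrip(b" ") for l in raw.split(b"\n") if l.startswith(b"data:")]
        body = b"\n".join(lines)
        if body == b"[DONE]":
            self.done = True
            return None
        try:
            parsed = json.loads(body)
        except ValueError:
            return None   # lenient: skip comments, keep-alives, malformed frames
        return parsed if isinstance(parsed, dict) else None


def frame(event, payload):
    """Frame one translated event as an `event:`/`data:` pair."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n".encode()
```

Each dict returned by `feed` would go to the translator, and each (event, payload) pair it yields would be passed through `frame` before being written to the client.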

Tests: 5 state-machine unit tests (text flow, stop-reason mapping +
defaults, tool_use blocks, usage propagation, idempotent finish) and
2 gateway integration tests (full event sequence + text reassembly,
usage propagation into message_delta). Validated end-to-end by
running this branch's gateway against a production neuron and
streaming a live Anthropic request.

Closes #24

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:47:30 +03:00
06e4ffc25c Merge pull request 'feat(bench): reproducible benchmark harness + first fleet numbers (#22)' (#32) from feat/22-benchmark-harness into main
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m27s
build-prerelease / Build cortex binary (push) Successful in 2m41s
build-prerelease / Package cortex RPM (push) Successful in 1m29s
build-prerelease / Test (push) Successful in 4m44s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 12:46:33 +00:00
a2e73a8907 feat(bench): reproducible batch-1 benchmark harness + first fleet numbers (#22)
All checks were successful
CI / Format (push) Successful in 40s
CI / Format (pull_request) Successful in 38s
CI / CUDA type-check (push) Successful in 2m8s
CI / CUDA type-check (pull_request) Successful in 2m8s
CI / Clippy (push) Successful in 2m23s
CI / Test (pull_request) Successful in 3m54s
CI / Test (push) Successful in 6m23s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 4m23s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
script/bench.py: stdlib-only, works against any OpenAI-compatible /v1
endpoint (helexa, llama.cpp, Ollama, vLLM) so cross-engine tables are
a concatenation via the --label column. Measures the operator-felt
trio per (model, prompt-size) cell: TTFT (first SSE content chunk),
decode tok/s (visible tokens over the first→last chunk window, counted
as one token per chunk, an engine invariant, because streaming usage
frames aren't emitted yet; see #31), total wall-clock. Medians over N
runs after one warmup; append-only JSONL for longitudinal tracking.

Measurement traps found against the live fleet and handled:
- thinking models burn the budget invisibly (reasoning deltas are
  off-wire by default) — the prompt appends Qwen's /no_think soft
  switch
- short coalesced replies collapse the decode window to one TCP read
  — rates require a ≥200 ms window and the prompt demands ~300 words
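The timing arithmetic described above might look like the sketch below. It is not taken from script/bench.py: the function names and the dict layout are hypothetical, and whether the first chunk counts toward the token total is a choice the message does not spell out (here it does).

```python
from statistics import median

MIN_WINDOW_S = 0.2   # rates need a decode window of at least 200 ms


def measure(start, chunk_times, end):
    """Timings for one run; all arguments are seconds on one monotonic clock."""
    if not chunk_times:
        return None
    window = chunk_times[-1] - chunk_times[0]
    # One chunk is counted as one visible token (the engine invariant).
    rate = len(chunk_times) / window if window >= MIN_WINDOW_S else None
    return {"ttft": chunk_times[0] - start, "tok_s": rate, "total": end - start}


def summarise(runs):
    """Median per metric over the runs after the first, which is the warmup."""
    kept = [r for r in runs[1:] if r is not None]
    summary = {}
    for key in ("ttft", "tok_s", "total"):
        values = [r[key] for r in kept if r[key] is not None]
        summary[key] = median(values) if values else None
    return summary
```

Returning `None` for the rate when the window is too short keeps a coalesced reply from producing an inflated tok/s figure; such runs simply drop out of the median.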

doc/benchmarks.md: method, fleet table, and the first published
numbers (2026-06-12, 8f6f1d3): 1.7B@3060 81 tok/s, 8B@4090 62 tok/s,
27B@2×5090 Q6K TP=2 35 tok/s with flat decode from 128→4k context —
and the 7.1 s 4k-prefill TTFT recorded as #23's before-number.

Refs #22 (competitor baselines still pending — the harness is ready
for them)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:39:13 +03:00
8f6f1d3205 feat(deploy): validate neuron capability after every deploy
Some checks failed
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Package cortex RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 10m36s
build-prerelease / Build cortex binary (push) Successful in 2m35s
build-prerelease / Test (push) Successful in 6m35s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
A deploy previously went green the moment systemd reported the
service started — a merge that broke model loading or inference
itself would deploy "successfully" and only surface when a human
noticed. Each neuron deploy now earns its green:

1. Wait for default models: poll /health until activation.state is
   ready, with per-host timeouts in the matrix (beast 900s for the
   27B Q6K TP=2 cold-load, benjy/quadbrat 300s). Any entry in
   activation.failed fails the deploy with the per-model error —
   the structured equivalent of watching the journal for
   "loaded default model", plus failure detail the journal line
   can't carry.
2. LLM smoke probe: ask the first loaded model to reply with one
   specific word (max_tokens 512 so thinking models have room,
   temperature 0) and grep the response for it. Not a quality bar —
   just proof the deploy didn't lobotomize inference.
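The two checks above can be sketched as follows, assuming /health returns JSON with an `activation` object whose `state` becomes `ready` and whose `failed` field is empty on success. The exact payload shape, the function names, and the injected `fetch` callable are assumptions made for illustration.

```python
import time


def classify(health):
    """Map one /health payload to 'ready', 'failed', or 'waiting'."""
    activation = health.get("activation") or {}
    failed = activation.get("failed")
    if failed:
        return "failed", failed          # carries the per-model error detail
    if activation.get("state") == "ready":
        return "ready", None
    return "waiting", None


def wait_ready(fetch, timeout_s, interval_s=5.0, sleep=time.sleep):
    """Poll fetch() until ready or failed, or until timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            verdict, detail = classify(fetch())
        except (OSError, ValueError):
            verdict, detail = "waiting", None   # service not answering yet
        if verdict != "waiting":
            return verdict, detail
        if time.monotonic() >= deadline:
            return "timeout", None
        sleep(interval_s)


def smoke_ok(reply_text, word):
    """The smoke probe passes if the requested word appears in the reply."""
    return word in reply_text
```

A caller would pass the per-host timeout from the matrix (900 for beast, 300 for benjy and quadbrat) and fail the deploy on any verdict other than `ready`.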

Hosts whose package is already current still skip everything — the
validation cost is only paid when a restart actually happened. The
probe was dry-run against benjy's production neuron before landing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:28:20 +03:00
b0d0b939af Merge pull request 'feat(gateway): per-request token metrics — TTFT and tok/s (#21)' (#30) from feat/gateway-21-token-metrics into main
Some checks failed
build-prerelease / Lint (fmt + clippy) (push) Blocked by required conditions
build-prerelease / Test (push) Blocked by required conditions
build-prerelease / Build cortex binary (push) Blocked by required conditions
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 12:25:32 +00:00
6a36d15ef1 feat(gateway): per-request token metrics — TTFT and tok/s (#21)
All checks were successful
CI / Format (push) Successful in 45s
CI / Format (pull_request) Successful in 37s
CI / CUDA type-check (push) Successful in 2m25s
CI / Clippy (push) Successful in 2m37s
CI / Test (push) Successful in 4m22s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m23s
CI / Test (pull_request) Successful in 4m19s
CI / CUDA type-check (pull_request) Successful in 1m57s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
This is the deferred Phase 6b, and it unblocks the 7→8 milestone's
benchmark work (#22): until cortex measures itself per request,
nothing downstream can be benchmarked or graphed.

The proxy wraps the upstream byte stream in a pass-through inspector
(TokenMetricsStream): chunks are forwarded verbatim — never buffered
or re-serialised — while the inspector records arrival times and
keeps a bounded (64 KiB) tail of the body text. At stream end (or
client disconnect, via Drop) it extracts the final OpenAI usage
object — present on the last SSE chunk and non-streaming JSON bodies
alike — for engine-truth token counts.

Per request, labelled {model, node}:
- cortex_time_to_first_token_seconds (histogram) — first body chunk
- cortex_tokens_per_second (histogram) — completion tokens over the
  decode window (first→last chunk); falls back to total request
  duration for single-chunk non-streaming bodies
- cortex_prompt_tokens_total / cortex_completion_tokens_total
  (counters)

The extractor is pure and chunk-boundary-safe; quoted-needle matching
keeps completion_tokens_details from shadowing completion_tokens,
and the last usage object wins. Covers chat completions, completions,
the Responses API, and the Anthropic streaming path (which currently
proxies OpenAI SSE).
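A rough sketch of the bounded tail and quoted-needle extraction described above. `UsageTail` and `last_count` are hypothetical names, and the regular expression stands in for whatever matching the real extractor uses.

```python
import re

TAIL_LIMIT = 64 * 1024   # keep at most 64 KiB of trailing body bytes


def last_count(text, key):
    """Value of the last `"key": <int>` occurrence, or None if absent."""
    # The closing quote is part of the needle, so "completion_tokens_details"
    # can never match a search for "completion_tokens".
    matches = re.findall(r'"%s"\s*:\s*(\d+)' % re.escape(key), text)
    return int(matches[-1]) if matches else None


class UsageTail:
    """Pass-through observer: remembers the tail, never alters the chunks."""

    def __init__(self):
        self.tail = b""

    def push(self, chunk):
        # Searching the joined tail, not single chunks, is what makes the
        # extraction safe when a key or number is split across reads.
        self.tail = (self.tail + chunk)[-TAIL_LIMIT:]

    def usage(self):
        """(prompt_tokens, completion_tokens) from the last usage seen."""
        text = self.tail.decode("utf-8", errors="replace")
        return last_count(text, "prompt_tokens"), last_count(text, "completion_tokens")
```

`usage()` is meant to be read once, at stream end or on drop, since a number cut by a chunk boundary is only complete after the next read arrives.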

Tests: 4 extractor unit tests; integration test with a streaming
mock emitting a stream_options-style final usage chunk, asserting
both histograms and exact-or-greater counter values (the test
recorder is process-global and shared across the binary's tests).

Closes #21

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:11:52 +03:00
b463439416 Merge pull request 'feat(neuron): startup preflight for NVIDIA driver/library mismatch (#19)' (#29) from feat/neuron-19-driver-preflight into main
Some checks failed
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m11s
build-prerelease / Build cortex binary (push) Successful in 2m33s
build-prerelease / Test (push) Successful in 4m24s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-blackwell (push) Successful in 10m18s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 12:08:20 +00:00
716558c8ff feat(neuron): startup preflight for NVIDIA driver/library mismatch (#19)
All checks were successful
CI / Format (push) Successful in 38s
CI / Format (pull_request) Successful in 38s
CI / CUDA type-check (push) Successful in 2m11s
CI / Clippy (push) Successful in 2m13s
CI / Clippy (pull_request) Successful in 2m37s
CI / Test (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 3m56s
CI / CUDA type-check (pull_request) Successful in 1m44s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The un-rebooted driver update (userspace libs bumped, kernel module
still old) kills every CUDA call on the host including nvidia-smi,
and neuron surfaced it only as `Comm::from_rank ... NcclError` deep
inside the first model load — 30 minutes of forensics on beast
(2026-06-08) to diagnose. Make it instantly legible instead:

- discovery distinguishes nvidia-smi absent (CPU-only, fine) from
  present-but-failing, classifies the "Driver/library version
  mismatch" signature, and pairs the userspace NVML version with the
  loaded kernel-module version from /proc/driver/nvidia/version.
- DiscoveryResponse gains `cuda_unavailable_reason` (omitted when
  None — wire-compatible) so cortex can see why the node has no
  devices and route around it.
- startup logs one loud ERROR line with the actionable reason
  ("reboot the host to reload the kernel module") and skips default
  model loads entirely, marking each failed with that reason so
  /health activation shows the real cause.
- POST /models/load fast-rejects with 503 + code=cuda_unavailable on
  a mismatch host instead of dying minutes later in cuInit/NCCL.

No false positives: other nvidia-smi failures (no devices, perms)
keep their existing behaviour, CPU-only hosts stay silent.
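The classification rule above can be sketched like this. The function names and the wording of the returned reason are hypothetical, and the assumed formats of the nvidia-smi output and of /proc/driver/nvidia/version reflect commonly seen driver output rather than anything quoted in this commit.

```python
import re

MISMATCH = "Driver/library version mismatch"


def kernel_module_version(proc_text):
    """Loaded kernel-module version from /proc/driver/nvidia/version text."""
    m = re.search(r"Kernel Module(?: for \S+)?\s+(\d+(?:\.\d+)+)", proc_text)
    return m.group(1) if m else None


def cuda_unavailable_reason(smi_found, smi_exit, smi_output, proc_text=""):
    """Return an actionable reason string, or None when nothing should change."""
    if not smi_found:
        return None   # CPU-only host: fine, stay silent
    if smi_exit == 0 or MISMATCH not in smi_output:
        return None   # healthy, or a different failure that keeps its old handling
    nvml = re.search(r"NVML library version:\s*(\S+)", smi_output)
    userspace = nvml.group(1) if nvml else "unknown"
    loaded = kernel_module_version(proc_text) or "unknown"
    return (f"NVIDIA driver/library version mismatch: userspace NVML {userspace}, "
            f"loaded kernel module {loaded}; reboot the host to reload the kernel module")
```

Only the mismatch signature produces a reason; every other outcome returns `None`, which is how the sketch mirrors the no-false-positives rule.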

Closes #19

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:00:00 +03:00
112e4e124a fix(ci): export RUSTC_WRAPPER in the build step itself — GITHUB_ENV doesn't propagate
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m22s
build-prerelease / Build cortex binary (push) Successful in 2m20s
build-prerelease / Test (push) Successful in 3m50s
build-prerelease / Build neuron-blackwell (push) Successful in 10m10s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ada (push) Successful in 14m29s
build-prerelease / Build neuron-ampere (push) Successful in 14m31s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Run 375 proved the CUDA image ships sccache (probe step printed
"sccache enabled") but the wrapper never reached cargo: the runner
does not propagate GITHUB_ENV across steps, so the builds ran
unwrapped (server stats: 4 compile requests for a ~600-crate build,
durations unchanged). Probe and export inside the build step's own
shell instead, in both build-neuron and ci.yml's cuda-check.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:50:25 +03:00
dc6feec6dc fix(deploy): gate on the publish manifest, not unprivileged dnf check-update
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Build cortex binary (push) Successful in 2m18s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m33s
build-prerelease / Test (push) Successful in 4m20s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-blackwell (push) Successful in 9m46s
build-prerelease / Build neuron-ampere (push) Successful in 13m57s
build-prerelease / Build neuron-ada (push) Successful in 15m29s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s
The f5fa840 deploy exposed both failure modes of gating with
`dnf check-update` as the gitea_ci user in one run: it hung
indefinitely on quadbrat (blocked process, 0 CPU, killed manually),
and on benjy/beast it silently reported "no updates" two minutes
after new RPMs were published — both hosts skipped a real (luckily
binary-identical) update.

Gate with data we own instead: fetch packages.json from
rpm.lair.cafe (plain curl, no privileges, no dnf locks), take the
newest release per package by buildTime, and skip the
stop/upgrade/start cycle only when it exactly equals
`rpm -q %{VERSION}-%{RELEASE}`. Unreachable or unparsable manifest
fails open to a full deploy. The dnf transaction itself still runs
under the scoped sudoers rules, unchanged.
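A minimal sketch of the gate, assuming packages.json is a JSON array of entries with `name`, `version`, `release`, and `buildTime` fields whose `buildTime` values sort chronologically. Apart from `buildTime`, which the message names, that layout and the function names are assumptions.

```python
import json


def newest_release(manifest_text, package):
    """VERSION-RELEASE of the most recently built entry for `package`."""
    entries = json.loads(manifest_text)
    mine = [e for e in entries if e["name"] == package]
    if not mine:
        raise ValueError(f"{package} not in manifest")
    top = max(mine, key=lambda e: e["buildTime"])
    return f'{top["version"]}-{top["release"]}'


def can_skip_deploy(manifest_text, package, installed):
    """True only when the manifest's newest release is exactly what is installed."""
    try:
        return newest_release(manifest_text, package) == installed
    except (ValueError, KeyError, TypeError):
        return False   # unreachable or unparsable manifest fails open: deploy
```

Here `installed` would be the string printed by an `rpm -q` query formatted as `%{VERSION}-%{RELEASE}`, and any `False` leads to the full stop/upgrade/start cycle.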

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:20:21 +03:00
02f20bc9e1 Merge pull request 'feat: keep auto-recovering models visible as recovering (#20)' (#28) from feat/neuron-20-recovering-status into main
Some checks failed
build-prerelease / Test (push) Blocked by required conditions
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m39s
build-prerelease / Build cortex binary (push) Successful in 3m46s
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 11:15:38 +00:00
2a231e49de merge main (sccache enablement supersedes branch cuda-check pin)
All checks were successful
CI / Format (push) Successful in 40s
CI / Format (pull_request) Successful in 37s
CI / Clippy (push) Successful in 2m17s
CI / CUDA type-check (push) Successful in 2m39s
CI / CUDA type-check (pull_request) Successful in 2m30s
CI / Test (push) Successful in 4m51s
CI / Clippy (pull_request) Successful in 2m12s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m49s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
# Conflicts:
#	.gitea/workflows/ci.yml
2026-06-12 14:05:55 +03:00
2dadea5d8d ci: enable sccache on the build jobs (conditional on the CUDA image)
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 34s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m57s
build-prerelease / Test (push) Has been cancelled
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
The 3 CUDA flavour builds (10-14 min each, the critical path of every
full run) and build-cortex compiled entirely uncached. With the
gongfoo-side sccache hardening in place, wire them up:

- build-cortex: full sccache env (rust image ships it) + the standard
  escalation loop (retry -> server restart -> uncached final attempt).
- build-neuron: probe for sccache before enabling the wrapper — the
  CUDA image may not ship it, and a missing binary must degrade to an
  uncached build, not fail cargo at `sccache rustc -vV` (the original
  reason the wrapper was cleared here). rustc compilations are shared
  across all three flavours; candle-kernels' nvcc output stays
  uncached (build-script artifact).
- ci.yml cuda-check: same probe pattern replaces the blanket env
  clear; also pins CUDA_COMPUTE_CAP=86 since the image no longer
  ships nvidia-smi for candle-kernels' fallback detection (mirrors
  9bb9678 on the #20 branch).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:05:26 +03:00
9bb9678f93 fix(ci): pin CUDA_COMPUTE_CAP in cuda-check — builder image has no nvidia-smi
All checks were successful
CI / Format (push) Successful in 37s
CI / Format (pull_request) Successful in 38s
CI / CUDA type-check (push) Successful in 1m45s
CI / Clippy (push) Successful in 2m24s
CI / Clippy (pull_request) Successful in 2m19s
CI / Test (push) Successful in 4m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m35s
CI / CUDA type-check (pull_request) Successful in 1m50s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
candle-kernels' build script shells out to nvidia-smi for compute-cap
detection when CUDA_COMPUTE_CAP is unset; the current GPU-less builder
image doesn't ship it, so the type-check died in the build script
before borrow-checking anything. Pin an arbitrary valid cap — the
check is feature-gate compilation only; real caps live in
build-prerelease.yml's flavour matrix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:55:23 +03:00
df9c490614 feat(neuron+gateway): keep auto-recovering models visible as recovering (#20)
Some checks failed
CI / Format (push) Successful in 37s
CI / CUDA type-check (pull_request) Failing after 28s
CI / Format (pull_request) Successful in 37s
CI / Clippy (push) Successful in 2m54s
CI / Clippy (pull_request) Successful in 3m36s
CI / Test (push) Successful in 4m37s
CI / Test (pull_request) Successful in 5m20s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
CI / CUDA type-check (push) Failing after 31s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
During the #17 auto-recovery window (unload → reload, minutes for a
large TP model) the model's registry slot is absent, so it vanished
from neuron's /models — and cortex, routing by /models presence,
answered "model not found on any node" while a direct request to
neuron would have correctly said "recovering, retry shortly".

neuron: the recovery set becomes a map carrying a devices/capabilities
snapshot taken at trigger time (while the registry slot still exists).
list_models reports `recovering` for models in the set — both while
the poisoned slot is still present and during the reload gap, where
the snapshot keeps the model listed.

gateway: ModelStatus grows a Recovering variant (parsed from the
wire); the router holds the route — new RouteError::ModelRecovering
mapped to 503 instead of 404 — and deliberately does not fall through
to the catalogue cold-load, which would race a second placement
against the in-flight recovery. The evictor already ignores
non-Loaded entries.
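The routing decision can be illustrated with a small sketch. It models each node's /models listing as a dict of model name to status string; `route` and its return tuples are hypothetical and not the gateway's actual types.

```python
def route(model, listings):
    """Decide what to do with a request for `model`.

    listings maps node name -> {model name: status}, as polled from each
    neuron's /models endpoint.
    """
    recovering = False
    for node, models in listings.items():
        status = models.get(model)
        if status == "loaded":
            return ("proxy", node)
        if status == "recovering":
            recovering = True
    if recovering:
        # Hold the route: answer 503 retry-shortly and do NOT cold-load,
        # which would race a second placement against the recovery.
        return ("reject", 503)
    return ("cold_load_or_404", None)
```

The point of the middle branch is ordering: the recovering check runs before any fall-through to the catalogue, so a model in its reload gap is neither reported missing nor placed twice.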

Tests: neuron unit test (recovering model stays listed with snapshot),
gateway integration tests (poller parses `recovering`; request gets
503 retry-shortly and the model stays on /v1/models).

Closes #20

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:42:03 +03:00
f5fa840dfb ci: escalate sccache retries — restart server, then fall back uncached
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m6s
build-prerelease / Test (push) Successful in 4m50s
build-prerelease / Build cortex binary (push) Successful in 3m45s
build-prerelease / Build neuron-blackwell (push) Successful in 9m59s
build-prerelease / Build neuron-ada (push) Successful in 14m11s
build-prerelease / Build neuron-ampere (push) Successful in 14m13s
build-prerelease / Package cortex RPM (push) Successful in 1m30s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m28s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m54s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Run 361's Test job failed all 3 attempts with the sccache
dead-server signature (sccache fatal error, ENOENT on its own tmp
files under target/debug/deps). Retrying the same invocation only
helps for transient races; against a wedged server every same-VM
retry fails identically — and under the new pipeline that blocks
publish and the deploy behind it.

Escalate instead: attempt 1 plain, attempt 2 after an sccache server
restart, attempt 3 with RUSTC_WRAPPER unset (uncached). A sick cache
now costs build minutes, never the deploy. Applied to the lint/test
jobs in build-prerelease.yml and ci.yml alike.
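The three-step escalation can be sketched as below. The workflows themselves are YAML with shell steps, so this Python version only illustrates the control flow; `build_with_escalation` and the injectable `run` parameter are hypothetical.

```python
import os
import subprocess


def build_with_escalation(cmd, run=subprocess.run, env=None):
    """Run cmd up to three times, escalating between attempts.

    Returns the number of the attempt that succeeded.
    """
    env = dict(os.environ if env is None else env)
    # Attempt 1: plain.
    if run(cmd, env=env).returncode == 0:
        return 1
    # Attempt 2: after restarting the sccache server.
    run(["sccache", "--stop-server"], env=env)
    run(["sccache", "--start-server"], env=env)
    if run(cmd, env=env).returncode == 0:
        return 2
    # Attempt 3: uncached, with the compiler wrapper removed.
    env.pop("RUSTC_WRAPPER", None)
    if run(cmd, env=env).returncode == 0:
        return 3
    raise RuntimeError("build failed on all three attempts")
```

A call such as `build_with_escalation(["cargo", "test"])` would cost extra build minutes when the cache is wedged but would still reach an uncached attempt before failing.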

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:24:02 +03:00
7557c5e877 ci: cut iteration latency — change-aware builds, gated deploys, dev fast path
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 28s
build-prerelease / Test (push) Failing after 1m16s
build-prerelease / Lint (fmt + clippy) (push) Successful in 3m7s
build-prerelease / Build cortex binary (push) Successful in 3m57s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Push-to-testable was ~20.5 min for every commit (measured on the
2026-06-08 green chain) plus a ~5 min 27B cold-load, regardless of
what changed. Four structural fixes:

- build-prerelease: a change-detection step in `prepare` diffs HEAD
  against the git sha embedded in the last *published* unstable RPM
  (per package, from packages.json) and skips builds whose inputs
  didn't change. Docs-only commits build nothing; gateway-only
  commits skip the 3 CUDA flavour builds. Detection failures fall
  open to a full build.
- ci.yml no longer runs on pushes to main; fmt/clippy/test live in
  build-prerelease as parallel jobs gating publish. The two workflows
  previously queued against each other on the same runner labels,
  delaying the cortex build ~12 min. Branches, PRs, and tags keep the
  full ci.yml gate.
- deploy: each host self-gates with `dnf check-update` and leaves the
  service untouched when the installed package is already current —
  no more neuron restarts (and 27B cold-loads) for commits that
  didn't change neuron.
- deploy-dev (new): manual single-host fast path — build one CUDA
  flavour, scp the binary, restart the service. Skips packaging,
  signing, publish, and dnf entirely. Backed by a new exact-form
  sudoers rule in asset/sudoers.d/neuron-host.conf (already applied
  to all three hosts).

Expected loop times when runners behave: docs ≈ 1 min (nothing
deploys), gateway-only ≈ 6-8 min, single-neuron dev ≈ 8-10 min,
full fleet ≈ 13-15 min.
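The change-detection rule in the build-prerelease bullet above might reduce to something like the sketch below. The `INPUTS` path prefixes are invented for illustration (the real input sets are not listed in this message), and the per-package diff against the published sha is assumed to have been computed already.

```python
# Hypothetical path prefixes per package; not taken from the repository.
INPUTS = {
    "cortex": ("crates/cortex-", "Cargo.lock"),
    "neuron": ("crates/neuron", "Cargo.lock"),
}


def packages_to_build(changed):
    """Pick the packages whose inputs changed.

    changed maps package -> list of paths changed since the sha embedded in
    that package's last published RPM, or None when detection failed.
    """
    build = set()
    for package, prefixes in INPUTS.items():
        paths = changed.get(package)
        if paths is None:
            build.add(package)   # detection failure falls open to a build
        elif any(path.startswith(prefixes) for path in paths):
            build.add(package)
    return build
```

Under these assumed prefixes a docs-only commit yields an empty set, and a gateway-only commit yields just `cortex`, which matches the skips described in the bullet.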

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:17:22 +03:00
91e95ca979 docs: rewrite README around project positioning
Some checks failed
CI / CUDA type-check (push) Failing after 46s
CI / Format (push) Successful in 47s
CI / Clippy (push) Successful in 2m53s
CI / Test (push) Successful in 4m31s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 39s
build-prerelease / Build cortex binary (push) Successful in 3m52s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-blackwell (push) Successful in 11m34s
build-prerelease / Build neuron-ampere (push) Successful in 15m31s
build-prerelease / Build neuron-ada (push) Successful in 15m37s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Lead with what helexa is for — near-frontier open-weight models on
consumer hardware you own — instead of a feature list. Adds the scope
section (intentional divergence from vLLM/SGLang; CUDA-only today as a
test-coverage constraint, not a principle), an engine section covering
the per-device worker threads and consumer-GPU tensor parallelism, the
previously-missing helexa-acp crate, and a status section pointing at
git.lair.cafe as the source of truth with GitHub as read-only mirror.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 11:37:00 +03:00
1a74cb0c56 chore: rename repo cortex -> helexa
Some checks failed
CI / CUDA type-check (push) Failing after 30s
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Format (push) Successful in 32s
build-prerelease / Build neuron-blackwell (push) Failing after 31s
build-prerelease / Build neuron-ada (push) Failing after 34s
build-prerelease / Build neuron-ampere (push) Failing after 38s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
CI / Clippy (push) Failing after 1m11s
build-prerelease / Build cortex binary (push) Successful in 3m47s
CI / Test (push) Successful in 5m32s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
helexa is the project; cortex (per-operator control plane / LLM proxy)
and neuron (per-host LLM harness) are its components. The Gitea repo
is now helexa/helexa. Update repository URLs in Cargo metadata, RPM
specs, and docs; make the CI changelog push URL rename-proof via the
github.repository context; reframe README.md and CLAUDE.md around the
project name. Binary, package, service, and config-path names are
unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 10:54:01 +03:00
60f5598542 build(neuron): bump cudarc fork to 63327a2 (idempotent abort + Comm Send+Sync)
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / CUDA type-check (push) Successful in 31s
CI / Format (push) Successful in 35s
CI / Test (push) Failing after 1m9s
CI / Clippy (push) Successful in 2m36s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m10s
build-prerelease / Build neuron-ampere (push) Successful in 7m35s
build-prerelease / Build neuron-ada (push) Successful in 5m7s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m14s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Build cortex binary (push) Successful in 4m33s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
The fork's new commit makes `Comm: Send + Sync` (asserting NCCL's
thread-safety invariant upstream) and makes `Comm::abort` idempotent via
an `aborted` flag (so abort-then-Drop can't double-free) — strictly
better than the previous Drop-no-panic workaround, and the `abort()`
signature is unchanged so the watchdog call site is unaffected.
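
A minimal sketch of the idempotent-abort idea described above, using only the standard library. `FakeComm` and its teardown counter are hypothetical stand-ins, not the cudarc fork's types; the point is only that an atomic `aborted` flag lets abort-then-Drop run the teardown once.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a communicator handle (not the cudarc type).
pub struct FakeComm {
    aborted: AtomicBool,
    /// Shared counter recording how many times the real teardown ran.
    teardowns: Arc<AtomicUsize>,
}

impl FakeComm {
    pub fn new(teardowns: Arc<AtomicUsize>) -> Self {
        Self { aborted: AtomicBool::new(false), teardowns }
    }

    /// Idempotent abort: only the first caller performs the teardown.
    pub fn abort(&self) {
        // `swap` returns the previous value, so exactly one caller sees `false`.
        if !self.aborted.swap(true, Ordering::AcqRel) {
            self.teardowns.fetch_add(1, Ordering::SeqCst);
        }
    }
}

impl Drop for FakeComm {
    fn drop(&mut self) {
        // Drop goes through the same guarded path, so abort-then-Drop is safe.
        self.abort();
    }
}
```

Calling `abort()` any number of times and then dropping the value leaves the shared counter at 1.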

Because `Comm` is now `Send + Sync`, `Arc<Comm>` and the `SendComm` /
`NcclState` wrappers auto-derive `Send`/`Sync`, which conflicts (E0119)
with neuron's manual `unsafe impl`s. Remove the four now-redundant impls
— the safety assertion lives upstream in cudarc where it belongs. The
conflict is in cuda-gated code, so only the CUDA type-check catches it
(non-cuda build + clippy + tests stay green).
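
To illustrate the auto-trait propagation this relies on, here is a small sketch with hypothetical types that only share names with the real ones. It shows the propagation half of the story (wrappers become `Send + Sync` once the inner type is), not the E0119 conflict itself.

```rust
use std::sync::Arc;

/// Hypothetical raw-handle type; the raw pointer makes it neither Send nor
/// Sync by default.
pub struct Comm(pub *mut u8);

// The thread-safety claim is asserted once, on the type that owns the handle.
unsafe impl Send for Comm {}
unsafe impl Sync for Comm {}

/// Wrappers built only from Send + Sync fields receive both auto traits
/// without any manual impl.
pub struct SendComm(pub Arc<Comm>);

pub struct NcclState {
    pub comm: SendComm,
    pub rank: usize,
}

/// Compile-time check: this only instantiates for Send + Sync types.
pub fn assert_send_sync<T: Send + Sync>() -> bool {
    true
}
```

If `Comm` lost its two impls, `assert_send_sync::<NcclState>()` would stop compiling, which is the property the wrappers inherit.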

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:33:14 +03:00
7945240646 chore: re-trigger deploy (#17 Stage 2, attempt 3)
All checks were successful
CI / CUDA type-check (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build cortex binary (push) Successful in 4m45s
build-prerelease / Build neuron-blackwell (push) Successful in 5m50s
CI / Test (push) Successful in 6m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 8m38s
build-prerelease / Build neuron-ada (push) Successful in 5m36s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
No code change. On each deploy run, the degraded CI runner kills a different
single arch build (blackwell, then ada) almost immediately, and the
all-arch-gated packaging skips → no publish. Every arch HAS built green
across runs (blackwell in 342, ampere, ada in 339) and the gate + CUDA
type-check pass. Re-running to catch all three green in one run so the
Stage-2 RPMs publish. Runner FS/cache health is the real fix (separate
infra work).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:06:04 +03:00
0c74d89d15 chore: re-trigger deploy (#17 Stage 2)
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 30s
build-prerelease / Build neuron-ada (push) Failing after 51s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Build neuron-blackwell (push) Successful in 6m32s
build-prerelease / Build neuron-ampere (push) Successful in 7m42s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
CI / Test (push) Successful in 6m6s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
No code change. The c94a2ae deploy's neuron-blackwell build died ~12min
into the Blackwell kernel compile on the degraded runner, while
neuron-ampere + neuron-ada built the identical Rust + patched cudarc
cleanly and the CUDA type-check passed. Transient infra; re-running to
get a healthy blackwell build so the RPMs publish and beast (Blackwell)
picks it up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:45:16 +03:00
c94a2ae755 fix(neuron): correct nccl_state path on WorkerPool.leader_comm (#17 S2)
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 44s
build-prerelease / Build cortex binary (push) Successful in 4m57s
build-prerelease / Package cortex RPM (push) Successful in 1m36s
CI / Test (push) Successful in 7m10s
CI / Clippy (push) Failing after 1m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m40s
build-prerelease / Build neuron-ada (push) Successful in 9m5s
build-prerelease / Build neuron-blackwell (push) Failing after 12m2s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
`super::nccl_state` from tp/mod.rs resolves to `crate::harness::nccl_state`
(nonexistent); the module is the child `nccl_state` (cf. the existing
`nccl_state::generate_comm_id_hex` call). The field is cuda-gated so the
non-cuda build couldn't catch it; the branch CUDA type-check flaked on the
runner before compiling. Self-audited fix.
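
The path mix-up can be reproduced with a tiny module tree. The layout below is a hypothetical stand-in for `harness/tp/mod.rs`; only the `nccl_state` name comes from the commit.

```rust
pub mod harness {
    pub mod tp {
        // Stand-in for tp/mod.rs: `nccl_state` is a child of `tp`.
        pub mod nccl_state {
            pub fn location() -> &'static str {
                "harness::tp::nccl_state"
            }
        }

        /// Correct: a plain relative path reaches the child module.
        pub fn via_child() -> &'static str {
            nccl_state::location()
        }

        /// Equivalent and more explicit.
        pub fn via_self() -> &'static str {
            self::nccl_state::location()
        }

        // `super::nccl_state::location()` would look in `harness`, the parent
        // of `tp`, where no such module exists, and would fail to compile.
    }
}
```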

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:21:43 +03:00
99920dd322 feat(neuron): TP step watchdog aborts wedged collectives (#17 Stage 2)
Some checks failed
CI / CUDA type-check (push) Failing after 47s
CI / Format (push) Successful in 31s
CI / Test (push) Failing after 1m3s
CI / Clippy (push) Successful in 2m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Make a hung NCCL collective recoverable instead of a permanent brick.
Today a wedged collective hangs the in-process leader thread forever, and
even Stage 1's recovery can't help — its unload's DropTp queues behind the
stuck thread and hangs too.

- Cache the leader's NCCL Comm handle async-side at init (new cuda-gated
  Job::GetLeaderComm → DeviceWorkerHandle::get_leader_comm → stored on
  WorkerPool.leader_comm). Fetched while the thread is responsive — a
  wedged thread can't service the fetch, which is why it's cached up front.
- Wrap the leader forward in both generate_step and
  generate_step_with_images in tokio::time::timeout (default 120s,
  NEURON_TP_STEP_TIMEOUT_S). On expiry the watchdog calls
  Comm::abort() (ncclCommAbort) on the cached handle from the async
  thread — the one NCCL op sanctioned concurrently with an in-flight
  collective — which unblocks the leader thread, then fails the step
  WITHOUT draining (workers are wedged too; recovery's unload kills them).
  The error is a device fault → poison → Stage 1 auto-recovery, which now
  completes because the leader thread is responsive again.
- Bumps the cudarc patch to dbc425a (adds the Drop-must-not-panic fix so
  the post-abort comm teardown during recovery doesn't double-abort-panic).
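
The timeout-then-abort flow in the list above can be sketched with the standard library alone. This is an analogue under stated assumptions: it uses `mpsc::recv_timeout` in place of `tokio::time::timeout`, an `AtomicBool` in place of `Comm::abort()`, and a simulated wedged step; every name here is hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum StepError {
    TimedOut,
}

/// Runs `step` on a leader thread and waits at most `timeout` for its result.
/// `step` receives the abort flag and must return once the flag is set; this
/// models a collective that only unblocks after the comm is aborted.
pub fn run_step_with_watchdog<F>(timeout: Duration, step: F) -> Result<u32, StepError>
where
    F: FnOnce(&AtomicBool) -> u32 + Send + 'static,
{
    let abort = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel();
    let leader_abort = Arc::clone(&abort);
    let leader = thread::spawn(move || {
        // The receiver may have given up already; ignore a failed send.
        let _ = tx.send(step(&leader_abort));
    });

    let outcome = match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        // Timeout (or a dead leader): abort from this thread, then fail the step.
        Err(_) => {
            abort.store(true, Ordering::Release);
            Err(StepError::TimedOut)
        }
    };
    // After the abort the leader is responsive again, so the join completes.
    let _ = leader.join();
    outcome
}

/// Simulated wedged collective: spins until aborted.
pub fn wedged_step(abort: &AtomicBool) -> u32 {
    while !abort.load(Ordering::Acquire) {
        thread::sleep(Duration::from_millis(1));
    }
    0
}
```

The sketch joins the leader thread only to show that it becomes responsive again after the abort; it does not model draining workers or the recovery that follows.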

Logs the whole sequence at ERROR with greppable `tp watchdog:` /
`ncclCommAbort` markers so a real-world hang leaves a forensic trail —
verification is by inspecting journals after real hangs, not a synthetic
harness. cuda-gated → validated by the blackwell build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:15:29 +03:00
c4f239ceb9 build(neuron): patch cudarc to expose Comm::abort/get_async_error (#17 Stage 2)
All checks were successful
CI / CUDA type-check (push) Successful in 33s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m34s
CI / Test (push) Successful in 6m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
#17 Stage 2 (TP hang-recovery) needs to call ncclCommAbort on a LIVE
communicator from another thread — to unblock a collective wedged on a
dead/hung peer so the ranks can resync. No cudarc release (incl. main)
exposes this: the safe Comm only aborts in Drop, which can't fire while a
stuck thread holds an Arc<Comm> clone.
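
Why a Drop-only abort cannot help here can be shown with plain `Arc` semantics. `Handle` is a hypothetical type that counts its drops; the "stuck" clone stands in for the one held by a wedged thread.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical handle whose Drop impl records that it ran.
pub struct Handle {
    drops: Arc<AtomicUsize>,
}

impl Drop for Handle {
    fn drop(&mut self) {
        self.drops.fetch_add(1, Ordering::SeqCst);
    }
}

/// Returns (drops seen while a clone is still alive, drops after it is gone).
pub fn drop_counts_with_outstanding_clone() -> (usize, usize) {
    let drops = Arc::new(AtomicUsize::new(0));
    let owner = Arc::new(Handle { drops: Arc::clone(&drops) });
    // Imagine a wedged thread holding this clone.
    let stuck_clone = Arc::clone(&owner);

    drop(owner);
    let while_held = drops.load(Ordering::SeqCst);
    drop(stuck_clone);
    let after_release = drops.load(Ordering::SeqCst);
    (while_held, after_release)
}
```

The function returns `(0, 1)`: dropping the owner does nothing while another clone exists, so an explicit abort method is needed to act on a live handle.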

Pin neuron's cudarc 0.19.7 to a fork (grenade/cudarc @ nccl-comm-abort,
rev 4dff0be) adding three thin methods — Comm::abort, get_async_error,
and a raw comm() accessor — to be submitted upstream. The patch targets
0.19.x only; candle's transitive cudarc 0.17.8 stays on crates.io.

Foundation only; the watchdog + abort + comm-rebuild that consume these
land in follow-up commits (cuda-gated → validated by the blackwell build).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 13:49:59 +03:00
ac445c1569 chore: re-trigger deploy (#17 Stage 1)
Some checks failed
CI / CUDA type-check (push) Failing after 19s
CI / Format (push) Successful in 37s
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Clippy (push) Successful in 3m54s
build-prerelease / Build cortex binary (push) Successful in 4m43s
CI / Test (push) Successful in 6m35s
build-prerelease / Build neuron-blackwell (push) Successful in 5m58s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 8m10s
build-prerelease / Build neuron-ada (push) Successful in 5m21s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
No code change. The abc6e60 deploy's neuron-ada build died on the
degraded CI runner (container dropped mid-checkout), skipping the
gated publish — even though neuron-blackwell + neuron-ampere compiled
the Stage-1 fault-recovery code cleanly. Re-running to get a healthy
ada build so the RPMs publish and beast picks up the build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:34:20 +03:00
abc6e605b8 test(neuron): NEURON_DEBUG_POISON hook to verify auto-recovery (#17)
Some checks failed
CI / CUDA type-check (push) Failing after 19s
build-prerelease / Resolve version stamps (push) Successful in 43s
CI / Format (push) Successful in 50s
CI / Clippy (push) Failing after 57s
build-prerelease / Build neuron-ada (push) Failing after 48s
build-prerelease / Build cortex binary (push) Successful in 5m5s
build-prerelease / Build neuron-blackwell (push) Successful in 6m38s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 7m27s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
CI / Test (push) Successful in 10m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
One-shot, env-gated fault injector for beast verification: when
NEURON_DEBUG_POISON names a model, the first request for it triggers the
auto-recovery path as if a device fault had occurred — exercising
unload→reload→healthy without corrupting the GPU. Latched so it fires
exactly once (no recovery loop). No-op unless the env var is set; wired
into both the single-GPU and TP chat poison gates.
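
A one-shot, env-gated latch of this kind might look like the sketch below. `PoisonHook` and its methods are hypothetical names; the caller is assumed to pass in whatever it read from `NEURON_DEBUG_POISON`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical one-shot fault injector.
pub struct PoisonHook {
    target: Option<String>,
    fired: AtomicBool,
}

impl PoisonHook {
    /// `env_value` is what the caller read from the environment, for example
    /// `std::env::var("NEURON_DEBUG_POISON").ok()`.
    pub fn new(env_value: Option<String>) -> Self {
        // An unset or empty variable disables the hook entirely.
        let target = env_value.filter(|v| !v.is_empty());
        Self { target, fired: AtomicBool::new(false) }
    }

    /// True exactly once, for the first request naming the target model.
    pub fn should_fire(&self, model_id: &str) -> bool {
        match &self.target {
            // `swap` latches the flag; only the first matching call sees `false`.
            Some(t) if t.as_str() == model_id => !self.fired.swap(true, Ordering::AcqRel),
            _ => false,
        }
    }
}
```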

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:08:40 +03:00
4f2957af9e feat(neuron): auto-recover poisoned models (#17 Stage 1c)
When an inference hit a device fault, the model was flagged poisoned and
every subsequent request rejected with "unload and reload the model to
recover" — until a *human* did exactly that. Now the harness rebuilds the
context automatically.

- Retain the loading `ModelSpec` on `LoadedModel`/`TpLoadedModel` (+
  `LoadedHandle::spec()`) so a poisoned model can be reloaded without an
  operator reconstructing the spec.
- A background recovery task (held via `Weak<CandleHarness>`, spawned in
  `new()` when a runtime is present) drains poisoned model ids and runs
  `unload_model` → `load_model(spec)`. Unload drops the model → cudarc
  `Comm::drop` aborts NCCL + releases the context; reload re-runs NCCL
  init + sanity inside the load path, so a successful reload yields a
  fresh, healthy model. A failed reload leaves it unloaded (next load
  retries) — never poisoned forever.
- The request-entry poison gates now `trigger_recovery` (single-flight
  per model via a `recovering` set) and return a transient "recovering,
  retry shortly" error instead of the manual-reload message. Requests
  that arrive during the brief reload gap (model absent from the registry)
  also get "recovering" rather than a misleading "not loaded".
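
The single-flight behaviour of `trigger_recovery` can be sketched with a mutex-guarded set. The `RecoveryGate` type and `finish_recovery` method are hypothetical; only the `recovering` set and the `trigger_recovery` name come from the commit.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Hypothetical gate tracking which models have a recovery in flight.
#[derive(Default)]
pub struct RecoveryGate {
    recovering: Mutex<HashSet<String>>,
}

impl RecoveryGate {
    /// Returns true if this call claimed the recovery and should enqueue it;
    /// false if one is already in flight for the model.
    pub fn trigger_recovery(&self, model_id: &str) -> bool {
        // `insert` returns false when the id was already present.
        self.recovering.lock().unwrap().insert(model_id.to_string())
    }

    /// Called by the background task when unload + reload has finished.
    pub fn finish_recovery(&self, model_id: &str) {
        self.recovering.lock().unwrap().remove(model_id);
    }

    pub fn is_recovering(&self, model_id: &str) -> bool {
        self.recovering.lock().unwrap().contains(model_id)
    }
}
```

A request-entry gate would call `trigger_recovery` and return the transient "recovering" error regardless of whether it was the caller that claimed the slot.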

`new()` now returns `Arc<Self>`. Recovery runs only on the background
task — never inline on the request path, which holds `inference_lock`
and would deadlock on the `models` write lock.

Stage 1c of the #17 plan (verified-healthy auto-recovery). Watchdog
(1b) + a fault-injection hook for beast verification follow. The
in-process rank-0 leader's own context fault still needs a reload that
can't rebind it (Stage 3); comm-desync + worker faults recover here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:05:02 +03:00
75cd088b61 fix(neuron): cap vision max_pixels to the pos_embed patch budget (#14)
All checks were successful
CI / CUDA type-check (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 30s
CI / Clippy (push) Successful in 2m32s
build-prerelease / Build neuron-blackwell (push) Successful in 6m5s
CI / Test (push) Successful in 5m49s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m11s
build-prerelease / Build neuron-ada (push) Successful in 5m40s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m4s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m57s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m16s
Beast testing surfaced a real regression in the dynamic-resolution
default: a tall 808×1600 image resized (within the 1024² max_pixels) to a
90×44 patch grid = 3960 patches, exceeding the vision tower's hard
`num_position_embeddings = 2304` pos-embed budget. The per-rank
`patch count 3960 exceeds pos_embed budget 2304` error fired mid-TP-
forward and poisoned the device context, bricking the model until reload.

Hard-cap `max_pixels` to `2304 × 16² = 589_824` px (≤ 2304 patches →
≤ 576 LM tokens), clamping even the operator env override. `smart_resize`
floors the pixel count under the cap, so no resized image can ever exceed
the budget — the tower check never fires, no poison. The pos-embed grid
(48×48) is the resolution Qwen3.6 was trained at, so the cap is
principled, not just defensive. Still ~3× the old fixed 196 tokens, and
the book-cover OCR test (1176 patches) already reads full title+subtitle.
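
The arithmetic above can be checked with a sketch of a Qwen-style `smart_resize`. This is an approximation written for illustration, not the ported function: it assumes factor 32, patch size 16, and the round/floor/ceil structure of the reference algorithm, and it omits input validation.

```rust
/// Sketch of a Qwen-style smart_resize; returns the resized (height, width).
/// Both outputs are multiples of `factor`.
pub fn smart_resize(h: u32, w: u32, factor: u32, min_pixels: u64, max_pixels: u64) -> (u32, u32) {
    let f = factor as f64;
    let (hf, wf) = (h as f64, w as f64);
    // Snap each side to the nearest multiple of `factor` (at least one factor).
    let mut hb = ((hf / f).round() * f).max(f);
    let mut wb = ((wf / f).round() * f).max(f);
    if hb * wb > max_pixels as f64 {
        // Too large: scale down and floor, so the result stays under the cap.
        let beta = (hf * wf / max_pixels as f64).sqrt();
        hb = ((hf / beta / f).floor() * f).max(f);
        wb = ((wf / beta / f).floor() * f).max(f);
    } else if hb * wb < min_pixels as f64 {
        // Too small: scale up and ceil, so the result reaches the minimum.
        let beta = (min_pixels as f64 / (hf * wf)).sqrt();
        hb = (hf * beta / f).ceil() * f;
        wb = (wf * beta / f).ceil() * f;
    }
    (hb as u32, wb as u32)
}

/// Number of vision patches for a resized image.
pub fn patch_count(h: u32, w: u32, patch: u32) -> u32 {
    (h / patch) * (w / patch)
}

pub const POS_EMBED_BUDGET: u32 = 2304;
pub const PATCH: u32 = 16;
/// 2304 * 16 * 16 = 589_824 pixels.
pub const HARD_MAX_PIXELS: u64 = POS_EMBED_BUDGET as u64 * PATCH as u64 * PATCH as u64;
```

Under these assumptions, a 1600-high by 808-wide image resizes to 1440×704 (90×44 = 3960 patches) with the 1024² budget, and to 1056×544 (2244 patches) with the hard cap.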

Test: a huge/tall/wide/extreme image battery stays within the 2304 patch
budget. (Per-rank-error poison robustness itself remains issue #17.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 23:30:47 +03:00
d311c8ca7a feat(neuron): operator pixel-budget env override + doc cleanup (#14 C5)
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Format (push) Successful in 45s
CI / Test (push) Failing after 58s
CI / Clippy (push) Successful in 2m41s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m14s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-blackwell (push) Successful in 6m20s
build-prerelease / Build neuron-ampere (push) Successful in 7m18s
build-prerelease / Build neuron-ada (push) Successful in 5m10s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m7s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
- PreprocessProfile::qwen3_6() reads NEURON_VISION_MIN_PIXELS /
  NEURON_VISION_MAX_PIXELS (clamped to factor² ≤ min ≤ max), matching the
  NEURON_VISION_LEGACY_* / NEURON_MROPE knob convention. Defaults remain
  256²…1024² (64…1024 LM tokens/image).
- Test: a max-resolution source caps within the token budget (can't blow
  NEURON_MAX_PROMPT_TOKENS).
- Strip stale fixed-resolution / "MRoPE gap (#15)" / 14×14 language from
  the preprocess, mod, and rope doc-comments now that resolution is
  dynamic and M-RoPE is implemented.
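
One way the `factor² ≤ min ≤ max` clamp could be written is sketched below. The function names are hypothetical, and two details are assumptions rather than facts from the commit: unparsable values fall back to the defaults, and a `max` below `min` is raised to `min`.

```rust
const DEFAULT_MIN_PIXELS: u64 = 256 * 256;
const DEFAULT_MAX_PIXELS: u64 = 1024 * 1024;

/// Parses an optional env value, falling back to `default` when unset or invalid.
fn parse_or(value: Option<&str>, default: u64) -> u64 {
    value.and_then(|s| s.trim().parse::<u64>().ok()).unwrap_or(default)
}

/// Resolves (min_pixels, max_pixels) so that factor^2 <= min <= max holds.
/// The arguments are the raw values of NEURON_VISION_MIN_PIXELS and
/// NEURON_VISION_MAX_PIXELS as read by the caller.
pub fn resolve_pixel_budget(min_env: Option<&str>, max_env: Option<&str>, factor: u64) -> (u64, u64) {
    let min = parse_or(min_env, DEFAULT_MIN_PIXELS).max(factor * factor);
    let max = parse_or(max_env, DEFAULT_MAX_PIXELS).max(min);
    (min, max)
}

/// LM tokens for a pixel budget: one token per factor x factor block.
pub fn lm_tokens(pixels: u64, factor: u64) -> u64 {
    pixels / (factor * factor)
}
```

With factor 32 the defaults give 64 and 1024 LM tokens per image, matching the range quoted above.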

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 22:50:03 +03:00
c97a8654f5 feat(neuron): dynamic-resolution images via Qwen smart_resize (#14)
Some checks failed
CI / Clippy (push) Waiting to run
CI / Test (push) Waiting to run
CI / CUDA type-check (push) Successful in 32s
CI / Format (push) Successful in 34s
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
Replace the fixed 448×448-square preprocess with native-aspect
`smart_resize`, and thread the resulting per-image grid through the LM
so spatial structure survives non-square images (documents, screenshots,
charts, panoramas, OCR) instead of being squished into a square.

- preprocess.rs: port Qwen `smart_resize` (factor = patch×merge = 32;
  pixel budget [min,max], default 256²–1024² → 64–1024 LM tokens).
  `PreprocessProfile` drops the fixed target dims for `factor`/`min_pixels`/
  `max_pixels`; `preprocess`/`preprocess_data_uri` now return the resized
  `(h, w)`; add `resized_dims_for_uri` (decode + resize, no normalize) for
  the TP leader's token count.
- rope.rs: `compute_mrope_index`/`get_rope_index` take per-image
  `grids: &[(lm_gh, lm_gw)]` instead of assuming a square `isqrt(run)`.
  Walk image runs in order, validate `run == gh*gw`, emit row-major
  positions, resume the shared counter at `base + max(gh,gw)`. Correct
  for multiple images of differing grids interleaved with text.
- candle.rs: `VisionMeta`/`LoadedModel`/`TpLoadedModel` carry the
  `image_grid_factor` (patch×merge) instead of the constant 196; all four
  prompt-build sites compute per-image counts from each image's resized
  grid (single-GPU from the extracted `ImageInput.h/w`, TP from
  `resized_dims_for_uri`). `ModelArch` gains `vision_grid_factor`.
- single-GPU (`mod.rs`, `dispatch.rs`) and TP
  (`tp_qwen3_5.rs::prefill_with_images_chunked`, `dispatch.rs`,
  `tp/worker.rs`) thread the grids into `get_rope_index`. Each TP rank
  recomputes grids from its own deterministic preprocess — no rpc.rs
  change, single source of truth.
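
The per-image grid walk described for `compute_mrope_index` can be sketched as follows. The signature and return shape are assumptions for illustration; the sketch also assumes image runs are separated by at least one non-image token, so that each contiguous run of placeholders is one image.

```rust
/// Sketch: returns ([t, h, w] position rows, rope_delta), or an error when a
/// placeholder run does not match its grid.
pub fn compute_mrope_index(
    tokens: &[u32],
    image_token: u32,
    grids: &[(usize, usize)],
) -> Result<([Vec<i64>; 3], i64), String> {
    let mut pos: [Vec<i64>; 3] = [Vec::new(), Vec::new(), Vec::new()];
    let mut counter: i64 = 0;
    let mut next_grid = 0;
    let mut i = 0;
    while i < tokens.len() {
        if tokens[i] != image_token {
            // Text token: all three axes share the running counter.
            for axis in pos.iter_mut() {
                axis.push(counter);
            }
            counter += 1;
            i += 1;
            continue;
        }
        // Image run: consume the contiguous placeholders and match them to a grid.
        let run = tokens[i..].iter().take_while(|&&t| t == image_token).count();
        let &(gh, gw) = grids.get(next_grid).ok_or("more image runs than grids")?;
        if run != gh * gw {
            return Err(format!("run {run} != grid {gh}x{gw}"));
        }
        let base = counter;
        for r in 0..gh {
            for c in 0..gw {
                pos[0].push(base); // temporal axis, grid_t = 1
                pos[1].push(base + r as i64); // row
                pos[2].push(base + c as i64); // column
            }
        }
        // Resume the shared counter past the larger grid side.
        counter = base + gh.max(gw) as i64;
        next_grid += 1;
        i += run;
    }
    Ok((pos, counter - tokens.len() as i64))
}
```

For two text tokens, a 2×3 image run, and one trailing text token, the trailing token lands at position 5 and `rope_delta` is -3, since six placeholders advance the counter by only three.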

The vision tower itself was already grid-general (recent pos-embed
interpolation + 2D rotary fix). No patch-count cap: pos-embed is
interpolated to any grid; `max_pixels` bounds cost (O(patches²) ViT
attention + prefill) instead.

Tests: smart_resize (aspect/cap/floor/reject), `compute_mrope_index`
non-square + two-image + mismatch cases, square-grid regression guard.
Non-cuda build + clippy + full workspace tests green; TP load/dispatch
paths are cuda-gated → Gitea CUDA type-check. Operator pixel-budget
config + remaining doc cleanup follow in C5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 22:47:27 +03:00
dc048ffcc9 fix(neuron): vision-tower 2D positions + M-RoPE default on
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m36s
build-prerelease / Build cortex binary (push) Successful in 4m48s
build-prerelease / Build neuron-blackwell (push) Successful in 5m59s
CI / Test (push) Successful in 6m35s
build-prerelease / Build neuron-ampere (push) Successful in 7m51s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ada (push) Successful in 5m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m6s
Two fixes to the spatial handling of images, validated against the HF
transformers 4.57.1 qwen3_vl reference on beast.

**Vision tower (the real cause of poor spatial vision).** The Stage-A
tower encoded position incorrectly in two ways, so the model saw image *content*
but not *layout* (a row of 5 people read as "a line of 23", sky
inverted), regardless of the LM-side rope:

- Learned pos-embed was a naive sequential lookup of the first
  `n_patches` rows of the 48×48 (`num_position_embeddings=2304`) grid —
  wrong stride for a 28×28 patch grid. Now bilinearly interpolates the
  grid to `gh×gw` (port of HF `fast_pos_embed_interpolate`), row-major.
- The 2D vision rotary was absent entirely. Added
  `VisionRotaryEmbedding` (θ=10000, dim=head_dim/2) applying per-patch
  `(row, col)` rotary to q/k in every ViT block via rope_slow, matching
  HF `apply_rotary_pos_emb_vision`.
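
The pos-embed fix in the first bullet can be illustrated with scalar bilinear interpolation. This is a simplified sketch, not the port: it uses one scalar per grid cell instead of an embedding vector, and it assumes sample coordinates evenly spaced from 0 to `side - 1` inclusive.

```rust
/// Bilinearly resamples a row-major `side x side` grid of scalars to `gh x gw`,
/// returning the result in row-major order.
pub fn interpolate_grid(table: &[f32], side: usize, gh: usize, gw: usize) -> Vec<f32> {
    assert_eq!(table.len(), side * side);
    // Evenly spaced sample coordinates from 0 to side-1 inclusive.
    let coords = |n: usize| -> Vec<f32> {
        (0..n)
            .map(|i| if n == 1 { 0.0 } else { i as f32 * (side - 1) as f32 / (n - 1) as f32 })
            .collect()
    };
    let (ys, xs) = (coords(gh), coords(gw));
    let mut out = Vec::with_capacity(gh * gw);
    for &y in &ys {
        let y0 = y.floor() as usize;
        let y1 = (y0 + 1).min(side - 1);
        let dy = y - y0 as f32;
        for &x in &xs {
            let x0 = x.floor() as usize;
            let x1 = (x0 + 1).min(side - 1);
            let dx = x - x0 as f32;
            // Blend along x on the two neighbouring rows, then along y.
            let top = table[y0 * side + x0] * (1.0 - dx) + table[y0 * side + x1] * dx;
            let bottom = table[y1 * side + x0] * (1.0 - dx) + table[y1 * side + x1] * dx;
            out.push(top * (1.0 - dy) + bottom * dy);
        }
    }
    out
}
```

At the native grid (`gh == gw == side`) every sample lands on a cell, so the output equals the table itself, which is the property the new unit test checks for the real implementation.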

Both default on; `NEURON_VISION_LEGACY_POS=1` / `NEURON_VISION_LEGACY_ROPE=1`
revert each for A/B (no rebuild). New unit tests: interpolation reduces
to the sequential lookup at the native grid; rotary row/col structure.

**M-RoPE default on.** The interleaved M-RoPE matches HF
apply_interleaved_mrope / get_rope_index exactly and A/B'd strictly ≥
plain. `NEURON_MROPE` is now a kill switch (`=0` for plain), not opt-in
— defaults should encode the model's trained behaviour, not freeze the
broken state.

Vision tower is plain candle (CPU-testable): built, clippy-clean, full
workspace tests green locally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 20:53:07 +03:00
7ebcfba5ca fix(neuron): gate M-RoPE behind NEURON_MROPE (default off)
All checks were successful
CI / CUDA type-check (push) Successful in 33s
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m34s
build-prerelease / Build cortex binary (push) Successful in 4m33s
build-prerelease / Build neuron-blackwell (push) Successful in 6m14s
CI / Test (push) Successful in 6m50s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m12s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ada (push) Successful in 5m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m3s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
On beast the interleaved M-RoPE degraded image understanding rather than
fixing it: the model misread spatial layout (a horizontal row of people
described as a "diagonal receding line"), got attributes wrong, and
rambled — a "how many people" follow-up generated 4459 tokens over 3.5
minutes, past agent-0's HTTP timeout (the reported "fails to respond
without an error" symptom). The interleave is evidently not numerically correct, and it
can't be validated remotely without a transformers reference.

Gate it: `get_rope_index` now returns plain sequential identity
positions unless NEURON_MROPE is truthy, so mrope_cos_sin reduces to
plain RoPE and image tokens behave exactly as pre-M-RoPE (content
recognition works; spatial layout approximate; no rambling). The real
computation moves to `compute_mrope_index` (still unit-tested). Default
off restores the working vision and unblocks agent-0; the M-RoPE code
stays in place to debug + validate before flipping the default on.
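
The gating can be sketched as below. The accepted truthy spellings are an assumption (the commit only says "truthy"), and the closure parameter is a hypothetical stand-in for the real M-RoPE computation.

```rust
/// Hypothetical truthiness rule; the accepted spellings are an assumption.
pub fn is_truthy(value: Option<&str>) -> bool {
    matches!(
        value.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("1" | "true" | "yes" | "on")
    )
}

/// Plain sequential identity positions: all three axes are 0..seq_len.
pub fn plain_positions(seq_len: usize) -> [Vec<i64>; 3] {
    let seq: Vec<i64> = (0..seq_len as i64).collect();
    [seq.clone(), seq.clone(), seq]
}

/// Gated entry point: runs the full computation only when the env value is truthy.
/// `mrope_env` is the raw value of NEURON_MROPE as read by the caller.
pub fn get_rope_index(
    seq_len: usize,
    mrope_env: Option<&str>,
    full: impl FnOnce() -> [Vec<i64>; 3],
) -> [Vec<i64>; 3] {
    if is_truthy(mrope_env) {
        full()
    } else {
        plain_positions(seq_len)
    }
}
```

With identical positions on all three axes, a per-axis rotary computation has nothing to distinguish, which is why the gated path behaves like plain RoPE.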

Pure non-cuda change (rope.rs); both single-GPU and TP forwards call
the gated get_rope_index unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 19:32:59 +03:00
825bf4e905 feat(neuron): M-RoPE Stage 4 — wire interleaved M-RoPE into the TP path
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / CUDA type-check (push) Successful in 31s
CI / Format (push) Successful in 42s
build-prerelease / Build cortex binary (push) Successful in 5m9s
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
CI / Test (push) Successful in 7m19s
build-prerelease / Build neuron-ampere (push) Successful in 8m40s
build-prerelease / Build neuron-ada (push) Successful in 5m17s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m53s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m14s
CI / Clippy (push) Successful in 2m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Mirror Stage 3 into the tensor-parallel Qwen3.6 model:

- TpQwen3_5Attention / DecoderLayer take (cos, sin) instead of a scalar
  offset and apply via apply_cos_sin.
- TpQwen3_5Model gains the replicated rotary + rope_delta (reset in
  clear_kv_cache, settable). forward_inner builds the cos/sin once —
  interleaved M-RoPE from explicit position_ids (vision) or plain at
  offset+rope_delta (text/decode). forward() and forward_with_positions()
  delegate; the old single-shot forward_with_vision is gone.
- prefill_with_images_chunked now computes get_rope_index over the whole
  prompt once, stores rope_delta on the base model, and slices the
  (3, prompt_len) position tensor per chunk — so every rank assigns image
  tokens their 14×14 grid coordinates and steps in lockstep (every chunk,
  text or image, carries the M-RoPE slice because the image shifts the
  surrounding text positions).

Also build the position-id tensor as f32 directly (positions are small
integers, exact in f32) to avoid an i64→f32 cast on the GPU.

The TP forward is cuda-gated — CI CUDA type-check is the compile gate.
Non-cuda build + clippy + full workspace tests green; rope math + the
plain-RoPE-reduction invariant covered by unit tests.

Completes the interleaved-M-RoPE work for the vision spatial misread.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 18:46:27 +03:00
4c12c7e2f0 feat(neuron): M-RoPE Stage 3 — wire interleaved M-RoPE into single-GPU
Qwen3_5Model now builds the rotary cos/sin once per forward and threads
(cos, sin) through the decoder → full-attention → rope, replacing the
scalar offset that reached RotaryEmbedding:

- vision forward computes get_rope_index over the (single-shot) prompt,
  sets rope_delta, and builds interleaved-M-RoPE cos/sin so image tokens
  carry their 14×14 grid (height/width) positions;
- text / decode take plain_cos_sin at offset + rope_delta — with
  rope_delta == 0 (no image) this is bit-for-bit the old plain RoPE, and
  the device→host id copy is skipped on the text decode hot path.

rope_delta is stored on the model and reset in clear_kv_cache, so decode
after a vision prefill resumes text positions from the image-compressed
counter. decoder.rs / full_attn.rs take (cos, sin) instead of offset;
linear-attention layers are unchanged (no RoPE). The TP path still uses
the retained apply(offset) — wired in Stage 4.

Full workspace tests green; the load-bearing invariant (M-RoPE == plain
for equal axes) keeps text unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 18:39:52 +03:00
ba1b5ba408 feat(neuron): M-RoPE Stage 2 — get_rope_index position-id helper
Pure function computing the interleaved-M-RoPE 3D position ids for a
prompt with image-placeholder runs, plus the decode rope_delta:
text tokens advance a single counter (all axes equal); each image run
gets [base+t, base+h, base+w] row-major over a square grid_t=1,
grid_h=grid_w=isqrt(run) (196 → 14×14); the counter resumes from
base + max(grid). rope_delta = final_counter - seq_len lets decode
resume text positions after the position-compressed image blocks.
Plus mrope_position_tensor to build the (3, seq) tensor.
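
The rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the committed helper: the name `rope_index`, its signature, and the `[i64; 3]` layout are hypothetical, and adjacent image runs with no separator token would be merged into one run here.

```rust
/// Side of a square grid for a run of `n` tokens; None when `n` is not a perfect square.
fn square_side(n: usize) -> Option<usize> {
    let side = (n as f64).sqrt().round() as usize;
    (side * side == n).then_some(side)
}

/// Returns ([t, h, w] per token, rope_delta). Hypothetical signature.
fn rope_index(ids: &[u32], image_token_id: u32) -> Result<(Vec<[i64; 3]>, i64), String> {
    let mut pos = Vec::with_capacity(ids.len());
    let mut counter: i64 = 0;
    let mut i = 0;
    while i < ids.len() {
        if ids[i] != image_token_id {
            // Text token: all three axes share the running counter.
            pos.push([counter; 3]);
            counter += 1;
            i += 1;
            continue;
        }
        // Image run: measure it, then lay it out row-major on a square grid.
        let start = i;
        while i < ids.len() && ids[i] == image_token_id {
            i += 1;
        }
        let run = i - start;
        let side = square_side(run)
            .ok_or_else(|| format!("non-square image run of {run} tokens"))?;
        let base = counter;
        for h in 0..side as i64 {
            for w in 0..side as i64 {
                // grid_t = 1, so the temporal axis stays at `base`.
                pos.push([base, base + h, base + w]);
            }
        }
        // Resume from base + max(grid_t, grid_h, grid_w) = base + side.
        counter = base + side as i64;
    }
    Ok((pos, counter - ids.len() as i64))
}
```

For two text tokens, a 4-token image run, and one trailing text token, the image occupies positions 2..=3 on the h and w axes, the trailing token resumes at 4, and rope_delta is 5 - 7 = -2.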

Unit tests: text-only is sequential (delta 0); text+image+text matches
hand-computed grid ids + resume + delta; 196 → 14×14; non-square run
rejected; end-to-end through mrope_cos_sin tracks the height axis.

#[allow(dead_code)] until Stage 3/4 wire it into the forward.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 18:34:28 +03:00
5731f4c318 feat(neuron): M-RoPE Stage 1 — interleaved rope machinery + config
Parse + store mrope_section / mrope_interleaved in RopeParameters
(previously accepted-but-ignored). RotaryEmbedding gains:
- inv_freq + per-axis column masks (mask_t/h/w) built from mrope_section;
- plain_cos_sin(pos, seq_len): narrow the precomputed tables (text/decode);
- mrope_cos_sin(position_ids (3,seq)): per-axis freqs blended at the
  interleave columns (vision);
- apply_cos_sin(q,k,cos,sin): the rope_slow application, factored out.

The existing apply(q,k,offset) is retained (delegates to
plain_cos_sin + apply_cos_sin) so current callers are unchanged; Stages
3–4 move cos/sin construction into the model forward and thread the 3D
position ids for image tokens.

Tests: masks partition the half-dim; interleave drives the right axis
per column; and the load-bearing invariant — mrope_cos_sin reduces
bit-for-bit to plain_cos_sin when the three axes are equal (so text
inference is unchanged).
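
A sketch of why the invariant holds, in plain Rust without tensors. The interleave pattern (columns cycle t, h, w, with leftover columns falling back to t) and the section values used in the usage note are assumptions for illustration; all function names are hypothetical.

```rust
/// Axis (0 = t, 1 = h, 2 = w) driving each half-dim column, assuming the
/// interleave t,h,w,t,h,w,... with leftover columns falling back to t.
fn column_axes(section: [usize; 3]) -> Vec<usize> {
    let half: usize = section.iter().sum();
    (0..half)
        .map(|c| match c % 3 {
            1 if c < 3 * section[1] => 1,
            2 if c < 3 * section[2] => 2,
            _ => 0,
        })
        .collect()
}

/// inv_freq[c] = theta^(-c / half), the usual RoPE frequency ladder.
fn inv_freq(half: usize, theta: f32) -> Vec<f32> {
    (0..half)
        .map(|c| theta.powf(-(c as f32) / half as f32))
        .collect()
}

/// Rotation angle per column for one token with position [t, h, w].
fn mrope_angles(pos: [f32; 3], axes: &[usize], inv_freq: &[f32]) -> Vec<f32> {
    axes.iter().zip(inv_freq).map(|(&a, &f)| pos[a] * f).collect()
}

/// Plain RoPE angles for a scalar position.
fn plain_angles(p: f32, inv_freq: &[f32]) -> Vec<f32> {
    inv_freq.iter().map(|&f| p * f).collect()
}
```

Each column multiplies exactly one axis position by its frequency, so when t == h == w every product is the same operation as in the plain path and the results match bit for bit. With an example section of [11, 11, 10] the three axis masks cover all 32 half-dim columns with no overlap.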

Refs the MRoPE-gap diagnosis (vision spatial misread). Pure non-cuda;
no behaviour change until wired.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 18:31:15 +03:00
fa013505d1 fix(neuron): chunked TP-vision prefill + pre-flight VRAM guard
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 29s
build-prerelease / Build cortex binary (push) Successful in 4m26s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-blackwell (push) Successful in 6m6s
build-prerelease / Build neuron-ampere (push) Successful in 8m30s
CI / Format (push) Successful in 38s
CI / CUDA type-check (push) Successful in 47s
CI / Clippy (push) Successful in 2m36s
build-prerelease / Build neuron-ada (push) Successful in 5m19s
CI / Test (push) Successful in 6m3s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m32s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
agent-0 sent a ~13k-token prompt + image; the TP vision prefill was
single-shot, so it tried to materialise activations for all 12,960
positions at once and OOM'd rank 1 mid-forward. Rank 1 died before
issuing its row-parallel AllReduce, stranding rank 0 on the collective
(it hung holding the pool lock). The text path survives the same size
because it chunks the prefill.

Chunk the vision prefill the same way:

- TpQwen3_5ForCausalLM::prefill_with_images_chunked encodes the image(s)
  once, then walks the pre-expanded prompt in prefill_chunk_tokens()
  windows, splicing the patch-embedding rows into whichever chunk(s)
  carry <|image_pad|> positions (pure-text chunks take the plain
  forward). Activation is bounded by the chunk, not the prompt.
- Every rank runs the identical chunk sequence (chunk_size threaded
  through GenerateStepWithImages / TpForwardLogitsWithImages /
  generate_step_with_images), so the per-chunk AllReduces stay paired
  across ranks with no extra sync — the KV cache accumulates via the
  growing offset, only the last chunk's logits are kept.
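
The windowing can be sketched as below. This is an illustration only: `chunk_windows` and `image_rows_per_chunk` are hypothetical names, and the real code works on tensors rather than id slices.

```rust
use std::ops::Range;

/// Split `[0, prompt_len)` into consecutive windows of at most `chunk` tokens.
fn chunk_windows(prompt_len: usize, chunk: usize) -> Vec<Range<usize>> {
    assert!(chunk > 0, "chunk size must be positive");
    (0..prompt_len)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(prompt_len))
        .collect()
}

/// For each window, how many image-pad positions it carries; 0 means the
/// chunk can take the plain text forward.
fn image_rows_per_chunk(ids: &[u32], image_token_id: u32, chunk: usize) -> Vec<usize> {
    chunk_windows(ids.len(), chunk)
        .into_iter()
        .map(|w| ids[w].iter().filter(|&&t| t == image_token_id).count())
        .collect()
}
```

Because the windows depend only on the prompt length and the chunk size, every rank that receives the same two values derives the same chunk sequence, which is what keeps the per-chunk collectives paired.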

Pre-flight guard (validate_vision_prefill): even chunked, a long
prompt's KV cache can exhaust VRAM mid-forward, and on TP that hangs
the collective. Reject up front with a clean InsufficientVram when the
estimated footprint exceeds free VRAM, so a doomed request fails fast
instead of hanging the daemon. Heuristic + tunable
(NEURON_VISION_PREFILL_MB_PER_1K_TOKENS / _BASE_MB); default permissive
so the now-working 12,960-token case still passes. Applied to every
vision path (single-GPU + TP); single-GPU vision stays single-shot for
now, so the guard is its protection until it's chunked too.
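
A sketch of what such a guard might look like. The linear cost model, the rounding to whole 1k-token blocks, the struct fields, and all numbers in the usage note are assumptions; only the "estimate versus free VRAM, reject up front" shape comes from the message above.

```rust
/// Hypothetical error payload for a rejected request.
#[derive(Debug, PartialEq)]
struct InsufficientVram {
    needed_mb: u64,
    free_mb: u64,
}

/// Assumed linear estimate: a fixed base plus a per-1k-token cost,
/// with the token count rounded up to whole blocks of 1000.
fn validate_prefill(
    prompt_tokens: u64,
    free_mb: u64,
    base_mb: u64,
    mb_per_1k: u64,
) -> Result<(), InsufficientVram> {
    let blocks = (prompt_tokens + 999) / 1000;
    let needed_mb = base_mb + mb_per_1k * blocks;
    if needed_mb > free_mb {
        return Err(InsufficientVram { needed_mb, free_mb });
    }
    Ok(())
}
```

With illustrative values of 512 MB base and 100 MB per 1k tokens, a 12,960-token prompt is estimated at 1,812 MB, so it passes with 8,000 MB free and is rejected with 1,000 MB free.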

Tests: pre-flight guard behaviour; RPC round-trip carries chunk_size.
The chunked forward is cuda-gated — CI CUDA type-check validates it.

Refs #16 / TP-vision. Operational note: a TP rank OOM still hangs the
daemon (needs restart); making a worker failure abort the leader's
collective is separate, broader TP hardening.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 17:21:36 +03:00
c8bcaabc38 fix(neuron): render HF chat templates via minijinja pycompat
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 34s
CI / CUDA type-check (push) Successful in 39s
CI / Clippy (push) Successful in 2m35s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
CI / Test (push) Successful in 6m47s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 7m43s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ada (push) Successful in 5m41s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
The Qwen3.6 chat_template.jinja (now loaded after the precedence fix)
failed to render in minijinja: it uses Python str methods
(content.startswith/endswith/split/rstrip/lstrip) and the raise_exception
global that HF transformers patches into its Jinja env but minijinja
doesn't provide. The render error tripped the text-only fallback, so
image requests still produced zero <|image_pad|> tokens.

Wire the standard bridge into render_chat_template:
- minijinja-contrib `pycompat::unknown_method_callback` supplies the
  Python string/list/dict methods;
- a `raise_exception` global maps to a render error (so malformed inputs
  — e.g. an image in a system message — surface cleanly).

Add the real Qwen3.6-27B chat_template.jinja (verbatim from beast's HF
cache) as a test fixture and assert it renders one <|image_pad|> for a
text+image turn — the end-to-end check that would have caught this
before deploy.

Refs #16 / TP-vision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 16:32:23 +03:00
7ad56c6a86 fix(neuron): load chat_template.jinja (transformers precedence)
The chat-template loader only read the `chat_template` field from
tokenizer_config.json. Qwen3.6-27B ships its vision-aware template
*only* in a standalone `chat_template.jinja` (and has no
tokenizer_config.json at all), so the loader returned None and image
requests fell back to the text-only format_qwen3_prompt — rendering
zero `<|image_pad|>` tokens and tripping
"expand_image_pad_tokens: prompt has 0 image_token_id occurrences".

load_chat_template_alongside now follows HF transformers precedence:
standalone chat_template.jinja → chat_template.json → the
chat_template field in tokenizer_config.json. Tests cover the
precedence, the text-only fallback, and that an OpenAI image_url
content part renders `<|image_pad|>` through the real template
condition (`'image_url' in item`).
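
The precedence order can be sketched as a first-match search. `TemplateSource` and `pick_template_source` are hypothetical names, and the `has` callback stands in for the file reads and JSON parsing the real loader performs.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum TemplateSource {
    StandaloneJinja,      // chat_template.jinja
    StandaloneJson,       // chat_template.json
    TokenizerConfigField, // `chat_template` field in tokenizer_config.json
}

/// Pick the highest-precedence source; `has` reports whether a source
/// actually yields a template.
fn pick_template_source(has: impl Fn(TemplateSource) -> bool) -> Option<TemplateSource> {
    [
        TemplateSource::StandaloneJinja,
        TemplateSource::StandaloneJson,
        TemplateSource::TokenizerConfigField,
    ]
    .iter()
    .copied()
    .find(|&s| has(s))
}
```

A `None` result corresponds to the text-only fallback.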

Refs #16 / TP-vision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 16:25:30 +03:00
1b0e36c119 fix(neuron): cover TpForwardLogitsWithImages in drain_poisoned match
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build cortex binary (push) Successful in 4m18s
build-prerelease / Build neuron-blackwell (push) Successful in 5m48s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
CI / Test (push) Successful in 6m20s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m26s
build-prerelease / Build neuron-ada (push) Successful in 5m21s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
The CUDA type-check caught a non-exhaustive match: drain_poisoned()
must reply an error to every Job variant's reply channel, including the
new cuda-gated TpForwardLogitsWithImages. The non-cuda build couldn't
see it — the variant is #[cfg(feature = "cuda")], so the match is
exhaustive without it on CPU.
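
A reduced illustration of the failure mode, with a simplified `Job` enum and reply type that are assumptions rather than the real definitions. Without the feature, the gated variant and its arm both disappear and the match is exhaustive; with the feature, removing the gated arm makes the match non-exhaustive.

```rust
use std::sync::mpsc::Sender;

type Reply = Sender<Result<Vec<f32>, String>>;

enum Job {
    ForwardLogits { reply: Reply },
    // Only exists when the crate is built with the `cuda` feature.
    #[cfg(feature = "cuda")]
    TpForwardLogitsWithImages { reply: Reply },
}

/// Reply with an error on every queued job; returns how many were drained.
fn drain_poisoned(jobs: Vec<Job>) -> usize {
    let mut drained = 0;
    for job in jobs {
        match job {
            Job::ForwardLogits { reply } => {
                let _ = reply.send(Err("worker poisoned".to_string()));
            }
            // Needed only under `cuda`; a non-cuda build cannot tell if it is missing.
            #[cfg(feature = "cuda")]
            Job::TpForwardLogitsWithImages { reply } => {
                let _ = reply.send(Err("worker poisoned".to_string()));
            }
        }
        drained += 1;
    }
    drained
}
```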

Refs TP-vision plan Stage 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:26:46 +03:00
ed2d09864e feat(neuron): TP-vision Stage 3 — wire TP chat + stream vision prefill
Some checks failed
CI / Format (push) Successful in 30s
CI / Clippy (push) Successful in 2m51s
CI / Test (push) Successful in 5m52s
CI / CUDA type-check (push) Failing after 50s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
End-to-end TP-vision: an image request to a TP-loaded Qwen3.6-27B now
conditions on the image across both ranks.

- TpLoadedModel carries has_vision / image_token_id / lm_tokens_per_image,
  populated at load via the shared VisionMeta::from_config_path (same
  config.json the shards loaded from; Stage 1 materialises the replicated
  tower on every rank).
- LoadedHandle::capabilities() now advertises "vision" for TP loads with
  a tower (cortex-gateway already unions this into /v1/models via C3).
- The TP rejection guards (chat_completion_tp + inference_tp_stream) are
  now conditional on !has_vision — text-only TP models still 400 cleanly,
  vision-capable ones fall through.
- chat_completion_tp_inner and the streaming orchestration task detect
  images (request_has_images), expand <|image_pad|> to the per-image
  patch count, and run a single-shot generate_step_with_images prefill
  (every rank encodes + splices its replicated tower) before the
  unchanged decode loop. Text requests keep chunked_prefill_tp.
- extract_image_data_uris ships the source data URIs to every rank for
  identical per-rank preprocessing.

prompt_tokens now reflects the patch expansion, so usage accounting and
KV offsets match the single-GPU baseline.

TP entry points are cuda-gated (validated by CI's CUDA type-check);
capabilities() + extract_image_data_uris + VisionMeta reuse compile on
the non-cuda build. Full workspace tests green.

Refs TP-vision plan Stage 3. Implements #12.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:14:44 +03:00
4994b94c84 feat(neuron): TP-vision Stage 2 — per-rank image RPC + worker plumbing
Carry image content through the TP forward path so every rank encodes
and splices locally (replicated tower, no embedding broadcast).

- rpc.rs: new WorkerRequest::GenerateStepWithImages carrying the source
  image data URIs + image_token_id for the single-shot vision prefill;
  worker still replies GenerateStepOk. Round-trip test added.
- tp_qwen3_5.rs: TpQwen3_5ForCausalLM::forward_with_images — encode each
  preprocessed image through the rank's replicated tower, cat, splice,
  forward. Shared by leader and worker so every rank runs identical work.
- tp/mod.rs: TpLeaderModel::forward_with_images and
  WorkerPool::generate_step_with_images (mirrors generate_step: fan out
  GenerateStepWithImages to subprocess ranks, run the leader's image
  forward on its device worker thread, drain, combine).
- worker.rs: WorkerModel::forward_with_images + handle_generate_step_with_images
  — each subprocess rank preprocesses the same data URIs via the shared
  deterministic preprocess_data_uri, encodes, splices, forwards.
- device_worker: Job::TpForwardLogitsWithImages + tp_forward_logits_with_images
  dispatch handler + DeviceWorkerHandle::tp_forward_logits_with_images.

Determinism: every rank runs the same preprocess on the same source
URIs through the same replicated tower, so the spliced hidden state
matches across ranks — preserving the replicated-hidden-state invariant
the row-parallel AllReduce relies on, with no NCCL broadcast.

No caller yet — Stage 3 wires the TP chat/stream entry points to invoke
generate_step_with_images for image prefill. cuda-gated plumbing covered
by CI's CUDA type-check; rpc/route/forward_with_images compile on the
non-cuda build.

Refs TP-vision plan Stage 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:08:08 +03:00
9a24b05866 feat(neuron): TP-vision Stage 1 — replicated vision tower on the TP model
Load the full, unsharded model.visual.* vision tower on every TP rank
(leader + each subprocess worker mmaps the same local safetensors) when
config.vision_config is present. VisionTower::load already takes a
ShardedVarBuilder whose plain .get() returns the full replicated tensor,
so the tower loads identically regardless of world_size — no sharding,
no NCCL broadcast.

- TpQwen3_5ForCausalLM gains vision: Option<VisionTower> + image_token_id,
  plus has_vision/image_token_id/encode_image/forward_with_vision,
  mirroring the single-GPU Qwen3_5ForCausalLM wrapper.
- TpQwen3_5Model::forward_with_vision mirrors the single-GPU
  forward_inner splice: embed locally, replace rows at image_token_id
  positions, run the sharded decoder stack. Because every rank encodes
  the same pixels through its replicated tower, the spliced input
  embeddings are identical across ranks — preserving the TP
  replicated-hidden-state invariant the row-parallel AllReduce relies on.
- splice_runs is now pub(crate) and shared with the TP model.
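
The splice step can be sketched on plain vectors. This is not the shared `splice_runs` helper: `splice_rows` is a hypothetical row-level stand-in that uses `Vec<Vec<f32>>` in place of tensors.

```rust
/// Replace the embedding row at every `image_token_id` position with the
/// next image-embedding row, in order. Errors when the counts disagree.
fn splice_rows(
    ids: &[u32],
    image_token_id: u32,
    mut embeds: Vec<Vec<f32>>, // one row per token
    image_rows: &[Vec<f32>],   // one row per image patch token
) -> Result<Vec<Vec<f32>>, String> {
    if ids.len() != embeds.len() {
        return Err(format!("{} ids but {} embedding rows", ids.len(), embeds.len()));
    }
    let slots = ids.iter().filter(|&&t| t == image_token_id).count();
    if slots != image_rows.len() {
        return Err(format!("{slots} image slots but {} image rows", image_rows.len()));
    }
    let mut next = image_rows.iter();
    for (row, &id) in embeds.iter_mut().zip(ids) {
        if id == image_token_id {
            // Cannot run out: the counts were checked above.
            *row = next.next().expect("slot count checked").clone();
        }
    }
    Ok(embeds)
}
```

The function is deterministic in its inputs, so ranks that hold identical ids and identical image rows produce identical spliced embeddings.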

No caller yet — Stage 2 wires the RPC/worker path that invokes
encode_image + forward_with_vision per rank. Most of this compiles on
the non-cuda build (only the cuda load variant's tower line is gated);
CI's CUDA type-check covers the rest.

Refs TP-vision plan Stage 1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:00:05 +03:00
7bb033b4ed chore: untrack stray .claude/scheduled_tasks.lock and gitignore .claude/
All checks were successful
CI / CUDA type-check (push) Successful in 32s
CI / Format (push) Successful in 30s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Clippy (push) Successful in 2m45s
build-prerelease / Build cortex binary (push) Successful in 4m28s
CI / Test (push) Successful in 6m6s
build-prerelease / Build neuron-blackwell (push) Successful in 6m11s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m28s
build-prerelease / Build neuron-ampere (push) Successful in 8m1s
build-prerelease / Build neuron-ada (push) Successful in 8m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
A runtime scheduler lock was accidentally swept into the previous
commit by `git add -A`. Remove it from tracking (file stays on disk)
and ignore the whole `.claude/` dir so local agent runtime state never
lands in the repo again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 14:55:05 +03:00
f8c0da0ebf fix(neuron): TP-vision Stage 0 — reject image requests on the TP path
Some checks failed
build-prerelease / Resolve version stamps (push) Waiting to run
CI / Format (push) Waiting to run
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Clippy (push) Has been cancelled
CI / Test (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
The TP inference path has no vision tower, and the TP dispatch in
chat_completion / inference_stream returns before the VisionUnsupported
guard runs — so an image request to a TP-loaded model (e.g. beast's
tp=2 Qwen3.6-27B) was silently dropped and answered from text alone,
the exact issue-#3 confident-hallucination pattern Stage C killed for
single-GPU.

Add the request_has_images → VisionUnsupported guard to both
chat_completion_tp and inference_tp_stream, before prefill / before the
SSE stream opens, so beast returns a clean 400 vision_unsupported. The
guard is unconditional for now (TP has no tower); Stage 3 makes it
conditional on the TP model's has_vision once real TP-vision lands.

Detection is covered by the existing request_has_images unit test; the
guard itself is cuda-gated (validated by CI's CUDA type-check).

Refs TP-vision plan Stage 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 14:53:56 +03:00
dd592d918d test(neuron): C2 — guard Responses→chat image translation contract
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 44s
CI / Clippy (push) Successful in 2m51s
build-prerelease / Build cortex binary (push) Successful in 4m42s
build-prerelease / Build neuron-blackwell (push) Successful in 5m52s
CI / Test (push) Successful in 6m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m12s
build-prerelease / Package cortex RPM (push) Successful in 1m26s
build-prerelease / Build neuron-ada (push) Successful in 5m34s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
The Responses request translator already emits the chat `image_url`
Parts array Stage B5's vision path consumes, and the non-streaming
(`chat_completion`) and streaming (`responses_stream` → `inference_stream`,
Stage C1) Responses paths both route image content to the vision-aware
prefill — so vision works end-to-end through `/v1/responses` with no
translator change required.

Add a multi-image test asserting order preservation and that the
`detail` hint is tolerated (and dropped, since chat image_url has no
analogue), locking the translator's output to the exact
`image_url.url` shape `extract_images_from_request` walks.

Closes part of #16 (Stage C2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 13:57:43 +03:00
766c20ba47 feat(neuron): C1 — streaming SSE chat completion with vision
The streaming worker path now splices image embeddings on prefill,
closing the silent text-only degrade for `stream=true` image requests.

`inference_stream` gains the same vision-routing block as the
non-streaming `chat_completion`: detect `image_url` content, reject it
against text-only models with `VisionUnsupported` (before any SSE frame
is sent), preprocess each image and expand its `<|image_pad|>` sentinel
to the per-image patch count, then carry the payload through dispatch.

Rather than duplicate the 75-line `route_token!` reasoning/tool-call
state machine into a sibling streamer, `stream_inference_via_worker`
takes an `Option<(Vec<ImageInput>, u32)>`: when `Some`, prefill is a
single-shot `forward_logits_with_images` splice; when `None`, the
original chunked text-only prefill. Image embeddings are prefill-only,
so every decode step stays on the plain `forward_logits` path and the
shared decode loop is untouched. This keeps exactly one copy of the
tool-call/reasoning logic to maintain.

The Responses API streaming path (`responses_stream`) inherits vision
for free since it drives the same `inference_stream`.

Unit test covers `request_has_images` (the shared routing gate); the
real-weights SSE smoke test is the manual curl on beast (cuda-integration).

Closes part of #16 (Stage C1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 13:57:02 +03:00
4972c7d1e7 feat(cortex-gateway): C3 — propagate vision capabilities through /v1/models
ModelEntry and CortexModelEntry gain a `capabilities: Vec<String>`
field (serde-default for back-compat). The poller copies it verbatim
from each neuron's ModelInfo.capabilities; list_models computes the
union across every node where a model is loaded so a checkpoint loaded
text-only on one neuron and text+vision on another reports both to the
fleet. Catalogue-only and mid-prewarm entries default to empty until
the catalogue gains a capabilities declaration.

Aliases inherit their target's capability union. New gateway test mocks
two nodes with differing capability arrays and asserts the unioned
/v1/models response.
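
The union itself is small. In this sketch `capability_union` is a hypothetical name and the sorted output order is an assumption, since the message does not say how the gateway orders the result.

```rust
use std::collections::BTreeSet;

/// Union of the capability arrays reported by every node that has the model loaded.
fn capability_union(per_node: &[Vec<String>]) -> Vec<String> {
    let set: BTreeSet<&str> = per_node.iter().flatten().map(String::as_str).collect();
    set.into_iter().map(str::to_owned).collect()
}
```

An empty input yields an empty list, matching the default for catalogue-only and mid-prewarm entries.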

Closes part of #16 (Stage C3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 13:49:54 +03:00
a26bb9f04b feat(deploy): capture service startup journal after each restart
After both `Start cortex.service` and `Start neuron.service`, sleep 10s
and run `journalctl --unit <unit> -I --no-pager` to record the latest
invocation's log in the workflow output. Step is guarded by
`if: always()` so a failed start still leaves a usable trace.

infra-setup.sh now adds gitea_ci to the systemd-journal group during
user provisioning, so `journalctl` works without a sudoers entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 16:48:56 +03:00
ea1fdf8aa6 chore(deploy): drop deploy.sh and manifest.yml now that workflow runs
First end-to-end run of the deploy workflow succeeded (gitea run #289),
so the operator-run rolling-deploy script and its YAML manifest are no
longer the source of truth — fleet topology lives in
.gitea/workflows/deploy.yml and per-host config in script/infra-setup.sh.

Per-host neuron config comments updated to point at the new sync path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 16:41:04 +03:00
577781de8d fix(neuron): derive Clone on ImageInput for the CUDA vision dispatch
All checks were successful
CI / CUDA type-check (push) Successful in 32s
CI / Format (push) Successful in 34s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m47s
build-prerelease / Build cortex binary (push) Successful in 4m34s
CI / Test (push) Successful in 6m14s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m58s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Build neuron-ampere (push) Successful in 8m5s
build-prerelease / Build neuron-ada (push) Successful in 8m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
CUDA type-check in CI failed on commit 24968e9 with E0308:

  error[E0308]: mismatched types
      --> crates/neuron/src/harness/candle.rs:1707:33
   1707 |                                 images.clone(),
        |                                 ^^^^^^^^^^^^^^ expected `Vec<ImageInput>`,
                                                          found `&Vec<ImageInput>`

In Stage B5 the cuda branch of `chat_completion` matches
`&vision_route` to keep the `vision_route: Option<...>` alive for
both arms, which makes `images` bind as `&Vec<ImageInput>`. The
subsequent `images.clone()` call doesn't deep-clone because
`ImageInput` doesn't derive `Clone` — rustc falls back to cloning
the `&Vec` reference, which has the wrong type for the worker job.

The CPU build (non-cuda) compiled fine because that branch is
behind `#[cfg(feature = "cuda")]`; the cuda-check job is what
catches the regression.

Fix: derive `Clone` on `ImageInput`. The clone cost is one
pixel-buffer memcpy per image (~2.4 MB at fixed 448×448), which
is fine on the chat-completion hot path, since vision requests are
rare relative to text-only decode steps.
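
A reduced reproduction of the method-resolution behaviour, using stand-in types (`Pixels`, `ClonePixels`) rather than the real `ImageInput`:

```rust
struct Pixels(Vec<u8>); // deliberately no Clone

#[derive(Clone)]
struct ClonePixels(Vec<u8>);

#[allow(noop_method_call)]
fn clone_without_derive(images: &Vec<Pixels>) -> &Vec<Pixels> {
    // `Vec<Pixels>: Clone` does not hold, so this resolves to
    // `<&Vec<Pixels> as Clone>::clone` and only copies the reference.
    images.clone()
}

fn clone_with_derive(images: &Vec<ClonePixels>) -> Vec<ClonePixels> {
    // With the derive, the same call deep-clones into an owned Vec.
    images.clone()
}
```

The return types show the difference: the first function can only hand back a borrowed `&Vec`, which is the mismatch the worker job rejected.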

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 15:51:57 +03:00
24968e9233 feat(neuron): Stage B — end-to-end text+image chat for Qwen3.6
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 33s
CI / CUDA type-check (push) Failing after 46s
CI / Clippy (push) Successful in 2m37s
build-prerelease / Build cortex binary (push) Successful in 4m32s
build-prerelease / Build neuron-blackwell (push) Failing after 5m35s
CI / Test (push) Successful in 6m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Failing after 7m46s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Build neuron-ada (push) Failing after 4m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Stage B of the vision plan (doc/vision-qwen3_6-spec.md). Wires
the vision tower from Stage A through to a complete non-streaming
chat completion: extract images from the request, preprocess,
encode on the worker thread, splice embeddings into the LM input
at `<|image_pad|>` positions, return a coherent text response with
`prompt_tokens` reflecting patch tokens.

Closes the silent-drop class of failures from issue #3 — vision
requests against Qwen3.6 now condition the model on the image
instead of producing confident text-only hallucinations.

Streaming for vision is Stage C. Deferred items tracked under
#12 (TP-vision), #13 (27B production), #14 (dynamic resolution),
#15 (numerical validation).

What landed:

- **B1 — `Qwen3_5Model::forward_with_vision`**: text-only `forward`
  unchanged; new method takes `(input_ids, offset, image_embeds,
  image_token_id)`, embeds tokens, locates `image_token_id`
  positions, splices via the new `splice_runs` helper. MRoPE
  applies text-positions to image tokens for Stage B (spatial
  MRoPE is the issue #15 numerical-validation follow-up). 2 unit
  tests for `splice_runs` covering contiguous + non-contiguous
  runs.

- **B2 — `ModelArch::forward_with_vision` dispatch**: routes
  Qwen3_5Dense to the new method; other arches return an error.
  Defence-in-depth — the HTTP layer (B6) already rejects image
  content for non-vision models.

- **B3 — `Job::ForwardLogitsWithImages`**: new worker variant
  carrying tokens + per-image `(pixels, c, h, w)` payloads. The
  dispatcher encodes each image (device-resident), concatenates
  the resulting embeddings, calls `arch.forward_with_vision`, and
  returns CPU logits. Image embeddings never copy back to CPU —
  the "tensors don't escape the worker" invariant from the
  per-device worker refactor still holds. Poisoned-worker drain
  path handles the new variant.

- **B4 — Prompt builder**:
  - `request_has_images` detects image content cheaply.
  - `extract_images_from_request(request, profile)` walks
    `MessageContent::Parts`, decodes data URIs, runs
    `harness::preprocess::preprocess` per image, returns
    `Vec<ImageInput>` in request order.
  - `expand_image_pad_tokens(input_ids, image_token_id,
    patches_per_image)` walks the tokenized prompt and replaces
    each `<|image_pad|>` (id 248056 for Qwen3.6) with N copies
    matching the per-image patch count. 4 unit tests.
  - `VisionMeta::from_config_path` peeks `config.json` at load
    time for `image_token_id`, vision_config patch/merge sizes,
    and derives `lm_tokens_per_image` for the Stage B fixed
    resolution.

- **B5 — `chat_completion` vision routing**: detects image
  content, validates the loaded model has vision, expands the
  prompt, and calls a new `run_inference_with_images_via_worker`
  helper that does single-shot prefill + standard decode loop
  (KV cache holds the post-splice hidden states from prefill, so
  decode steps don't re-splice). Stage B skips chunked prefill
  for vision — at 448×448 fixed resolution the budget stays well
  under the activation-memory threshold. Long-vision chunking is
  Stage D follow-up.

- **B6 — `InferenceError::VisionUnsupported`**: structured 400
  with `code=vision_unsupported, model_id, suggestion` when an
  image request hits a non-vision model. Closes the agent0
  failure mode where vision requests degraded silently.

- **B7 — `ModelInfo.capabilities`**: per-model array (`["text"]`
  vs `["text", "vision"]`) in `/v1/models` and forwarded verbatim
  by cortex-gateway. Lets clients (litellm, agent0) gate
  image_url submission on the declared capability set. Optional
  in the wire format; defaults to empty for older clients.
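
For readers skimming the B4 bullet, the pad-token expansion can be sketched as follows. This is a Python illustration, not the Rust code in this commit: the function name and parameters mirror the description above, while the error handling and the toy token ids (everything except 248056) are assumptions made for the example.

```python
from typing import List

IMAGE_PAD_ID = 248056  # `<|image_pad|>` id quoted above for Qwen3.6


def expand_image_pad_tokens(
    input_ids: List[int],
    image_token_id: int,
    patches_per_image: List[int],
) -> List[int]:
    """Replace the i-th placeholder with patches_per_image[i] copies of itself."""
    out: List[int] = []
    image_index = 0
    for token in input_ids:
        if token != image_token_id:
            out.append(token)
            continue
        # Assumed behaviour: a placeholder without a matching image is an error.
        if image_index >= len(patches_per_image):
            raise ValueError("more image placeholders than images")
        out.extend([image_token_id] * patches_per_image[image_index])
        image_index += 1
    # Assumed behaviour: an image without a placeholder is also an error.
    if image_index != len(patches_per_image):
        raise ValueError("more images than image placeholders")
    return out


# Toy prompt: two text tokens, one image placeholder, one text token.
prompt = [1, 2, IMAGE_PAD_ID, 3]
expanded = expand_image_pad_tokens(prompt, IMAGE_PAD_ID, [196])
```

Keeping the placeholder id itself as the repeated token means the splice step only has to find runs of one id, which is presumably what `splice_runs` consumes.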

CI gate: cargo fmt --check, cargo clippy --workspace --all-targets
-- -D warnings, cargo test --workspace (all 28 test groups ok,
124 lib tests). New unit-test counts: +2 splice_runs, +4
expand_image_pad.

Manual verification (after RPMs deploy on beast):

  curl http://hanzalova.internal:31313/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"Qwen/Qwen3.6-27B\", \"messages\":[{\"role\":\"user\",\"content\":[
      {\"type\":\"text\",\"text\":\"What's in this image?\"},
      {\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64,...\"}}
    ]}], \"max_tokens\":120}" | jq

  Expect prompt_tokens > 196 (text + 196 patch tokens) and a
  response that references actual image content.
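
The 196 figure follows from the Stage A vision config (`patch_size=16`, `spatial_merge_size=2`) at the fixed 448×448 resolution. A small Python sketch of that arithmetic, with a hypothetical helper name, assuming the merger collapses each 2×2 patch group into one LM token:

```python
def lm_tokens_per_image(height: int, width: int,
                        patch_size: int, spatial_merge_size: int) -> int:
    """Number of LM-side tokens one image occupies after the spatial merge."""
    # Each merged token covers a (patch_size * merge) square of pixels.
    cell = patch_size * spatial_merge_size
    if height % cell or width % cell:
        raise ValueError("resolution must be a multiple of patch_size * merge")
    patches = (height // patch_size) * (width // patch_size)
    return patches // (spatial_merge_size ** 2)


# 448 / 16 = 28 patches per side, 28 * 28 = 784 patches, 784 / 4 = 196 tokens.
tokens = lm_tokens_per_image(448, 448, patch_size=16, spatial_merge_size=2)
```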

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 15:33:00 +03:00
7df84fed8f feat(neuron): Stage A — vision tower load + preprocessor for Qwen3.6
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 28s
CI / Clippy (push) Successful in 2m35s
build-prerelease / Build cortex binary (push) Successful in 5m13s
build-prerelease / Build neuron-blackwell (push) Successful in 6m23s
build-prerelease / Build neuron-ampere (push) Successful in 7m56s
CI / Test (push) Successful in 7m11s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ada (push) Successful in 5m30s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 4m25s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Stage A of the vision implementation plan
(doc/vision-qwen3_6-spec.md). Builds the vision tower scaffolding
needed to fix today's silent-drop failure mode (issue #3): the
Qwen3.6 ViT loads from `model.visual.*`, runs a forward pass that
produces post-merger LM-side image embeddings, and routes through
the device worker via a new `Job::EncodeImage`. There is no LM
splice yet; that's Stage B.

Refs #3 (umbrella). Deferred sub-stages tracked as #12 (TP-vision),
#13 (27B production deploy), #14 (dynamic resolution), #15
(numerical validation).

What landed:

- **A0 — investigation**: pulled config.json, preprocessor_config.json,
  chat_template.jinja, and safetensors index from beast's local
  Qwen3.6-27B cache. Documented in doc/vision-qwen3_6-spec.md with
  exact tensor shapes for every `model.visual.*` weight. Confirms
  27-block ViT with `hidden_size=1152`, `patch_size=16`,
  `spatial_merge_size=2`, `out_hidden_size=5120`. Vision tower lives
  in 2 of the 15 safetensors shards.

- **A1 — deps + scaffolding**: added `image = "0.25"` (default-
  features off, PNG/JPEG/WebP/BMP/GIF) and `base64 = "0.22"` to
  crates/neuron/Cargo.toml. Created `harness::preprocess` and
  `harness::arch::qwen3_5::vision` modules.

- **A2 — preprocess.rs**: `decode_data_uri` strips
  `data:image/...;base64,...` → image bytes → `image::DynamicImage`
  (rejecting `http(s)://` URLs to avoid SSRF/recursion); `preprocess`
  resizes to a fixed `PreprocessProfile::qwen3_6()` (448×448),
  normalises to `[-1, 1]` per the model's mean/std=0.5, emits
  row-major `(3, H, W)` f32. 9 unit tests covering data URI parse,
  decode failure paths, grayscale-to-RGB promotion, and the
  exact-value normalisation contract.

- **A3 — vision.rs**: `VisionTower` struct with `patch_embed: Conv2d`,
  learned `pos_embed: Embedding`, 27 `VisionBlock`s (pre-LN +
  multi-head self-attention with fused QKV + GELU-tanh MLP +
  residuals), and `VisionMerger` (LayerNorm → 2×2 spatial concat →
  linear_fc1 → GELU-tanh → linear_fc2 to LM hidden_size).
  Includes the Conv3d→Conv2d fold trick documented at the top of
  the file — the published patch_embed.proj.weight is 5D
  `(1152, 3, 2, 16, 16)` but candle 0.10 has no Conv3d; for static
  images we sum-collapse the temporal axis. Video would need real
  Conv3d. 5 unit tests including the exact `gelu_pytorch_tanh`
  reference values from PyTorch.

- **A4 — wire vision into Qwen3_5ForCausalLM**: extended `Config`
  with optional `vision_config: Option<VisionConfig>` and
  `image_token_id`; `Qwen3_5ForCausalLM::new` now loads the vision
  tower when present, exposes `has_vision()` and `vision()` so the
  HTTP layer can advertise capability and so the encode path can
  reach it.

- **A5 — device worker `Job::EncodeImage`**: new job variant carrying
  CPU-side `(C, H, W)` pixels. Dispatch handler reconstructs the
  tensor on the worker's device, calls `arch.encode_image(image)`,
  copies the result back to CPU as flat `Vec<f32>`. Keeps the
  "tensors don't escape the worker" invariant. Poisoned-worker
  drain path handles the new variant.

- **A6 — dispatch round-trip test**:
  `encode_image_routes_to_dispatch_and_errors_on_unknown_handle`
  proves the channel/dispatch wiring
  works end-to-end via the CPU device worker (errors on unknown
  ArchHandle, which is the expected behaviour without a loaded
  model — real-weights validation happens in Stage B when the LM
  splice path exists).
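
The A2 normalisation contract can be illustrated with a short Python sketch. It is not the Rust `preprocess` function: the helper name is hypothetical, resizing is omitted, and rescaling 8-bit values by 1/255 before applying mean/std of 0.5 is an assumption consistent with the stated `[-1, 1]` output range.

```python
from typing import List, Sequence, Tuple

MEAN = 0.5  # per-channel mean quoted in A2
STD = 0.5   # per-channel std quoted in A2


def normalise_chw(pixels_hwc: Sequence[Sequence[Tuple[int, int, int]]]) -> List[float]:
    """Turn rows of 8-bit RGB pixels into a flat row-major (3, H, W) float list."""
    height = len(pixels_hwc)
    width = len(pixels_hwc[0])
    out = [0.0] * (3 * height * width)
    for y, row in enumerate(pixels_hwc):
        for x, rgb in enumerate(row):
            for c, value in enumerate(rgb):
                # Assumed rescale to [0, 1], then (v - mean) / std -> [-1, 1].
                out[c * height * width + y * width + x] = (value / 255.0 - MEAN) / STD
    return out


# A 1x2 image: first pixel (0, 255, 0), second pixel (255, 0, 255).
flat = normalise_chw([[(0, 255, 0), (255, 0, 255)]])
```

The output is channel-major: all red values first, then green, then blue, which matches the `(3, H, W)` layout the worker reconstructs on the device.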

CI gate: cargo fmt --check, cargo clippy --workspace --all-targets
-- -D warnings, cargo test --workspace (all 28 test groups ok,
zero failures). New test counts: +9 in preprocess, +5 in vision,
+1 in device_worker.

Out of scope (deferred):
- LM-side splice of image embeddings at `<|image_pad|>` positions
  → Stage B.
- Streaming SSE for vision-bearing chat completions → Stage C.
- Reject `image_url` with HTTP 400 for non-vision models /
  advertise `capabilities` in /v1/models → Stage C.
- TP-vision (#12), 27B production deploy (#13), dynamic resolution
  (#14), numerical validation (#15).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 11:40:47 +03:00
5c520c7e90 feat(deploy): gitea workflow for rolling RPM deploys + host bootstrap
Replace operator-run script/deploy.sh with a CI-driven rolling deploy:

- .gitea/workflows/deploy.yml fires on build-prerelease success (and is
  re-runnable via workflow_dispatch). Cortex upgrades first on
  hanzalova.internal; the three neuron hosts upgrade in parallel under
  fail-fast: false so one failing host doesn't sink the rest.
  Concurrency-grouped to serialize overlapping deploys, never cancelling
  in-flight runs (a half-applied dnf transaction is worse than a stale
  deploy).

- asset/sudoers.d/{cortex,neuron}-host.conf are the canonical source for
  the scoped privileges gitea_ci needs on each host kind, installed as
  /etc/sudoers.d/helexa_gitea_ci. URLs and = signs are backslash-escaped
  per sudoers reserved-character rules.

- script/infra-setup.sh idempotently provisions the gitea_ci user,
  installs the runner pubkey, drops in the appropriate sudoers fragment
  with visudo verification, and syncs cortex.toml / models.toml /
  per-host asset/neuron/<short>.toml — config still ships from operator
  workstations rather than CI because the first two are gitignored.

The CI-only secret is RSYNC_SSH_KEY (already configured for the repo);
the matching pubkey is ~/.ssh/id_gitea_ci.pub on the operator's box.

script/deploy.sh and asset/manifest.yml are left in place until the
first end-to-end deploy workflow run succeeds, then removed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 14:58:23 +03:00
d0292ed377 feat(cortex): catalogue source field + scheme-qualified /models/load
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Format (push) Successful in 40s
CI / Test (push) Failing after 1m3s
CI / Clippy (push) Successful in 2m43s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m13s
build-prerelease / Build neuron-ampere (push) Successful in 7m31s
build-prerelease / Build neuron-ada (push) Successful in 8m16s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m21s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Build cortex binary (push) Successful in 4m5s
build-prerelease / Package cortex RPM (push) Successful in 1m30s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Phase 3 of plan-source-aware-loader-preflight. Adds an optional
`source` field to `ModelProfile` and threads it through the
router's cold-load path so a profile pointing at the helexa
registry forwards `helexa:<id>` to neuron's `/models/load`
instead of leaving neuron to substitute its `default_source`
(typically `huggingface`).

Without this, an operator who declares
`source = "helexa"` in models.toml would still see neuron fetch
from HuggingFace — the catalogue → ModelSpec translation in
`profile_to_spec` was dropping the scheme on the floor.

What lands:

- `cortex-core::catalogue::ModelProfile.source: Option<String>`.
  None is the default and preserves pre-Phase-3 behaviour.
- `cortex-gateway::router::qualified_model_id(profile)` —
  small pure helper, extracted from `profile_to_spec` so it can
  be unit-tested. An empty-string `source` is treated as None so
  operators who blank out a previously-set value don't trigger
  an empty-scheme parse failure in neuron.
- `models.example.toml` documents the new field with a
  commented-out helexa-scheme example pointing back at
  neuron.example.toml's matching sources block.
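
The three behaviours of `qualified_model_id` listed above (pass-through on None, prefix on Some, pass-through on empty string) can be sketched in Python. The real helper is Rust and takes a `ModelProfile`; passing the id and source as separate arguments here is a simplification made for the example.

```python
from typing import Optional


def qualified_model_id(model_id: str, source: Optional[str]) -> str:
    """Prefix the model id with `source:` unless the source is absent or empty."""
    if not source:  # covers both None and ""
        return model_id
    return f"{source}:{model_id}"


bare = qualified_model_id("Qwen/Qwen3.6-27B", None)
prefixed = qualified_model_id("Qwen/Qwen3.6-27B", "helexa")
blanked = qualified_model_id("Qwen/Qwen3.6-27B", "")
```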

Tests:

- 2 new unit tests in `cortex-core::catalogue`: source-absent
  round-trip and source-present round-trip through TOML.
- 3 new unit tests in `cortex-gateway::router`: pass-through
  when None, prefix when Some, pass-through on empty-string
  source.
- ModelProfile literal in catalogue's existing test updated to
  carry `source: None`.

CI gate: cargo fmt --check, cargo clippy --workspace
--all-targets -- -D warnings, cargo test --workspace
(24 test groups ok, zero failures).

Completes Phase 3. With Phases 1+2+3 landed:
- neuron parses `scheme:org/name`, routes per-source hf-hub
  Api with disambiguated cache.
- preflight returns structured errors before any device
  allocation.
- cortex catalogue declares per-model source jurisdiction
  and forwards it to neuron.

The registry itself (registry.helexa.ai service, MinIO,
nginx, mirror fabric) is the next moving piece — landing
under a separate project per the design discussion.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 14:53:58 +03:00
d4e1b05956 feat(neuron,cortex-core): source-aware loader (scheme:org/name)
All checks were successful
CI / CUDA type-check (push) Successful in 46s
CI / Format (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Clippy (push) Successful in 2m40s
build-prerelease / Build cortex binary (push) Successful in 4m23s
CI / Test (push) Successful in 5m28s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m39s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ampere (push) Successful in 7m53s
build-prerelease / Build neuron-ada (push) Successful in 5m18s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
Phase 1 of plan-source-aware-loader-preflight. Makes neuron's
loader treat `huggingface:org/name` and `helexa:org/name` as
first-class distinct sources with per-source endpoint + cache,
while staying backwards-compatible with bare `org/name` ids.
Zero behavior change for existing operator configs.

Motivation: helexa is adding an EU-hosted registry
(`registry.helexa.ai`) alongside HF. Both speak HF-compatible
wire format, but the bytes, jurisdiction, trust root, and cache
namespace are distinct. The loader needs to disambiguate which
registry serves a given model id, and to keep their caches from
colliding on disk when both happen to host the same `org/name`.

What lands:

- `cortex-core::source` — new module. `ModelSourceId { scheme,
  org, name }` with `FromStr` accepting both `scheme:org/name`
  and bare `org/name`. `Display` round-trips. `repo_path()`
  emits the `org/name` half for the hf-hub `Api::model(...)`
  call regardless of which scheme/endpoint we're hitting.
  Rejects malformed input with typed `ParseError` variants
  (empty scheme, missing slash, scheme with `/`, name with
  `:`, etc.).

- `neuron::config::CandleHarnessConfig` gains
  `default_source: Option<String>` and
  `sources: HashMap<String, SourceConfig>`. `SourceConfig`
  mirrors what `hf_hub::ApiBuilder` consumes: endpoint URL,
  optional `auth_env` (env var name read at startup so secrets
  stay out of TOML), and optional cache_dir. Defaults
  synthesise a `huggingface` entry pointing at
  `https://huggingface.co` with the legacy `hf_cache` field as
  its cache_dir — so existing configs that only set `hf_cache`
  keep working unchanged.

- `CandleHarness::new(bind_url, &CandleHarnessConfig)` replaces
  `CandleHarness::new(bind_url, hf_cache)`. Resolves every
  configured source's auth env var and cache dir up front so
  `hf_api_for(scheme)` is a pure HashMap lookup on the hot
  load path. Only the `huggingface` scheme gets the legacy
  `HF_HUB_CACHE`/`HF_HOME` env-var fallback chain; other
  schemes resolve to whatever the operator typed.

- `hf_api()` -> `hf_api_for(scheme)`. Builds an
  `hf_hub::Api` with the source's endpoint, cache_dir, and
  auth token. Errors with a useful message naming the
  configured schemes when an unknown scheme is requested.

- `CandleHarness::load_model` parses `spec.model_id` into a
  `ModelSourceId`, substitutes `default_source` for bare ids,
  and threads the parsed source through `preflight`,
  `resolve_files`, `resolve_dense_files`, `load_arch_gguf`,
  `load_arch_dense`, and `load_tp`. The hf-hub `Api::model()`
  call now uses `source_id.repo_path()` so registry calls hit
  the right URL shape regardless of scheme.

- `preflight()` signature gains a `&ModelSourceId` parameter
  (it's the canonical id for log lines and error display);
  `RepoFetchFailed.model_id` etc. now carry the
  scheme-qualified form so operator-visible errors echo
  exactly what was configured.

- `neuron.example.toml` documents the new
  `[harness.candle.sources.*]` table with commented-out
  examples for `huggingface` (explicit override) and `helexa`.
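
To make the parsing rules concrete, here is a minimal Python sketch of the `scheme:org/name` grammar described for `cortex-core::source`. It is an illustration only: the real type is a Rust struct with typed `ParseError` variants, and the optional `scheme`, the `with_default` helper, and the exact error messages below are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


class ParseError(ValueError):
    """Stand-in for the typed ParseError variants mentioned above."""


@dataclass(frozen=True)
class ModelSourceId:
    scheme: Optional[str]  # None for a bare `org/name` id (assumed shape)
    org: str
    name: str

    @classmethod
    def parse(cls, text: str) -> "ModelSourceId":
        scheme: Optional[str] = None
        rest = text
        if ":" in text:
            scheme, rest = text.split(":", 1)
            if not scheme:
                raise ParseError("empty scheme")
            if "/" in scheme:
                raise ParseError("scheme contains '/'")
        if "/" not in rest:
            raise ParseError("missing slash")
        org, name = rest.split("/", 1)
        if not org or not name:
            raise ParseError("empty org or name")
        if ":" in name or "/" in name:
            raise ParseError("name contains ':' or '/'")
        return cls(scheme, org, name)

    def repo_path(self) -> str:
        """The `org/name` half, independent of scheme."""
        return f"{self.org}/{self.name}"

    def with_default(self, default_scheme: str) -> "ModelSourceId":
        """Fill in the scheme for bare ids; leave qualified ids untouched."""
        if self.scheme is not None:
            return self
        return ModelSourceId(default_scheme, self.org, self.name)

    def __str__(self) -> str:
        prefix = f"{self.scheme}:" if self.scheme else ""
        return f"{prefix}{self.repo_path()}"


qualified = ModelSourceId.parse("helexa:Qwen/Qwen3.6-27B")
bare = ModelSourceId.parse("Qwen/Qwen3.6-27B").with_default("huggingface")
```

Both ids share the same `repo_path()`, which is why the cache namespace has to be keyed on the scheme as well.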

Tests:

- 13 new unit tests in `cortex-core::source` covering parse /
  display round-trip, default-scheme substitution semantics,
  and every `ParseError` variant.
- 6 new unit tests in `neuron::config` covering the
  `effective_sources` synth (legacy `hf_cache` carry-through,
  explicit override preservation, helexa-alongside-huggingface)
  and `effective_default_source` fallback.
- 2 new unit tests in `harness::candle::tests` covering
  multi-scheme `hf_api_for` routing, including the
  "unknown scheme" error path naming configured schemes.
- Preflight integration tests updated to construct
  `ModelSourceId` and assert against the scheme-qualified
  error form.
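
The `effective_sources` synthesis exercised by the config tests could be sketched as below. This is a Python approximation of the described behaviour, not the Rust implementation: the field names follow the `SourceConfig` description, and falling back to `huggingface` when `default_source` is unset is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SourceConfig:
    endpoint: str
    auth_env: Optional[str] = None   # name of an env var, not the secret itself
    cache_dir: Optional[str] = None


def effective_sources(
    sources: Dict[str, SourceConfig],
    hf_cache: Optional[str],
) -> Dict[str, SourceConfig]:
    """Add a default `huggingface` entry unless the operator declared one."""
    merged = dict(sources)
    if "huggingface" not in merged:
        merged["huggingface"] = SourceConfig(
            endpoint="https://huggingface.co",
            cache_dir=hf_cache,  # legacy `hf_cache` carries through
        )
    return merged


def effective_default_source(default_source: Optional[str]) -> str:
    # Assumed fallback when the operator sets no default.
    return default_source or "huggingface"


legacy = effective_sources({}, hf_cache="/var/cache/hf")
mixed = effective_sources(
    {"helexa": SourceConfig(endpoint="https://registry.helexa.ai")},
    hf_cache=None,
)
```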

CI gate: cargo fmt --check, cargo clippy --workspace
--all-targets -- -D warnings, cargo test --workspace (all 24
test groups ok, zero failures).

Out of scope (Phase 3):
- Cortex catalogue `source` field — independent of Phase 1+2,
  ships when the registry comes online.
- `helexa` source endpoint itself — separate project; this
  PR adds the client-side rails only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 13:42:11 +03:00
157 changed files with 25529 additions and 1114 deletions


@@ -1,11 +1,20 @@
name: build-prerelease
# Manually-dispatched workflow that builds CUDA-flavoured neuron binaries
# (and a single cortex binary), packages each as a Fedora RPM, signs
# them, and publishes to the `unstable` channel at rpm.lair.cafe.
# Builds CUDA-flavoured neuron binaries (and a single cortex binary),
# packages each as a Fedora RPM, signs them, and publishes to the
# `unstable` channel at rpm.lair.cafe.
#
# Trigger from the Gitea UI: Actions → build-prerelease → Run workflow.
# Optionally provide a `ref` to build from a non-default branch.
# Change-aware: the `prepare` job diffs HEAD against the git sha
# embedded in the most recently *published* unstable RPM (per package)
# and skips builds whose inputs didn't change. Docs-only commits build
# nothing; gateway-only commits skip the 3 CUDA builds (and, via
# deploy.yml's own check-update gate, the neuron restarts + model
# cold-loads). Diffing against the published sha — not the previous
# push — means a failed run can never cause a change to be missed.
#
# Lint (fmt+clippy) and test run here as parallel jobs and gate
# `publish`; ci.yml no longer runs on pushes to main (see its trigger
# comment), so the two workflows stop competing for the same runners.
#
# The published packages are versioned as e.g.
# helexa-neuron-blackwell-0.1.16-0.1.20260518T140530.gitabcdef0.fc43.x86_64
@@ -22,6 +31,7 @@ on:
push:
branches: [main]
# Manual dispatch still available to build from a non-main ref.
# Dispatched runs skip change detection and build everything.
workflow_dispatch:
inputs:
ref:
@@ -29,15 +39,15 @@ on:
required: false
default: ""
# Coalesce same-ref pushes: a newer push cancels the older in-flight
# run — the newest commit is the one we want on the fleet. The publish
# job keeps its own `rpm-publish` group (cancel=false) so an in-flight
# repo update is never interrupted. Runners are ephemeral (one VM per
# job) so concurrent runs no longer race on a shared workspace; the
# old shared `cortex-runner-pool` group with ci.yml is gone.
concurrency:
# Share the group with ci.yml so the two workflows can't run
# concurrently on the same `rust` runner (act reuses the workspace
# cache and races destroy each other's build files mid-compile).
# cancel-in-progress=false → workflows queue; if a newer push lands,
# the older run is still picked up by ci.yml's own ref-keyed
# concurrency (same group, queued).
group: cortex-runner-pool-${{ github.ref }}
cancel-in-progress: false
group: build-prerelease-${{ github.ref }}
cancel-in-progress: true
env:
CARGO_INCREMENTAL: "0"
@@ -45,13 +55,18 @@ env:
jobs:
prepare:
name: Resolve version stamps
name: Resolve version stamps + change detection
timeout-minutes: 10
runs-on: rust
outputs:
version: ${{ steps.info.outputs.version }}
release: ${{ steps.info.outputs.release }}
short_sha: ${{ steps.info.outputs.short_sha }}
commit_timestamp: ${{ steps.info.outputs.commit_timestamp }}
build_cortex: ${{ steps.changes.outputs.build_cortex }}
build_neuron: ${{ steps.changes.outputs.build_neuron }}
build_bench: ${{ steps.changes.outputs.build_bench }}
check_rust: ${{ steps.changes.outputs.check_rust }}
steps:
- uses: actions/checkout@v4
with:
@@ -78,19 +93,164 @@ jobs:
echo "short_sha=${SHORT_SHA}" >> "$GITHUB_OUTPUT"
echo "commit_timestamp=${COMMIT_TIMESTAMP}" >> "$GITHUB_OUTPUT"
- id: changes
run: |
set -ux
# Default: build everything. Detection only ever narrows
# this, and any failure along the way (manifest unreachable,
# unparsable, sha not in history after a force-push) leaves
# the full build in place. Manual dispatches always build
# everything — predictable when building odd refs.
BUILD_CORTEX=true
BUILD_NEURON=true
BUILD_BENCH=true
CHECK_RUST=true
if [ "${GITHUB_EVENT_NAME}" = "push" ]; then
MANIFEST_URL="https://rpm.lair.cafe/fedora/43/x86_64/unstable/packages.json"
if curl -fsS --max-time 20 -o /tmp/packages.json "$MANIFEST_URL"; then
# Latest published sha per package, by buildTime.
base_for() {
python3 - "$1" <<'PY'
import json, re, sys
name = sys.argv[1]
try:
with open("/tmp/packages.json") as f:
pkgs = json.load(f)["packages"]
cands = [p for p in pkgs if p.get("name") == name]
if cands:
latest = max(cands, key=lambda p: p.get("buildTime", 0))
m = re.search(r"git\.?([0-9a-f]{7,40})", latest.get("release", ""))
if m:
print(m.group(1))
except Exception:
pass
PY
}
# true if no usable base, else true iff the diff since
# the published sha touches the given path pattern.
decide() {
local base="$1" pattern="$2"
if [ -z "$base" ] \
|| ! git cat-file -e "${base}^{commit}" 2>/dev/null \
|| ! git merge-base --is-ancestor "$base" HEAD 2>/dev/null; then
echo true; return
fi
if git diff --name-only "${base}..HEAD" | grep -qE "$pattern"; then
echo true
else
echo false
fi
}
# cortex-core is shared by both binaries; Cargo.{toml,lock}
# affect both; this workflow file affects both.
NEURON_RE='^crates/neuron/|^crates/cortex-core/|^Cargo\.toml$|^Cargo\.lock$|^rpm/helexa-neuron-prerelease\.spec$|^data/neuron|^neuron\.example\.toml$|^\.gitea/workflows/build-prerelease\.yml$'
CORTEX_RE='^crates/cortex-gateway/|^crates/cortex-cli/|^crates/cortex-core/|^Cargo\.toml$|^Cargo\.lock$|^rpm/cortex-prerelease\.spec$|^data/cortex|^cortex\.example\.toml$|^models\.example\.toml$|^\.gitea/workflows/build-prerelease\.yml$'
BENCH_RE='^crates/helexa-bench/|^crates/cortex-core/|^Cargo\.toml$|^Cargo\.lock$|^rpm/helexa-bench-prerelease\.spec$|^data/helexa-bench|^helexa-bench\.example\.toml$|^\.gitea/workflows/build-prerelease\.yml$'
# Any Rust change (incl. crates not packaged here, e.g.
# helexa-acp) still needs lint+test on main.
RUST_RE='\.rs$|^crates/|Cargo\.toml$|^Cargo\.lock$'
CORTEX_BASE=$(base_for cortex)
NEURON_BASE=$(base_for helexa-neuron-blackwell)
BENCH_BASE=$(base_for helexa-bench)
BUILD_CORTEX=$(decide "$CORTEX_BASE" "$CORTEX_RE")
BUILD_NEURON=$(decide "$NEURON_BASE" "$NEURON_RE")
BUILD_BENCH=$(decide "$BENCH_BASE" "$BENCH_RE")
if [ "$BUILD_CORTEX" = "true" ] || [ "$BUILD_NEURON" = "true" ] || [ "$BUILD_BENCH" = "true" ]; then
CHECK_RUST=true
else
CHECK_RUST=$(decide "$CORTEX_BASE" "$RUST_RE")
fi
fi
fi
echo "build_cortex=${BUILD_CORTEX}" >> "$GITHUB_OUTPUT"
echo "build_neuron=${BUILD_NEURON}" >> "$GITHUB_OUTPUT"
echo "build_bench=${BUILD_BENCH}" >> "$GITHUB_OUTPUT"
echo "check_rust=${CHECK_RUST}" >> "$GITHUB_OUTPUT"
echo "### change detection: build_cortex=${BUILD_CORTEX} build_neuron=${BUILD_NEURON} build_bench=${BUILD_BENCH} check_rust=${CHECK_RUST}"
# fmt + clippy + test moved here from ci.yml for main pushes so the
# two workflows stop queueing against each other (ci.yml's checks
# used to delay build-cortex by ~12 minutes on the shared runner
# pool). They run in parallel with the builds and gate `publish`,
# not the builds themselves — a clippy warning still can't reach the
# fleet, but it also doesn't serialize the pipeline.
lint:
name: Lint (fmt + clippy)
timeout-minutes: 25
needs: prepare
if: needs.prepare.outputs.check_rust == 'true'
runs-on: rust
env:
RUSTC_WRAPPER: sccache
SCCACHE_BUCKET: sccache
SCCACHE_ENDPOINT: http://caveman.kosherinata.internal:9000
SCCACHE_REGION: auto
SCCACHE_S3_USE_SSL: "false"
AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- run: cargo fmt --check --all
# Failure-aware sccache escalation lives in the shared script: a
# signal death (rustc SIGSEGV / OOM-kill) keeps the cache and fails
# fast instead of triggering a slower uncached rebuild; only a real
# sccache fault drops the cache. See script/ci-cargo-escalate.sh.
- name: Clippy (sccache escalation)
run: script/ci-cargo-escalate.sh cargo clippy --workspace -- -D warnings
test:
name: Test
timeout-minutes: 25
needs: prepare
if: needs.prepare.outputs.check_rust == 'true'
runs-on: rust
env:
RUSTC_WRAPPER: sccache
SCCACHE_BUCKET: sccache
SCCACHE_ENDPOINT: http://caveman.kosherinata.internal:9000
SCCACHE_REGION: auto
SCCACHE_S3_USE_SSL: "false"
AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
# See script/ci-cargo-escalate.sh for the escalation rationale.
- name: Test (sccache escalation)
run: script/ci-cargo-escalate.sh cargo test --workspace
build-cortex:
name: Build cortex binary
timeout-minutes: 25
needs: prepare
if: needs.prepare.outputs.build_cortex == 'true'
# runner-rust image already provides rust/cargo/clippy/rustfmt via
# dnf — no rustup install step needed.
runs-on: rust
env:
RUSTC_WRAPPER: sccache
SCCACHE_BUCKET: sccache
SCCACHE_ENDPOINT: http://caveman.kosherinata.internal:9000
SCCACHE_REGION: auto
SCCACHE_S3_USE_SSL: "false"
AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- name: Build cortex (release)
run: cargo build --release -p cortex-cli
# See script/ci-cargo-escalate.sh for the escalation rationale.
- name: Build cortex (release, sccache escalation)
run: script/ci-cargo-escalate.sh cargo build --release -p cortex-cli
- name: Stage binary
run: |
@@ -104,9 +264,50 @@ jobs:
path: artifacts/cortex
retention-days: 1
build-bench:
name: Build helexa-bench binary
timeout-minutes: 25
needs: prepare
if: needs.prepare.outputs.build_bench == 'true'
# Pure-Rust, non-CUDA binary — same runner as cortex.
runs-on: rust
env:
RUSTC_WRAPPER: sccache
SCCACHE_BUCKET: sccache
SCCACHE_ENDPOINT: http://caveman.kosherinata.internal:9000
SCCACHE_REGION: auto
SCCACHE_S3_USE_SSL: "false"
AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- name: Build helexa-bench (release, sccache escalation)
run: |
# Stamp the SHA helexa-bench records as bench_sha against every
# run (option_env! in sweep.rs reads it at compile time).
export HELEXA_BUILD_SHA="$(git rev-parse HEAD)"
script/ci-cargo-escalate.sh cargo build --release -p helexa-bench
- name: Stage binary
run: |
mkdir --parents artifacts
cp target/release/helexa-bench artifacts/helexa-bench
./artifacts/helexa-bench --version || true
- uses: actions/upload-artifact@v3
with:
name: bench-fc43
path: artifacts/helexa-bench
retention-days: 1
build-neuron:
name: Build neuron-${{ matrix.flavour }}
timeout-minutes: 35
needs: prepare
if: needs.prepare.outputs.build_neuron == 'true'
strategy:
fail-fast: false
matrix:
@@ -117,34 +318,53 @@ jobs:
cuda_home: /usr/local/cuda-13.0
build_jobs: 8
nvcc_threads: 4
cargo_features: "cuda cudnn flash-attn"
cargo_features: "cuda cudnn"
- flavour: ada
compute_cap: "89"
runner: cuda-13.0
cuda_home: /usr/local/cuda-13.0
build_jobs: 8
nvcc_threads: 4
cargo_features: "cuda cudnn flash-attn"
cargo_features: "cuda cudnn"
- flavour: blackwell
compute_cap: "120"
runner: cuda-13.0
cuda_home: /usr/local/cuda-13.0
build_jobs: 8
nvcc_threads: 4
cargo_features: "cuda cudnn flash-attn"
cargo_features: "cuda cudnn"
runs-on: ${{ matrix.runner }}
env:
SCCACHE_BUCKET: sccache
SCCACHE_ENDPOINT: http://caveman.kosherinata.internal:9000
SCCACHE_REGION: auto
SCCACHE_S3_USE_SSL: "false"
AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
# sccache handling + failure classification lives in
# script/ci-cargo-escalate.sh: it probes for sccache (the CUDA
# image may not ship it — a missing binary degrades to an uncached
# build rather than failing at `sccache rustc -vV`), and a rustc
# SIGSEGV / OOM-kill keeps the cache and fails fast instead of
# escalating to a slower uncached rebuild. The cache covers the
# ~600-crate host-side dep tree (the bulk of the 10-14 min build),
# shared across all three flavours, so even one run seeds the next.
- name: Build neuron with CUDA (${{ matrix.flavour }})
run: |
set -eux
export PATH="${{ matrix.cuda_home }}/bin:${PATH}"
export LD_LIBRARY_PATH="${{ matrix.cuda_home }}/targets/x86_64-linux/lib:${{ matrix.cuda_home }}/lib64:${LD_LIBRARY_PATH:-}"
export LIBRARY_PATH="${{ matrix.cuda_home }}/targets/x86_64-linux/lib:${{ matrix.cuda_home }}/lib64:${LIBRARY_PATH:-}"
cargo build --release -p neuron --features "${{ matrix.cargo_features }}"
# Pin the build SHA neuron reports from GET /version. The git
# fallback in build.rs would also work on a full checkout, but
# injecting the exact checked-out commit is unambiguous under
# shallow/detached states and makes the artifact self-describing.
export HELEXA_BUILD_SHA="$(git rev-parse HEAD)"
script/ci-cargo-escalate.sh cargo build --release -p neuron --features "${{ matrix.cargo_features }}"
env:
CUDA_COMPUTE_CAP: ${{ matrix.compute_cap }}
CARGO_BUILD_JOBS: ${{ matrix.build_jobs }}
@@ -164,6 +384,7 @@ jobs:
package-cortex:
name: Package cortex RPM
timeout-minutes: 20
needs: [prepare, build-cortex]
runs-on: rpm
steps:
@@ -200,8 +421,47 @@ jobs:
path: ~/rpmbuild/RPMS/x86_64/*.rpm
retention-days: 7
package-bench:
name: Package helexa-bench RPM
timeout-minutes: 20
needs: [prepare, build-bench]
runs-on: rpm
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- uses: actions/download-artifact@v3
with:
name: bench-fc43
path: artifacts/
- name: Build RPM
run: |
set -eux
rm -f ~/.rpmmacros
rpmdev-setuptree
cp artifacts/helexa-bench ~/rpmbuild/SOURCES/
cp data/helexa-bench.service ~/rpmbuild/SOURCES/
cp data/helexa-bench-sysusers.conf ~/rpmbuild/SOURCES/
cp data/helexa-bench-firewalld.xml ~/rpmbuild/SOURCES/
cp helexa-bench.example.toml ~/rpmbuild/SOURCES/
cp LICENSE ~/rpmbuild/SOURCES/
rpmbuild -bb rpm/helexa-bench-prerelease.spec \
--define "bench_version ${{ needs.prepare.outputs.version }}" \
--define "bench_prerelease ${{ needs.prepare.outputs.release }}" \
--undefine dist \
--define "dist .fc43"
- uses: actions/upload-artifact@v3
with:
name: rpm-bench-fc43
path: ~/rpmbuild/RPMS/x86_64/*.rpm
retention-days: 7
package-neuron:
name: Package helexa-neuron-${{ matrix.flavour }} RPM
timeout-minutes: 20
needs: [prepare, build-neuron]
runs-on: rpm
strategy:
@@ -247,7 +507,22 @@ jobs:
publish:
name: Publish to rpm.lair.cafe (unstable)
needs: [package-cortex, package-neuron]
timeout-minutes: 25
needs: [lint, test, package-cortex, package-neuron, package-bench]
# Runs when at least one package was built and nothing failed.
# lint/test may be skipped (docs-only refs never get here because
# no packages build), but a real failure in any blocks the
# fleet from receiving the RPMs.
if: >-
${{
!cancelled()
&& (needs.lint.result == 'success' || needs.lint.result == 'skipped')
&& (needs.test.result == 'success' || needs.test.result == 'skipped')
&& (needs.package-cortex.result == 'success' || needs.package-neuron.result == 'success' || needs.package-bench.result == 'success')
&& needs.package-cortex.result != 'failure'
&& needs.package-neuron.result != 'failure'
&& needs.package-bench.result != 'failure'
}}
runs-on: rpm
concurrency:
group: rpm-publish


@@ -1,21 +1,25 @@
name: CI
# Pushes to main are deliberately excluded: build-prerelease.yml runs
# its own lint/test jobs there (gating publish), and running both
# workflows on the same push made them queue against each other on the
# same runner labels — ~12 minutes of added latency per deploy. Feature
# branches, PRs to main, and release tags keep the full gate here.
on:
push:
branches: ["**"]
branches-ignore: [main]
tags: ["v*"]
pull_request:
branches: [main]
# Share a concurrency group with build-prerelease.yml so the two
# workflows don't race on the same `rust` runner workspace (act's
# /root/.cache/act/<hash>/hostexecutor/ is shared across concurrent
# jobs and one job's checkout step nukes another's in-flight build
# files). cancel-in-progress=false → they queue; same-ref pushes
# coalesce per workflow via cancel-in-progress on each.
# Coalesce same-ref pushes; a newer push supersedes the in-flight run.
# (The old shared `cortex-runner-pool` group with build-prerelease.yml
# is gone — the workflows no longer trigger on the same refs, and
# ephemeral one-VM-per-job runners removed the shared-workspace race
# that group existed to serialize.)
concurrency:
group: cortex-runner-pool-${{ github.ref }}
cancel-in-progress: false
group: ci-${{ github.ref }}
cancel-in-progress: true
env:
CARGO_INCREMENTAL: "0"
@@ -37,6 +41,7 @@ env:
jobs:
fmt:
name: Format
timeout-minutes: 15
runs-on: rust
steps:
- uses: actions/checkout@v4
@@ -44,53 +49,26 @@ jobs:
clippy:
name: Clippy
timeout-minutes: 25
runs-on: rust
steps:
- uses: actions/checkout@v4
# sccache occasionally fails with spurious race-condition errors;
# retrying the same invocation succeeds without code changes.
# Allow up to 3 attempts before declaring real failure.
- name: Clippy (with retry)
run: |
for attempt in 1 2 3; do
echo "::group::clippy attempt ${attempt}"
if cargo clippy --workspace -- -D warnings; then
echo "::endgroup::"
exit 0
fi
echo "::endgroup::"
echo "clippy failed on attempt ${attempt}"
if [ "${attempt}" -lt 3 ]; then
sleep 5
fi
done
echo "clippy failed after 3 attempts"
exit 1
- run: sccache --show-stats
# Failure-aware sccache escalation lives in the shared script (kept
# in sync with build-prerelease.yml): a signal death (rustc SIGSEGV
# / OOM-kill) keeps the cache and fails fast instead of an uncached
# rebuild; only a real sccache fault drops the cache.
- name: Clippy (sccache escalation)
run: script/ci-cargo-escalate.sh cargo clippy --workspace -- -D warnings
test:
name: Test
timeout-minutes: 25
runs-on: rust
steps:
- uses: actions/checkout@v4
# See the clippy job for why this is retried.
- name: Test (with retry)
run: |
for attempt in 1 2 3; do
echo "::group::test attempt ${attempt}"
if cargo test --workspace; then
echo "::endgroup::"
exit 0
fi
echo "::endgroup::"
echo "test failed on attempt ${attempt}"
if [ "${attempt}" -lt 3 ]; then
sleep 5
fi
done
echo "test failed after 3 attempts"
exit 1
- run: sccache --show-stats
# See script/ci-cargo-escalate.sh for the escalation rationale.
- name: Test (sccache escalation)
run: script/ci-cargo-escalate.sh cargo test --workspace
# Type-check the CUDA-only code path. Borrow-check-only — we
# never run the tests here (the runner has no GPU). This catches
@@ -104,54 +82,44 @@ jobs:
# see commit history).
cuda-check:
name: CUDA type-check
timeout-minutes: 35
runs-on: cuda-13.0
# The workflow-level env sets `RUSTC_WRAPPER: sccache` for the
# `rust` runner (where fmt/clippy/test live and sccache is
# installed). The `cuda-13.0` runner doesn't have sccache on
# PATH, so inheriting the wrapper makes cargo bail with
# `could not execute process `sccache rustc -vV` (never executed)`
# before borrow-check even starts. Clear it locally. Also clear
# SCCACHE_* so cargo doesn't try to contact the cache (the
# remote auth headers come from secrets that aren't present on
# this runner either). Lose the cache, keep the gate.
# The workflow-level env sets `RUSTC_WRAPPER: sccache`
# unconditionally, which hard-fails cargo if the CUDA image
# doesn't ship sccache. Clear it at job level; the "Enable
# sccache when available" step opts back in only after probing
# for the binary. SCCACHE_*/AWS creds stay set — harmless when
# the wrapper is off, required when it's on.
env:
RUSTC_WRAPPER: ""
SCCACHE_BUCKET: ""
SCCACHE_ENDPOINT: ""
SCCACHE_REGION: ""
SCCACHE_S3_USE_SSL: ""
AWS_ACCESS_KEY_ID: ""
AWS_SECRET_ACCESS_KEY: ""
# candle-kernels' build script falls back to `nvidia-smi` for
# compute-cap detection when this is unset — and the GPU-less
# builder image doesn't ship nvidia-smi. Any valid cap works for
# a borrow-check; the real per-flavour caps live in
# build-prerelease.yml's matrix.
CUDA_COMPUTE_CAP: "86"
steps:
- uses: actions/checkout@v4
- name: cargo check --features cuda (with retry)
# sccache probing + failure classification lives in the shared
# script (see build-prerelease.yml's neuron build for the same
# pattern). It probes for sccache and, on a rustc SIGSEGV / OOM,
# keeps the cache and fails fast rather than rebuilding uncached.
- name: cargo check --features cuda (sccache escalation)
run: |
# act launches the step shell without /etc/profile, so the
# gitea_runner user's inherited PATH lacks /usr/local/cuda-13.0/bin.
# cudarc's build.rs:157 shells out to `nvcc --version` (because
# the neuron crate enables cuda-version-from-build-system) and
# panics with ENOENT if nvcc isn't resolvable. build-prerelease.yml
# does the same export — keep them in sync.
# cudarc's build.rs shells out to `nvcc --version` (the neuron
# crate enables cuda-version-from-build-system) and panics with
# ENOENT if nvcc isn't resolvable — keep this export in sync
# with build-prerelease.yml.
export PATH="/usr/local/cuda-13.0/bin:${PATH}"
export LD_LIBRARY_PATH="/usr/local/cuda-13.0/targets/x86_64-linux/lib:/usr/local/cuda-13.0/lib64:${LD_LIBRARY_PATH:-}"
export LIBRARY_PATH="/usr/local/cuda-13.0/targets/x86_64-linux/lib:/usr/local/cuda-13.0/lib64:${LIBRARY_PATH:-}"
for attempt in 1 2 3; do
echo "::group::cuda-check attempt ${attempt}"
if cargo check -p neuron --features cuda --all-targets; then
echo "::endgroup::"
exit 0
fi
echo "::endgroup::"
echo "cuda-check failed on attempt ${attempt}"
if [ "${attempt}" -lt 3 ]; then
sleep 5
fi
done
echo "cuda-check failed after 3 attempts"
exit 1
script/ci-cargo-escalate.sh cargo check -p neuron --features cuda --all-targets
srpm-cortex:
name: Build cortex SRPM
timeout-minutes: 25
runs-on: rpm
needs: [fmt, clippy, test, cuda-check]
if: startsWith(github.ref, 'refs/tags/v')
@@ -212,6 +180,7 @@ jobs:
srpm-neuron:
name: Build neuron SRPM
timeout-minutes: 25
runs-on: rpm
needs: [fmt, clippy, test, cuda-check]
if: startsWith(github.ref, 'refs/tags/v')
@@ -272,6 +241,7 @@ jobs:
copr-cortex:
name: Publish cortex to COPR
timeout-minutes: 60
runs-on: fedora-43
needs: srpm-cortex
steps:
@@ -289,6 +259,7 @@ jobs:
copr-neuron:
name: Publish neuron to COPR
timeout-minutes: 60
runs-on: fedora-43
needs: srpm-neuron
steps:
@@ -306,6 +277,7 @@ jobs:
bump-version:
name: Bump version in source
timeout-minutes: 15
runs-on: rust
needs: [copr-cortex, copr-neuron]
steps:
@@ -349,6 +321,6 @@ jobs:
echo "Nothing to commit for ${VERSION}"
else
git commit -m "chore: bump version to ${VERSION}"
git remote set-url origin "https://gitea-actions:${GITEA_TOKEN}@git.lair.cafe/helexa/cortex.git"
git remote set-url origin "https://gitea-actions:${GITEA_TOKEN}@git.lair.cafe/${{ github.repository }}.git"
git push origin HEAD:main
fi


@@ -0,0 +1,136 @@
name: deploy-dev
# Fast-path iteration deploy for a SINGLE neuron host: build one CUDA
# flavour, copy the raw binary to the host, restart neuron.service.
# Skips the other two flavours, all RPM packaging, signing, repo
# publish, and dnf — push-to-testable drops from ~20 min to roughly
# one CUDA build plus a service restart.
#
# This is a DEV convenience, not a release path:
# - the binary lands at /usr/bin/neuron *outside* RPM ownership;
# the next regular deploy.yml run reconciles the host back to the
# packaged binary (dnf sees the newer RPM and reinstalls). `rpm -V
# helexa-neuron-<flavour>` flagging a modified /usr/bin/neuron in
# the interim is expected.
# - nothing is published; other hosts are untouched.
# - requires the `install` sudoers rule from
# asset/sudoers.d/neuron-host.conf (re-run script/infra-setup.sh
# after updating it).
#
# Trigger from the Gitea UI: Actions → deploy-dev → Run workflow,
# pick the target host. Defaults to the ref you dispatch from, so it
# works from feature branches without touching main.
on:
workflow_dispatch:
inputs:
target:
description: "neuron host to deploy to"
required: true
type: choice
options: [beast, benjy, quadbrat]
default: beast
# One dev deploy at a time; a newer dispatch for the same host wins.
concurrency:
group: deploy-dev-${{ inputs.target }}
cancel-in-progress: true
env:
CARGO_INCREMENTAL: "0"
CARGO_TERM_COLOR: "always"
jobs:
build:
name: Build neuron (${{ inputs.target }})
runs-on: cuda-13.0
outputs:
flavour: ${{ steps.map.outputs.flavour }}
steps:
- uses: actions/checkout@v4
# host → flavour → compute cap. Keep in sync with the
# build-neuron matrix in build-prerelease.yml and the
# deploy-neurons matrix in deploy.yml.
- id: map
run: |
case "${{ inputs.target }}" in
beast) flavour=blackwell cap=120 ;;
benjy) flavour=ada cap=89 ;;
quadbrat) flavour=ampere cap=86 ;;
*) echo "unknown target ${{ inputs.target }}"; exit 1 ;;
esac
echo "flavour=${flavour}" >> "$GITHUB_OUTPUT"
echo "cap=${cap}" >> "$GITHUB_OUTPUT"
- name: Build neuron with CUDA
run: |
set -eux
export PATH="/usr/local/cuda-13.0/bin:${PATH}"
export LD_LIBRARY_PATH="/usr/local/cuda-13.0/targets/x86_64-linux/lib:/usr/local/cuda-13.0/lib64:${LD_LIBRARY_PATH:-}"
export LIBRARY_PATH="/usr/local/cuda-13.0/targets/x86_64-linux/lib:/usr/local/cuda-13.0/lib64:${LIBRARY_PATH:-}"
cargo build --release -p neuron --features "cuda cudnn"
env:
CUDA_COMPUTE_CAP: ${{ steps.map.outputs.cap }}
CARGO_BUILD_JOBS: "8"
NVCC_THREADS: "4"
- name: Stage binary
run: |
mkdir --parents artifacts
cp target/release/neuron artifacts/neuron-dev
file artifacts/neuron-dev
- uses: actions/upload-artifact@v3
with:
name: neuron-dev-${{ inputs.target }}
path: artifacts/neuron-dev
retention-days: 1
deploy:
name: Deploy to ${{ inputs.target }}
needs: build
runs-on: fedora-43
env:
DEPLOY_KEY: |
${{ secrets.RSYNC_SSH_KEY }}
TARGET_HOST: ${{ inputs.target }}.hanzalova.internal
steps:
- name: SSH init
run: |
mkdir -p ~/.ssh
echo "${DEPLOY_KEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
"gitea_ci@${TARGET_HOST}" 'hostname -f'
- uses: actions/download-artifact@v3
with:
name: neuron-dev-${{ inputs.target }}
path: artifacts/
- name: Copy binary to host
run: |
scp artifacts/neuron-dev "gitea_ci@${TARGET_HOST}:/var/lib/gitea_ci/neuron-dev"
- name: Install binary and restart neuron.service
run: |
ssh "gitea_ci@${TARGET_HOST}" '
set -eu
if systemctl is-active --quiet neuron.service; then
sudo /usr/bin/systemctl stop neuron.service
fi
# Exact command form required by the sudoers rule in
# asset/sudoers.d/neuron-host.conf — change both together.
sudo /usr/bin/install -o root -g root -m 0755 /var/lib/gitea_ci/neuron-dev /usr/bin/neuron
# enable --now so a dev deploy also leaves the unit enabled
# for boot, consistent with deploy.yml.
sudo /usr/bin/systemctl enable --now neuron.service
rm -f /var/lib/gitea_ci/neuron-dev'
- name: Capture neuron.service startup journal
if: always()
run: |
sleep 10
ssh "gitea_ci@${TARGET_HOST}" \
'journalctl --unit neuron.service -I --no-pager'

448
.gitea/workflows/deploy.yml Normal file

@@ -0,0 +1,448 @@
name: deploy
# Roll the freshly-published unstable RPMs onto the helexa fleet:
# cortex on the gateway, helexa-neuron-<flavour> on each neuron host,
# and helexa-bench on bob (the bench host).
#
# Triggered automatically after `build-prerelease` succeeds (by which
# point the new RPMs are live on rpm.lair.cafe/unstable), and also
# re-runnable manually from the Gitea UI.
#
# Each host self-gates: if dnf sees no newer package than what is
# installed, the service is left alone — no stop, no restart, no model
# cold-load. Combined with build-prerelease's change detection this
# means a docs- or gateway-only push never restarts the neurons (a
# neuron restart costs ~5 min of 27B cold-load, see issue #1).
#
# Per-host one-time setup (gitea_ci user, authorized_keys, scoped
# sudoers drop-in) lives in script/infra-setup.sh — run that once per
# host before this workflow can succeed.
on:
workflow_run:
workflows: [build-prerelease]
types: [completed]
workflow_dispatch:
# Serialize deploys. Overlapping runs would race on dnf metadata
# refresh and service-restart timing; queueing keeps the fleet
# predictable. Don't cancel an in-flight deploy — a half-applied dnf
# transaction is worse than a slightly stale deploy.
concurrency:
group: deploy
cancel-in-progress: false
env:
DEPLOY_KEY: |
${{ secrets.RSYNC_SSH_KEY }}
jobs:
deploy-cortex:
runs-on: fedora-43
# Two trigger paths: manual dispatch always runs; workflow_run
# only runs if the upstream `build-prerelease` actually succeeded.
if: >-
${{
github.event_name == 'workflow_dispatch'
|| github.event.workflow_run.conclusion == 'success'
}}
steps:
- name: SSH init
run: |
mkdir -p ~/.ssh
echo "${DEPLOY_KEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
gitea_ci@hanzalova.internal 'hostname -f'
# Gating compares `rpm -q` against the packages.json manifest the
# publish job maintains — NOT unprivileged `dnf check-update`,
# which proved unreliable as the gitea_ci user (hung on metadata
# locks on one host, silently reported "no updates" on others).
# An unreadable/unparsable manifest fails open: deploy proceeds.
- name: Deploy cortex (skips when already current)
run: |
ssh gitea_ci@hanzalova.internal 'bash -s' <<'DEPLOY'
set -eu
pkg=cortex
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' "${pkg}" 2>/dev/null || echo "not-installed")
latest=$(curl -fsS --max-time 15 "https://rpm.lair.cafe/fedora/43/x86_64/unstable/packages.json" 2>/dev/null \
| python3 -c '
import json, sys
name = sys.argv[1]
cands = [p for p in json.load(sys.stdin)["packages"] if p.get("name") == name]
if cands:
p = max(cands, key=lambda p: p.get("buildTime", 0))
print(p["version"] + "-" + p["release"])
' "${pkg}" 2>/dev/null || true)
if [ -n "${latest}" ] && [ "${latest}" = "${installed}" ]; then
echo "${pkg}-${installed} already current — leaving service untouched"
exit 0
fi
echo "installed=${installed} published=${latest:-unknown} — deploying"
if systemctl is-active --quiet cortex.service; then
sudo /usr/bin/systemctl stop cortex.service
fi
if rpm -q "${pkg}" >/dev/null 2>&1; then
sudo /usr/bin/dnf upgrade --refresh --allowerasing -y cortex
else
sudo /usr/bin/dnf install --refresh --allowerasing -y cortex
fi
sudo /usr/bin/systemctl daemon-reload
# enable --now: start the service AND enable it for boot so the
# fleet self-heals after a host reboot.
sudo /usr/bin/systemctl enable --now cortex.service
DEPLOY
# Wait for the service to either come up or wedge, then capture
# the latest-invocation journal. Runs even on prior failure so a
# failed start step still leaves a usable record in the deploy log.
- name: Capture cortex.service startup journal
if: always()
run: |
sleep 10
ssh gitea_ci@hanzalova.internal \
'journalctl --unit cortex.service -I --no-pager'
deploy-neurons:
needs: [deploy-cortex]
runs-on: fedora-43
strategy:
# One neuron failing must not cancel the others. Cortex is up
# already; a partial neuron deploy is strictly better than
# rolling back to zero.
fail-fast: false
matrix:
include:
# load_timeout: how long to wait for default_models to finish
# loading after a restart. beast cold-loads Qwen3.6-27B Q6K
# TP=2 (~5-6 min typical, see #1); benjy/quadbrat load small
# single-GPU models in well under a minute.
#
# max_prompt_tokens: per-model context cap, written to the
# neuron.service.d/model.conf drop-in (NEURON_MAX_PROMPT_TOKENS).
# A change here restarts the neuron even with no new RPM. Values
# are VRAM-safe ceilings derived per model — see
# doc/context-limits.md. beast (Qwen3.6-27B, hybrid linear, 2x
# 32GB) has ample KV headroom; benjy (Qwen3-8B dense, ~6GB free)
# is VRAM-bound and stays at the default; quadbrat (Qwen3-1.7B)
# likewise conservative.
- host: beast.hanzalova.internal
flavour: blackwell
load_timeout: 900
max_prompt_tokens: 131072
- host: benjy.hanzalova.internal
flavour: ada
load_timeout: 300
max_prompt_tokens: 16384
- host: quadbrat.hanzalova.internal
flavour: ampere
load_timeout: 300
max_prompt_tokens: 16384
steps:
- name: SSH init
run: |
mkdir -p ~/.ssh
echo "${DEPLOY_KEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
gitea_ci@${{ matrix.host }} 'hostname -f'
# See deploy-cortex for why gating uses the publish manifest and
# not unprivileged `dnf check-update`.
- name: Deploy helexa-neuron-${{ matrix.flavour }} (skips when already current)
run: |
ssh gitea_ci@${{ matrix.host }} 'bash -s' <<'DEPLOY'
set -eu
pkg=helexa-neuron-${{ matrix.flavour }}
max_prompt_tokens="${{ matrix.max_prompt_tokens }}"
# ── Desired per-model systemd drop-in ─────────────────────────
# model.conf carries NEURON_MAX_PROMPT_TOKENS so the context cap
# is deterministic per host and rolled out (with a restart) by
# this workflow, not hand-edited. It sorts after local.conf, so a
# deploy-managed value wins over any manual local override of the
# same variable. See doc/context-limits.md.
conf=/etc/systemd/system/neuron.service.d/model.conf
config_changed=0
if [ -n "${max_prompt_tokens}" ]; then
desired=$(printf '%s\n%s\n%s\n%s' \
"# Managed by .gitea/workflows/deploy.yml - do not edit by hand." \
"# Per-model context cap; see doc/context-limits.md." \
"[Service]" \
"Environment=NEURON_MAX_PROMPT_TOKENS=${max_prompt_tokens}")
[ "${desired}" = "$(cat "${conf}" 2>/dev/null || true)" ] || config_changed=1
fi
# ── Package version gate (manifest rationale: see deploy-cortex) ──
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' "${pkg}" 2>/dev/null || echo "not-installed")
latest=$(curl -fsS --max-time 15 "https://rpm.lair.cafe/fedora/43/x86_64/unstable/packages.json" 2>/dev/null \
| python3 -c '
import json, sys
name = sys.argv[1]
cands = [p for p in json.load(sys.stdin)["packages"] if p.get("name") == name]
if cands:
p = max(cands, key=lambda p: p.get("buildTime", 0))
print(p["version"] + "-" + p["release"])
' "${pkg}" 2>/dev/null || true)
pkg_changed=1
if [ -n "${latest}" ] && [ "${latest}" = "${installed}" ]; then
pkg_changed=0
fi
# Skip only when BOTH the package and the drop-in are unchanged —
# a context-cap change must restart the neuron even with no new RPM.
if [ "${pkg_changed}" -eq 0 ] && [ "${config_changed}" -eq 0 ]; then
echo "${pkg}-${installed} current; NEURON_MAX_PROMPT_TOKENS=${max_prompt_tokens:-<unset>} unchanged — leaving service untouched"
exit 0
fi
echo "installed=${installed} published=${latest:-unknown} pkg_changed=${pkg_changed} config_changed=${config_changed} — deploying"
# Write the drop-in (staged in gitea_ci's dir, installed root-owned).
if [ "${config_changed}" -eq 1 ]; then
printf '%s\n' "${desired}" > /var/lib/gitea_ci/model.conf
sudo /usr/bin/install -o root -g root -m 0644 -D /var/lib/gitea_ci/model.conf "${conf}"
rm -f /var/lib/gitea_ci/model.conf
echo "applied ${conf}: NEURON_MAX_PROMPT_TOKENS=${max_prompt_tokens}"
fi
if systemctl is-active --quiet neuron.service; then
sudo /usr/bin/systemctl stop neuron.service
fi
if [ "${pkg_changed}" -eq 1 ]; then
if rpm -q "${pkg}" >/dev/null 2>&1; then
sudo /usr/bin/dnf upgrade --refresh --allowerasing -y "${pkg}"
else
sudo /usr/bin/dnf install --refresh --allowerasing -y "${pkg}"
fi
fi
# daemon-reload picks up both a new unit (dnf) and the drop-in.
sudo /usr/bin/systemctl daemon-reload
# enable --now: start the service AND enable it for boot so the
# fleet self-heals after a host reboot.
sudo /usr/bin/systemctl enable --now neuron.service
# ── Post-deploy validation ────────────────────────────────
# A deploy only goes green if the neuron (a) finishes loading
# its default models and (b) answers a trivial prompt like an
# LLM should. Catches the class of bug where the binary
# starts fine but model load or inference is broken — which
# previously surfaced only when a human noticed. The wait
# polls /health activation (the structured source of the
# "loaded default model" journal line, plus per-model failure
# detail); the journal-capture step below still runs for
# forensics either way.
load_timeout=${{ matrix.load_timeout }}
echo "waiting for default models (timeout ${load_timeout}s)"
deadline=$(( $(date +%s) + load_timeout ))
health=""
while :; do
health=$(curl -fsS --max-time 5 http://localhost:13131/health 2>/dev/null || true)
state=$(printf %s "${health}" | python3 -c '
import json, sys
try:
print(json.load(sys.stdin).get("activation", {}).get("state", ""))
except Exception:
print("")
')
if [ "${state}" = "ready" ]; then
break
fi
if [ "$(date +%s)" -ge "${deadline}" ]; then
echo "FAIL: activation not ready within ${load_timeout}s (last state: ${state:-unreachable})"
exit 1
fi
sleep 10
done
model=$(printf %s "${health}" | python3 -c '
import json, sys
a = json.load(sys.stdin).get("activation", {})
failed = a.get("failed", [])
if failed:
for f in failed:
msg = "FAILED " + str(f.get("model_id")) + ": " + str(f.get("error", ""))[:400]
sys.stderr.write(msg + chr(10))
sys.exit(1)
completed = a.get("completed", [])
print(completed[0] if completed else "")
')
if [ -z "${model}" ]; then
echo "no default models configured — skipping LLM probe"
exit 0
fi
echo "LLM probe against ${model}"
probe_body=$(printf '{"model":"%s","messages":[{"role":"user","content":"Reply with exactly one word: pineapple"}],"max_tokens":512,"temperature":0}' "${model}")
resp=$(curl -fsS --max-time 180 -H "content-type: application/json" \
-d "${probe_body}" http://localhost:13131/v1/chat/completions) || {
echo "FAIL: probe request errored"
exit 1
}
if printf %s "${resp}" | grep -qi pineapple; then
echo "LLM probe passed"
else
echo "FAIL: probe response missing expected token"
printf %s "${resp}" | head -c 2000
echo
exit 1
fi
DEPLOY
- name: Ensure firewalld allows helexa-neuron
run: |
ssh gitea_ci@${{ matrix.host }} '
if ! sudo /usr/bin/firewall-cmd --query-service=helexa-neuron --quiet 2>/dev/null; then
sudo /usr/bin/firewall-cmd --add-service=helexa-neuron --permanent
sudo /usr/bin/firewall-cmd --reload
fi'
# Wait for the service to either come up or wedge, then capture
# the latest-invocation journal. Runs even on prior failure so a
# failed start step still leaves a usable record in the deploy log.
- name: Capture neuron.service startup journal
if: always()
run: |
sleep 10
ssh gitea_ci@${{ matrix.host }} \
'journalctl --unit neuron.service -I --no-pager'
# helexa-bench is a separate package on a separate host (bob), and it
# only consumes the fleet's HTTP APIs — it has no deploy-ordering
# dependency on cortex or the neurons (the sweep loop is version-aware
# and picks up whatever each neuron reports whenever). So it runs
# alongside the cortex→neurons chain rather than after it.
deploy-bench:
runs-on: fedora-43
if: >-
${{
github.event_name == 'workflow_dispatch'
|| github.event.workflow_run.conclusion == 'success'
}}
steps:
- name: SSH init
run: |
mkdir -p ~/.ssh
echo "${DEPLOY_KEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
gitea_ci@bob.hanzalova.internal 'hostname -f'
# See deploy-cortex for why gating uses the publish manifest and
# not unprivileged `dnf check-update`.
- name: Deploy helexa-bench (skips when already current)
run: |
ssh gitea_ci@bob.hanzalova.internal 'bash -s' <<'DEPLOY'
set -eu
pkg=helexa-bench
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' "${pkg}" 2>/dev/null || echo "not-installed")
latest=$(curl -fsS --max-time 15 "https://rpm.lair.cafe/fedora/43/x86_64/unstable/packages.json" 2>/dev/null \
| python3 -c '
import json, sys
name = sys.argv[1]
cands = [p for p in json.load(sys.stdin)["packages"] if p.get("name") == name]
if cands:
p = max(cands, key=lambda p: p.get("buildTime", 0))
print(p["version"] + "-" + p["release"])
' "${pkg}" 2>/dev/null || true)
if [ -n "${latest}" ] && [ "${latest}" = "${installed}" ]; then
echo "${pkg}-${installed} already current — leaving service untouched"
exit 0
fi
echo "installed=${installed} published=${latest:-unknown} — deploying"
if systemctl is-active --quiet helexa-bench.service; then
sudo /usr/bin/systemctl stop helexa-bench.service
fi
if rpm -q "${pkg}" >/dev/null 2>&1; then
sudo /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-bench
else
sudo /usr/bin/dnf install --refresh --allowerasing -y helexa-bench
fi
sudo /usr/bin/systemctl daemon-reload
# enable --now: start the service AND enable it for boot so the
# bench resumes collecting after a host reboot.
sudo /usr/bin/systemctl enable --now helexa-bench.service
# ── Post-deploy validation ────────────────────────────────
# The bench serves a read-only API on :13132 alongside the
# outbound sweep loop. Probe the API over localhost (bypasses
# firewalld) — catches a crash-on-start or a bad bind. Bail
# early if the unit drops out of active (Restart backoff).
echo "waiting for bench API on :13132"
deadline=$(( $(date +%s) + 30 ))
while :; do
if curl -fsS --max-time 5 http://localhost:13132/api/health >/dev/null 2>&1; then
echo "bench API healthy"
break
fi
if ! systemctl is-active --quiet helexa-bench.service; then
echo "FAIL: helexa-bench.service is not active"
systemctl --no-pager status helexa-bench.service | head -20 || true
exit 1
fi
if [ "$(date +%s)" -ge "${deadline}" ]; then
echo "FAIL: bench API not healthy within 30s"
exit 1
fi
sleep 3
done
DEPLOY
- name: Ensure firewalld allows helexa-bench
run: |
ssh gitea_ci@bob.hanzalova.internal '
if ! sudo /usr/bin/firewall-cmd --query-service=helexa-bench --quiet 2>/dev/null; then
sudo /usr/bin/firewall-cmd --add-service=helexa-bench --permanent
sudo /usr/bin/firewall-cmd --reload
fi'
# Wait for the service to either come up or wedge, then capture
# the latest-invocation journal. Runs even on prior failure so a
# failed start step still leaves a usable record in the deploy log.
- name: Capture helexa-bench.service startup journal
if: always()
run: |
sleep 10
ssh gitea_ci@bob.hanzalova.internal \
'journalctl --unit helexa-bench.service -I --no-pager'
# Build the bench UI and publish it to the public nginx vhost on the
# gateway (https://bench.helexa.ai). The vhost + Let's Encrypt cert are
# one-time host setup (script/infra-setup.sh); this job just refreshes
# the static assets. nginx reverse-proxies /api to the bob API, so the
# SPA is built same-origin (no VITE_API_BASE). Independent of the other
# deploy jobs.
deploy-bench-ui:
runs-on: fedora-43
if: >-
${{
github.event_name == 'workflow_dispatch'
|| github.event.workflow_run.conclusion == 'success'
}}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: "20"
- name: Build UI
run: |
cd bench
npm ci
npm run build
- name: SSH init
run: |
mkdir -p ~/.ssh
echo "${DEPLOY_KEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
gitea_ci@hanzalova.internal 'hostname -f'
- name: Rsync built UI to gateway webroot
run: |
rsync --archive --compress --delete \
--rsync-path 'sudo rsync' \
bench/dist/ \
gitea_ci@hanzalova.internal:/var/www/bench.helexa.ai/

3
.gitignore vendored

@@ -1,4 +1,6 @@
/target
/bench/node_modules
/bench/dist
*.swp
*.swo
.idea/
@@ -7,3 +9,4 @@ cortex.toml
models.toml
doc/plan/*
/target-cuda/
.claude/

268
AGENTS.md Normal file

@@ -0,0 +1,268 @@
# AGENTS.md — helexa/cortex
## Project Overview
helexa is a self-hosted LLM serving stack for multi-node GPU inference clusters. It has two components:
- **cortex** — the per-operator control plane and LLM proxy. A Rust reverse-proxy that sits in front of the fleet and presents a unified OpenAI + Anthropic compatible API surface. It handles model routing, lifecycle management (load/unload/evict), request translation, and metrics collection.
- **neuron** — the per-host LLM harness. One instance runs on every GPU host, serving candle-based in-process inference and managing local hardware discovery and model lifecycle.
## Repository Layout
```
cortex/
├── Cargo.toml # workspace root (Rust 2024 edition, GPL-3.0)
├── cortex.example.toml # example gateway config
├── models.example.toml # example model catalogue
├── neuron.example.toml # example neuron config
├── README.md # public-facing documentation
├── CLAUDE.md # detailed design rationale and implementation history
├── AGENTS.md # ← you are here
├── cortex.spec # RPM spec for cortex
├── helexa-neuron.spec # RPM spec for neuron (renamed to avoid Fedora collision)
├── rpm/ # prerelease RPM specs
│ ├── cortex-prerelease.spec
│ ├── helexa-neuron-prerelease.spec
│ └── helexa-bench-prerelease.spec
├── data/ # systemd units and example configs for packaging
│ ├── cortex.service
│ ├── neuron.service
│ ├── cortex.example.toml
│ ├── neuron.example.toml
│ └── models.example.toml
└── crates/
├── cortex-core/ # shared types, config, envelopes
│ └── src/
│ ├── lib.rs
│ ├── build_info.rs # BuildInfo type for /version endpoint
│ ├── config.rs # figment-based config structs
│ ├── catalogue.rs # ModelProfile, placement matching
│ ├── discovery.rs # DeviceInfo, DiscoveryResponse
│ ├── harness.rs # Harness trait, HarnessConfig, HarnessHealth
│ ├── node.rs # NodeState, ModelStatus
│ ├── openai.rs # OpenAI request/response types
│ ├── anthropic.rs # Anthropic request/response types
│ ├── translate.rs # OpenAI <-> Anthropic translation
│ └── metrics.rs # RequestMetrics, histogram helpers
├── cortex-gateway/ # the HTTP proxy server
│ └── src/
│ ├── lib.rs
│ ├── state.rs # CortexState: Arc<RwLock<...>>
│ ├── router.rs # model -> node routing logic
│ ├── proxy.rs # streaming HTTP proxy to backends
│ ├── evictor.rs # LRU/priority eviction logic
│ ├── poller.rs # background task polling neuron status
│ ├── handlers.rs # axum handlers (chat, completions, models, etc.)
│ └── metrics.rs # prometheus exporter endpoint
├── cortex-cli/ # CLI entrypoint
│ └── src/main.rs # binary: `cortex`
├── neuron/ # per-host LLM daemon (replaces cortex-agent)
│ ├── Cargo.toml # features: cuda, cudnn, flash-attn, cuda-integration
│ ├── build.rs # compiles CUDA kernels, emits build metadata
│ └── src/
│ ├── main.rs # binary: `neuron`
│ ├── discovery.rs # nvidia-smi parsing, device enumeration
│ ├── health.rs # runtime GPU polling
│ ├── api.rs # HTTP handlers for /discovery, /models, etc.
│ ├── version.rs # GET /version endpoint with BuildInfo
│ ├── models.rs # local model lifecycle orchestration
│ └── harness/ # in-process candle inference
│ ├── device_worker/ # per-device CUDA worker threads
│ │ ├── mod.rs # canonical narrative for worker architecture
│ │ ├── jobs.rs # Job enum, dispatch handlers
│ │ └── dispatch.rs # DeviceWorkerState struct
│ ├── candle.rs # candle model implementation
│ └── tp/ # tensor parallelism
│ └── worker.rs # TP worker subprocesses
├── helexa-acp/ # Agent Client Protocol bridge (Apache-2.0)
│ └── src/main.rs # binary: `helexa-acp`, self-contained (no workspace deps)
└── helexa-bench/ # benchmark harness
└── src/main.rs # binary: `helexa-bench`, SQLite-backed, version-aware
```
## Key Design Decisions
### Architecture
- **cortex** is the control plane. It exposes the unified API, routes requests, manages model lifecycle across the fleet, and collects metrics.
- **neuron** is the node plane. One instance runs on every GPU host. It discovers local hardware, manages in-process candle inference, handles NCCL tensor parallelism, and reports runtime state.
- cortex never shells out to `nvidia-smi`, never touches systemd units, and never talks directly to a harness. It talks only to neurons via HTTP API on port 13131.
### Per-device worker thread (neuron)
Every CUDA device gets one dedicated OS thread that owns its `CudaContext` for the daemon's lifetime. All CUDA operations route through this thread via a `std::sync::mpsc` job channel. Tensors never escape the worker thread alive. Inference replies carry `Vec<f32>` CPU-side logits; sampled tokens come back as `u32`. The opaque `ArchHandle(u64)` and `TpHandle(u64)` are indices into the worker's state slab, not pointers.
CPU loads (`Device::Cpu` fallback) keep the legacy `tokio::task::spawn_blocking + Arc<Mutex<ModelArch>>` path — there's no context to own and the channel hop would only add latency. Four `spawn_blocking` references in `harness/candle.rs` are deliberate CPU fallback.
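A minimal sketch of the worker pattern described above, using only the standard library. The `Job` variants, `spawn_device_worker`, and the `HashMap` slab are illustrative assumptions, not neuron's actual types. Plain `Vec<f32>` buffers stand in for device-resident tensors, and no CUDA context is created.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Opaque handle: an index into the worker's slab, never a pointer.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ArchHandle(pub u64);

/// Hypothetical job set. Every reply carries CPU-side data only.
pub enum Job {
    Load { weights: Vec<f32>, reply: mpsc::Sender<ArchHandle> },
    Forward { handle: ArchHandle, input: Vec<f32>, reply: mpsc::Sender<Option<Vec<f32>>> },
    Unload { handle: ArchHandle, reply: mpsc::Sender<bool> },
}

/// Spawn one dedicated OS thread that owns all state for a device.
pub fn spawn_device_worker(ordinal: usize) -> (mpsc::Sender<Job>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Job>();
    let join = thread::Builder::new()
        .name(format!("cuda-dev-{ordinal}"))
        .spawn(move || {
            // Stand-in for device-resident state; it lives and dies on this thread.
            let mut slab: HashMap<u64, Vec<f32>> = HashMap::new();
            let mut next_id: u64 = 0;
            // The loop ends once every Sender has been dropped.
            for job in rx {
                match job {
                    Job::Load { weights, reply } => {
                        let id = next_id;
                        next_id += 1;
                        slab.insert(id, weights);
                        let _ = reply.send(ArchHandle(id));
                    }
                    Job::Forward { handle, input, reply } => {
                        // Toy "forward pass": element-wise product with the stored weights.
                        let out = slab.get(&handle.0).map(|w| {
                            input.iter().zip(w.iter()).map(|(x, w)| x * w).collect::<Vec<f32>>()
                        });
                        let _ = reply.send(out);
                    }
                    Job::Unload { handle, reply } => {
                        let _ = reply.send(slab.remove(&handle.0).is_some());
                    }
                }
            }
        })
        .expect("spawning the device worker thread should not fail");
    (tx, join)
}
```

Callers hold only the `Sender<Job>` and `ArchHandle` values, so nothing that refers to device memory can outlive or leave the worker thread.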
### candle-native (not mistral.rs)
neuron builds directly on [candle](https://github.com/huggingface/candle). Every model architecture it serves is implemented in this repository, ported against the HuggingFace reference. No external inference server to babysit. The Harness trait remains as an internal seam for adding future engines (vision/audio/diffusion) but its only implementation is in-process candle.
### Streaming proxy
Chat completions are proxied as SSE streams. The gateway must:
1. Parse the inbound request to extract the model name
2. Route to the correct backend neuron
3. Stream the response back, capturing token timing for metrics
4. NOT buffer the full response — true streaming passthrough
### Anthropic translation
When a request arrives at `/v1/messages` (Anthropic format), the gateway translates it to OpenAI format before proxying to neuron, then translates the response back. This is stateless envelope transformation. The non-streaming round-trip is implemented; streaming SSE translation is deferred.
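As a rough illustration of the request-side translation, the sketch below moves Anthropic's top-level `system` prompt into a leading `system` message, which is how the OpenAI chat format carries it. The struct and function names are hypothetical and heavily simplified (string-only content, no tools, no streaming); the real types live in `cortex-core` and will differ.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Simplified Anthropic-style request (assumed shape for illustration).
#[derive(Debug, Clone, PartialEq)]
pub struct AnthropicRequest {
    pub model: String,
    pub system: Option<String>,
    pub messages: Vec<Message>,
    pub max_tokens: u32,
}

/// Simplified OpenAI-style request (assumed shape for illustration).
#[derive(Debug, Clone, PartialEq)]
pub struct OpenAiRequest {
    pub model: String,
    pub messages: Vec<Message>,
    pub max_tokens: Option<u32>,
}

/// Stateless envelope transformation: no lookups, no shared state.
pub fn anthropic_to_openai(req: AnthropicRequest) -> OpenAiRequest {
    let mut messages = Vec::with_capacity(req.messages.len() + 1);
    // Anthropic keeps the system prompt outside the message list;
    // OpenAI expects it as the first message.
    if let Some(system) = req.system {
        messages.push(Message { role: "system".to_string(), content: system });
    }
    messages.extend(req.messages);
    OpenAiRequest {
        model: req.model,
        messages,
        max_tokens: Some(req.max_tokens),
    }
}
```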
### Eviction
The evictor runs as a background task. Before loading a model on a node where VRAM is tight:
1. Check if the model is already loaded elsewhere → route there instead
2. Find the LRU model on the target node (excluding pinned models)
3. Call `POST {neuron}/models/unload` on that model
4. The incoming request's lazy-load triggers the new model load
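Step 2 above can be sketched as a small pure function. `LoadedModel` and `pick_lru_victim` are hypothetical names for illustration; the actual evictor state in `cortex-gateway` may track usage differently.

```rust
/// Hypothetical view of a model loaded on the target node.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LoadedModel {
    pub id: String,
    /// Time of the last routed request, as seconds since the Unix epoch.
    pub last_used_unix_secs: u64,
    /// Pinned models are never eviction candidates.
    pub pinned: bool,
}

/// Return the least-recently-used model that is not pinned, if any.
pub fn pick_lru_victim(models: &[LoadedModel]) -> Option<&LoadedModel> {
    models
        .iter()
        .filter(|m| !m.pinned)
        .min_by_key(|m| m.last_used_unix_secs)
}
```

A `None` result means every loaded model on the node is pinned, so the load cannot be satisfied there by eviction alone.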
### Metrics
Per-request: model, node, prompt_tokens, completion_tokens, total_tokens, tok_per_sec, time_to_first_token_ms, total_latency_ms. Exposed as Prometheus histograms/counters on a separate port (31314).
## Tech Stack
- **Rust 2024 edition** — workspace with 6 crates
- **Axum 0.8** — HTTP framework
- **reqwest** — HTTP client for proxying to backends
- **figment** — config loading (TOML + env vars)
- **tokio** — async runtime
- **metrics + metrics-exporter-prometheus** — observability
- **tracing** — structured logging
- **candle** — in-process inference engine (neuron only, with CUDA support)
- **cudarc** — patched for neuron's needs (see workspace `[patch]`)
- **clap** — CLI parsing
- **rusqlite** (bundled) — helexa-bench SQLite system-of-record
## Build Commands
```sh
cargo build --release # build all crates
cargo run -p cortex-cli -- serve # run the gateway
cargo test # run all tests
cargo clippy --workspace # lint
```
### neuron Features
- `cuda`: Enables CUDA acceleration in candle and cudarc/nccl bindings. Without it, falls back to CPU.
- `cudnn`: Use cuDNN for convolution/attention kernels (requires `cuda`).
- `flash-attn`: FlashAttention kernels (requires `cuda`).
- `cuda-integration`: Reserved for GPU-only integration tests (requires multiple CUDA devices + libnccl).
### Build Scripts
- `neuron/build.rs`: Compiles CUDA kernels (`src/cuda/*.cu`) using `cudaforge::KernelBuilder` when the `cuda` feature is enabled. Handles compute capability checks (targets below sm_80 disable bf16 intrinsics). Also captures build metadata: git SHA, dirty flag, timestamp, rustc version, profile, features, candle-core version.
## CI
Gitea Actions runs on every push to any branch. All three checks must pass before merging:
```sh
cargo fmt --check --all # formatting
cargo clippy --workspace -- -D warnings # lint (warnings are errors)
cargo test --workspace # tests
```
Run these locally before pushing. `cargo fmt --all` fixes formatting automatically. Clippy warnings must be resolved, not suppressed with `#[allow(...)]` unless there is a clear rationale.
Tagged releases (`v*`) build SRPMs for `cortex`, `helexa-neuron`, and `helexa-bench` and publish to COPR (`helexa/helexa`). For build metadata, CI injects the SHA by setting `HELEXA_BUILD_SHA=$(git rev-parse HEAD)`.
## Environment
- Targets Fedora 43 (systemd, SELinux enforcing)
- Nodes communicate over a private network (e.g. WireGuard mesh)
- cortex listens on port 31313 (API) and 31314 (metrics)
- neuron listens on port 13131 on each GPU host
- TLS terminated at gateway or via nginx; internal traffic is plaintext over WireGuard
## Conventions
- Error handling: `anyhow` for binaries, `thiserror` for library crates
- No `unwrap()` in library code; `expect()` only with clear rationale
- All public types derive `Debug, Clone, Serialize, Deserialize` where sensible
- Config structs use `figment` with TOML as primary source, env vars as override
- Prefer `Arc<RwLock<...>>` for shared fleet state; minimize lock duration
- SSE streaming uses `tokio_stream` + `eventsource-stream` for parsing
- Log at `info` for request routing, `debug` for proxy details, `warn` for eviction and node health, `error` for proxy failures
## Testing
### Gateway tests
Use mock neurons spawned via axum in `crates/cortex-gateway/tests/common/mod.rs`. Helpers: `spawn_mock_backend()`, `spawn_gateway()`.
### neuron integration tests
- Numerical reference tests (`numerical_reference.rs`) require `NEURON_REF_MODEL_PATH` env var pointing to a HF snapshot directory. Fixtures are f32-based for precision validation against HuggingFace transformers.
- CUDA integration tests (`tp_worker_lifecycle_cuda.rs`) are gated behind the `cuda-integration` feature and require 2+ CUDA devices (e.g., 2x RTX 5090).
### Metrics testing
Use `install_test_recorder()` in test code to capture metrics without the HTTP listener.
## helexa-bench
A continuous, version-aware benchmark harness. Hits each neuron directly on `:13131`, exercises each warm model with a Scenario suite (chat-latency family), and records results into SQLite stamped with the neuron's full `BuildInfo`. The loop is version-aware: it skips any (target, build SHA, model, scenario) cell already at `samples_per_version`.
Packaged as `helexa-bench` RPM (prebuilt-binary spec). One systemd unit, typically on the metrics host.
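The version-aware skip can be illustrated with an in-memory map in place of the SQLite store. `Cell` and `needs_sample` are hypothetical names; the real harness would presumably derive the counts from a query over recorded runs.

```rust
use std::collections::HashMap;

/// One benchmark cell: the key under which samples are counted.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Cell {
    pub target: String,
    pub build_sha: String,
    pub model: String,
    pub scenario: String,
}

/// True when the cell has fewer recorded samples than the configured quota.
pub fn needs_sample(
    counts: &HashMap<Cell, u32>,
    cell: &Cell,
    samples_per_version: u32,
) -> bool {
    counts.get(cell).copied().unwrap_or(0) < samples_per_version
}
```

Because the build SHA is part of the key, a new neuron build starts every cell at zero again, while an unchanged fleet is skipped once each cell reaches its quota.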
## helexa-acp
Agent Client Protocol bridge — connects ACP editors (Zed, etc.) to any OpenAI-compatible endpoint, cortex by default. Intentionally self-contained: no workspace crate dependencies. Uses `agent-client-protocol` with `unstable_session_model` feature for Zed model picker support. Licensed Apache-2.0 (workspace is GPL-3.0).
## RPM Packaging
- `cortex.spec` — installs the `cortex` binary
- `helexa-neuron.spec` — installs the `neuron` binary under package name `helexa-neuron` (renamed to avoid Fedora's NEURON neural-simulation package collision)
- Systemd units in `data/cortex.service`, `data/neuron.service`
- Example configs: `cortex.example.toml`, `neuron.example.toml`, `models.example.toml`
Install:
```sh
dnf copr enable helexa/helexa
dnf install cortex # gateway host
dnf install helexa-neuron # GPU nodes
```
## Configuration Files
### cortex.toml (gateway)
```toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru" # lru | priority
defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
```
### models.toml (catalogue)
```toml
[[models]]
id = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
harness = "candle"
quant = "Q4_K_M"
vram_mb = 19000
min_devices = 2
min_device_vram_mb = 10000
pinned_on = ["beast"] # optional: never evict from these neurons
```
### neuron.toml (per-host)
Configured via figment + env override. See `neuron.example.toml` for reference.
## neuron API Endpoints
```
GET /discovery → hardware discovery (hostname, OS, CUDA, devices, harnesses)
GET /health → runtime GPU stats (VRAM, utilization, temperature)
GET /models → loaded/unloaded models with VRAM usage
POST /models/load → load a model with spec (quant, TP, devices)
POST /models/unload → unload a model, freeing device memory
GET /models/{id}/endpoint → inference URL for a model
GET /version → build metadata (SHA, features, candle version, etc.)
```
## Sources of Truth
When prose documentation conflicts with code, trust:
1. Executable configuration (`*.toml`, `Cargo.toml` features)
2. Type definitions in `cortex-core/`
3. Test files in `crates/*/tests/` and `*/src/**/*_test.rs`
4. `CLAUDE.md` for historical design rationale


@@ -1,16 +1,26 @@
# CLAUDE.md — cortex
# CLAUDE.md — helexa
## Project overview
cortex is a Rust reverse-proxy that sits in front of multiple
mistral.rs inference nodes and presents a unified OpenAI + Anthropic
compatible API surface. It handles model routing, lifecycle management
(load/unload/evict), request translation, and metrics collection.
helexa is a self-hosted LLM serving stack for multi-node GPU inference
clusters. It has two components:
- **cortex** — the per-operator control plane and LLM proxy. A Rust
reverse-proxy that sits in front of the fleet and presents a unified
OpenAI + Anthropic compatible API surface. It handles model routing,
lifecycle management (load/unload/evict), request translation, and
metrics collection.
- **neuron** — the per-host LLM harness. One instance runs on every GPU
host, serving candle-based in-process inference and managing local
hardware discovery and model lifecycle.
(Historical note: cortex originally proxied to mistral.rs nodes; neuron
replaced that — see the 2026-05-18 candle-native addendum below.)
## Repository layout
```
cortex/
helexa/
├── Cargo.toml # workspace root
├── cortex.toml # example gateway config
├── README.md
@@ -548,7 +558,7 @@ and the hardcoded `vram_mb` per node.
## Revised repository layout
```
cortex/
helexa/
├── Cargo.toml
├── cortex.toml # gateway config (neurons only)
├── models.toml # model catalogue
@@ -754,3 +764,39 @@ Landed in four PRs:
from Phases 2/3 deleted; `SendComm` newtype no longer needed in the
load path. `grep -rn spawn_blocking crates/neuron/src/harness/`
returns only deliberate CPU-fallback hits after this PR.
## 2026-06-13 addendum: build metadata + helexa-bench
Two coupled additions so fleet performance can be tracked automatically
across neuron updates instead of by hand-running `script/bench.py` and
editing `doc/benchmarks.md`.
**neuron build metadata + `GET /version`.** neuron's `build.rs` now also
captures build identity (`HELEXA_GIT_SHA` — preferring a CI/RPM-injected
`HELEXA_BUILD_SHA`, falling back to git, else `unknown` — plus dirty
flag, build timestamp, rustc version, profile, enabled cargo features,
and a best-effort `candle-core` version from `Cargo.lock`). These are
exposed as `cortex_core::build_info::BuildInfo` (new module) from a new
`GET /version` endpoint (`neuron/src/version.rs`, wired in `api.rs`) and
in clap's `--version` long form. The SHA is injected in CI
(`build-prerelease.yml` build-neuron step: `export HELEXA_BUILD_SHA=$(git
rev-parse HEAD)`) and via `--define helexa_commit` in the source-build
spec, so tarball-built RPMs report the real SHA. `/version` is now the
canonical "which build is live" probe (supersedes the per-host RPM-sha
check in the fleet-validation flow).
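The SHA fallback order described above (injected value, then git, then `unknown`) reduces to a small pure function. This is a sketch under assumptions: `resolve_build_sha` is a hypothetical name, and treating an empty or whitespace-only value as absent is a choice made here, not something the text specifies.

```rust
/// Pick the build SHA: prefer the CI/RPM-injected value, fall back to
/// the git-derived one, and report "unknown" when neither is usable.
pub fn resolve_build_sha(injected: Option<&str>, git_head: Option<&str>) -> String {
    injected
        .into_iter()
        .chain(git_head)
        .map(str::trim)
        // Assumption: blank values count as missing.
        .find(|s| !s.is_empty())
        .unwrap_or("unknown")
        .to_string()
}
```

In a build script, the first argument would typically come from `std::env::var("HELEXA_BUILD_SHA").ok()` and the second from the output of `git rev-parse HEAD`, when that command succeeds.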
**`crates/helexa-bench`** — a new binary: a continuous, version-aware
benchmark harness (one systemd unit, typically on the metrics host). It
hits each neuron **directly** on `:13131`, exercises each **warm**
(`status == "loaded"`) model with an extensible `Scenario` suite (phase
1: the chat-latency family ported verbatim from `bench.py` — synthetic
128/4096-tok prompts, `/no_think`, streamed TTFT + decode-window
tok/s), and records each run into a SQLite system-of-record stamped with
the neuron's full `BuildInfo`. The loop is **version-aware**: it skips
any (target, build SHA, model, scenario) cell already at
`samples_per_version`, so a steady fleet costs only cheap `/version` +
`/models` polls until a new SHA ships. `helexa-bench report` regenerates
the `benchmarks.md`-style table from the DB. `kind = "openai"` targets
(mistral.rs/llama.cpp comparison) are scaffolded but not yet wired.
Packaged as the `helexa-bench` RPM (prebuilt-binary spec, outbound-only
so no firewalld service) via the same `build-prerelease.yml` pipeline.

213
Cargo.lock generated

@@ -472,6 +472,12 @@ version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b"
[[package]]
name = "byteorder-lite"
version = "0.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8f1fe948ff07f4bd06c30984e69f5b4899c516a3ef74f34df92a2df2ab535495"
[[package]]
name = "bytes"
version = "1.11.1"
@@ -668,6 +674,12 @@ dependencies = [
"cc",
]
[[package]]
name = "color_quant"
version = "1.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3d7b894f5411737b7867f4827955924d7c254fc9f4d91a6aad6b097804b1018b"
[[package]]
name = "colorchoice"
version = "1.0.5"
@@ -893,8 +905,7 @@ dependencies = [
[[package]]
name = "cudarc"
version = "0.19.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1cea5f10a99e025c1b44ae2354c2d8326b25ddbd0baf76bde8e55cfd4018a2cc"
source = "git+https://github.com/grenade/cudarc?rev=63327a256059f8252641ae46c6bb9eefe707f382#63327a256059f8252641ae46c6bb9eefe707f382"
dependencies = [
"float8",
"half",
@@ -1206,6 +1217,18 @@ dependencies = [
"pin-project-lite",
]
[[package]]
name = "fallible-iterator"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2acce4a10f12dc2fb14a218589d4f1f62ef011b2d0cc4b3cb1bba8e94da14649"
[[package]]
name = "fallible-streaming-iterator"
version = "0.1.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7360491ce676a36bf9bb3c56c1aa791658183a54d2744120f27285738d90465a"
[[package]]
name = "fancy-regex"
version = "0.17.0"
@@ -1223,6 +1246,15 @@ version = "2.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9f1f227452a390804cdb637b74a86990f2a7d7ba4b7d5693aac9b4dd6defd8d6"
[[package]]
name = "fdeflate"
version = "0.3.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1e6853b52649d4ac5c0bd02320cddc5ba956bdb407c4b75a2c6b75bf51500f8c"
dependencies = [
"simd-adler32",
]
[[package]]
name = "figment"
version = "0.10.19"
@@ -1230,8 +1262,10 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8cb01cd46b0cf372153850f4c6c272d9cbea2da513e07538405148f95bd789f3"
dependencies = [
"atomic",
"parking_lot",
"pear",
"serde",
"tempfile",
"toml",
"uncased",
"version_check",
@@ -1731,6 +1765,16 @@ dependencies = [
"wasip3",
]
[[package]]
name = "gif"
version = "0.14.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ee8cfcc411d9adbbaba82fb72661cc1bcca13e8bba98b364e62b2dba8f960159"
dependencies = [
"color_quant",
"weezl",
]
[[package]]
name = "glob"
version = "0.3.3"
@@ -1777,6 +1821,15 @@ version = "0.12.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8a9ee70c43aaf417c914396645a0fa852624801b24ebb7ae78fe8272889ac888"
[[package]]
name = "hashbrown"
version = "0.14.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e5274423e17b7c9fc20b6e7e208532f9b19825d82dfd615708b70edd83df41f1"
dependencies = [
"ahash",
]
[[package]]
name = "hashbrown"
version = "0.15.5"
@@ -1805,6 +1858,15 @@ version = "0.17.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4f467dd6dccf739c208452f8014c75c18bb8301b050ad1cfb27153803edb0f51"
[[package]]
name = "hashlink"
version = "0.9.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6ba4ff7128dee98c7dc9794b6a411377e1404dba1c97deb8d1a55297bd25d8af"
dependencies = [
"hashbrown 0.14.5",
]
[[package]]
name = "heck"
version = "0.5.0"
@@ -1835,6 +1897,30 @@ dependencies = [
"url",
]
[[package]]
name = "helexa-bench"
version = "0.1.16"
dependencies = [
"anyhow",
"async-trait",
"axum",
"chrono",
"clap",
"cortex-core",
"eventsource-stream",
"figment",
"futures",
"reqwest",
"rusqlite",
"serde",
"serde_json",
"tokio",
"tokio-stream",
"tower-http",
"tracing",
"tracing-subscriber",
]
[[package]]
name = "hermit-abi"
version = "0.5.2"
@@ -2135,6 +2221,34 @@ dependencies = [
"icu_properties",
]
[[package]]
name = "image"
version = "0.25.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "85ab80394333c02fe689eaf900ab500fbd0c2213da414687ebf995a65d5a6104"
dependencies = [
"bytemuck",
"byteorder-lite",
"color_quant",
"gif",
"image-webp",
"moxcms",
"num-traits",
"png",
"zune-core",
"zune-jpeg",
]
[[package]]
name = "image-webp"
version = "0.2.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "525e9ff3e1a4be2fbea1fdf0e98686a6d98b4d8f937e1bf7402245af1909e8c3"
dependencies = [
"byteorder-lite",
"quick-error",
]
[[package]]
name = "indexmap"
version = "1.9.3"
@@ -2299,6 +2413,17 @@ dependencies = [
"libc",
]
[[package]]
name = "libsqlite3-sys"
version = "0.30.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2e99fb7a497b1e3339bc746195567ed8d3e24945ecd636e3619d20b9de9e9149"
dependencies = [
"cc",
"pkg-config",
"vcpkg",
]
[[package]]
name = "linux-raw-sys"
version = "0.12.1"
@@ -2449,6 +2574,16 @@ dependencies = [
"serde_json",
]
[[package]]
name = "minijinja-contrib"
version = "2.20.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "99df5123c54391e2a228014c1dbbd85a3dab08a25e776c810526f2f47542b3de"
dependencies = [
"minijinja",
"serde",
]
[[package]]
name = "minimal-lexical"
version = "0.2.1"
@@ -2498,6 +2633,16 @@ dependencies = [
"syn",
]
[[package]]
name = "moxcms"
version = "0.8.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bb85c154ba489f01b25c0d36ae69a87e4a1c73a72631fc6c0eb6dde34a73e44b"
dependencies = [
"num-traits",
"pxfm",
]
[[package]]
name = "native-tls"
version = "0.2.18"
@@ -2522,6 +2667,7 @@ dependencies = [
"anyhow",
"async-trait",
"axum",
"base64 0.22.1",
"candle-core",
"candle-nn",
"candle-transformers",
@@ -2533,7 +2679,10 @@ dependencies = [
"futures",
"half",
"hf-hub",
"image",
"minijinja",
"minijinja-contrib",
"rayon",
"reqwest",
"safetensors 0.7.0",
"serde",
@@ -2861,6 +3010,19 @@ version = "0.3.33"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "19f132c84eca552bf34cab8ec81f1c1dcc229b811638f9d283dceabe58c5569e"
[[package]]
name = "png"
version = "0.18.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "60769b8b31b2a9f263dae2776c37b1b28ae246943cf719eb6946a1db05128a61"
dependencies = [
"bitflags",
"crc32fast",
"fdeflate",
"flate2",
"miniz_oxide",
]
[[package]]
name = "polling"
version = "3.11.0"
@@ -2974,6 +3136,12 @@ version = "0.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "40e24eee682d89fb193496edf918a7f407d30175b2e785fe057e4392dfd182e0"
[[package]]
name = "pxfm"
version = "0.1.29"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e0c5ccf5294c6ccd63a74f1565028353830a9c2f5eb0c682c355c471726a6e3f"
[[package]]
name = "quanta"
version = "0.12.6"
@@ -2989,6 +3157,12 @@ dependencies = [
"winapi",
]
[[package]]
name = "quick-error"
version = "2.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a993555f31e5a609f617c12db6250dedcac1b0a85076912c436e6fc9b2c8e6a3"
[[package]]
name = "quinn"
version = "0.11.9"
@@ -3324,6 +3498,20 @@ dependencies = [
"syn",
]
[[package]]
name = "rusqlite"
version = "0.32.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7753b721174eb8ff87a9a0e799e2d7bc3749323e773db92e0984debb00019d6e"
dependencies = [
"bitflags",
"fallible-iterator",
"fallible-streaming-iterator",
"hashlink",
"libsqlite3-sys",
"smallvec",
]
[[package]]
name = "rustc-hash"
version = "2.1.2"
@@ -4627,6 +4815,12 @@ dependencies = [
"rustls-pki-types",
]
[[package]]
name = "weezl"
version = "0.1.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a28ac98ddc8b9274cb41bb4d9d4d5c425b6020c50c46f25559911905610b4a88"
[[package]]
name = "which"
version = "7.0.3"
@@ -5164,3 +5358,18 @@ name = "zmij"
version = "1.0.21"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b8848ee67ecc8aedbaf3e4122217aff892639231befc6a1b58d29fff4c2cabaa"
[[package]]
name = "zune-core"
version = "0.5.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "cb8a0807f7c01457d0379ba880ba6322660448ddebc890ce29bb64da71fb40f9"
[[package]]
name = "zune-jpeg"
version = "0.5.15"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "27bc9d5b815bc103f142aa054f561d9187d191692ec7c2d1e2b4737f8dbd7296"
dependencies = [
"zune-core",
]


@@ -6,13 +6,14 @@ members = [
"crates/cortex-cli",
"crates/neuron",
"crates/helexa-acp",
"crates/helexa-bench",
]
[workspace.package]
version = "0.1.16"
edition = "2024"
license = "GPL-3.0-or-later"
repository = "https://git.lair.cafe/helexa/cortex"
repository = "https://git.lair.cafe/helexa/helexa"
[workspace.dependencies]
# async runtime
@@ -61,3 +62,12 @@ eventsource-stream = "0.2"
# workspace crates
cortex-core = { path = "crates/cortex-core" }
cortex-gateway = { path = "crates/cortex-gateway" }
# Patched cudarc (affects neuron's 0.19.x only; candle's 0.17.x is
# untouched since the fork is 0.19.7 and doesn't satisfy a 0.17 req). Adds
# Comm::abort / get_async_error / raw comm() — needed for #17 Stage 2 TP
# hang-recovery (abort a wedged collective from another thread, then
# rebuild the comm). Pinned to a fork revision pending upstream review
# (grenade/cudarc @ nccl-comm-abort).
[patch.crates-io]
cudarc = { git = "https://github.com/grenade/cudarc", rev = "63327a256059f8252641ae46c6bb9eefe707f382" }

190
README.md

@@ -1,25 +1,68 @@
# cortex
# helexa
A Rust reverse-proxy and fleet management layer for multi-node GPU inference
clusters. Cortex sits in front of one or more `neuron` daemons (each running
candle-based inference on a local GPU host) and presents a unified OpenAI +
Anthropic compatible API surface.
**Near-frontier AI for mortals.**
## Problem
helexa is a self-hosted LLM serving stack, written in Rust, for people
who run open-weight models on their own consumer GPUs. It has two
components:
Running local LLMs across multiple GPU nodes (different VRAM tiers, different
model affinities) requires a unified API surface that:
- **cortex** — the per-operator control plane and LLM proxy. It sits in
front of your GPU fleet and presents a unified OpenAI + Anthropic
compatible API surface, handling model routing, lifecycle management
(load / unload / evict), request translation, and metrics.
- **neuron** — the per-host LLM harness. One instance runs on every GPU
host, serving candle-based in-process inference and managing local
hardware discovery and model lifecycle.
- Presents a **single `/v1/models` catalogue** merging every model that can be
served by any neuron in the fleet.
- **Routes requests** to the correct node based on where a model is loaded
(or can be loaded), handling cold-load and eviction transparently.
- Manages **model lifecycle** — load on demand, unload cold models, pin
critical ones — by calling each neuron's `/models/{load,unload}` API.
- Translates between **OpenAI and Anthropic** request/response envelopes so
every client speaks whichever dialect it prefers.
- Captures **per-request metrics** (tokens, tok/s, TTFT, latency) and exposes
them as Prometheus counters/histograms.
## Why
Two principles constrain everything in this repository:
1. **Frontier or close to it.** helexa serves the open-weight models
that get nearest to frontier capability — not every architecture
ever published.
2. **Consumer hardware.** Everything must run on the cards mortals can
actually buy: a 3060 here, a 4090 there, a 5090 if you got lucky.
Mixed VRAM tiers across mismatched boxes are the expected topology,
not a degraded case.
GPU acquisition is harder than it was a year ago, and the gap between
what cloud providers charge and what your own silicon costs keeps
widening. The intersection of those two principles — near-frontier
models, squeezed onto hardware you own — is helexa's entire niche.
The secondary objective is **predictable consumption**. If you own the
hardware, your tooling shouldn't break because a cloud provider changed
billing, deprecated a model, or reshaped an API. cortex's OpenAI and
Anthropic surfaces are a stability contract: point your editor, agent,
or CLI at it once, and it keeps working.
## What helexa is not
This is an intentionally different path from vLLM, SGLang, and peers —
not a smaller version of them. Out of scope, permanently:
- Any-model breadth. Architectures are ported because they're at or
near the frontier, not to complete a compatibility matrix.
- Datacenter-class scheduling. No sophisticated continuous-batching /
paged-attention machinery — the workload is a handful of operators
and their agents, not 200 QPS.
- Wrapping external inference engines. neuron builds directly on
[candle](https://github.com/huggingface/candle); every model
architecture it serves is implemented in this repository, ported
against the HuggingFace reference.
One thing that is *not* a principle: CUDA exclusivity. All high-end
consumer hardware is in scope. helexa is CUDA-only today because
that's the hardware on the bench — nothing ships untested — and ROCm
or other consumer accelerators join as soon as there's real hardware
to build against.
In scope, and where the engineering effort goes: aggressive
quantization (GGUF Q4_K_M / Q6_K / Q8_0), NCCL tensor parallelism
across heterogeneous consumer GPUs, careful CUDA failure handling, and
single-request latency — the performance that one operator at a
keyboard actually feels.
## Architecture
@@ -29,7 +72,7 @@ model affinities) requires a unified API surface that:
└──────┬───────┘ └─────┬────┘ └──────┬─────┘ └──────┬─────┘
│ │ │ │
└────────────────┴──────┬───────┴───────────────┘
OpenAI + Anthropic APIs
┌──────────▼──────────┐
│ cortex │
│ (cortex-gateway) │
@@ -46,40 +89,59 @@ model affinities) requires a unified API surface that:
private network (.internal)
```
cortex discovers each neuron's hardware (devices, VRAM, compute
capability) at runtime and matches it against a model catalogue
(`models.toml`) to decide placement: which models fit where, what to
evict when VRAM is tight, where to route a request right now. Adding a
GPU host to the fleet is one `[[neurons]]` entry — no device specs in
config.
### Crates
| Crate | Purpose |
|---|---|
| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
| `neuron` | Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API |
| `neuron` | Per-host daemon: GPU discovery, in-process candle inference, NCCL tensor parallelism, model lifecycle API |
| `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) |
| `helexa-acp` | Agent Client Protocol bridge — connects ACP editors (Zed, etc.) to any OpenAI-compatible endpoint, cortex by default |
## Node setup
## The engine
Each GPU node runs `neuron` (listening on `:13131`). Neuron uses
huggingface/candle for in-process inference — there is no external
inference subprocess to manage.
neuron runs inference in-process on candle — there is no external
inference server to babysit. The parts that earn their keep:
Inside the daemon, every CUDA device gets one dedicated OS thread
(named `cuda-dev-N`) that owns the device's CUDA context for the
daemon's lifetime. Model loads, forward passes, KV-cache resets,
NCCL collectives, VRAM queries, and unloads all route through that
thread via a job channel; tensors never escape it alive. This pins
context binding to a known thread, makes the CUDA Drop contract
structurally safe, and isolates driver-error poisoning to one worker
rather than the whole process. See `CLAUDE.md` for the design
rationale and `crates/neuron/src/harness/device_worker/` for the code.
- **Per-device worker threads.** Every CUDA device gets one dedicated
OS thread that owns its CUDA context for the daemon's lifetime. All
loads, forward passes, KV-cache resets, NCCL collectives, VRAM
queries, and unloads route through it; tensors never escape it
alive. Context binding is pinned to a known thread, the CUDA `Drop`
contract is structurally safe, and a driver error poisons one worker
— visibly — instead of hanging the whole process.
- **Tensor parallelism on consumer cards.** Megatron-style row/column
parallel layers with NCCL all-reduce, spanning the mismatched GPUs
you actually have. A step watchdog aborts wedged collectives instead
of letting a request hang forever.
- **Current model focus: the Qwen3 family** — dense and GGUF-quantized,
including the hybrid linear-attention (Gated DeltaNet) generation.
Vision support is in progress. Each architecture is ported against
its HuggingFace reference implementation.
The neuron RPM (`helexa-neuron`) ships a systemd unit:
See `CLAUDE.md` for design rationale and
`crates/neuron/src/harness/device_worker/` for the worker narrative.
## Install
Pre-built RPMs for Fedora:
```sh
dnf copr enable helexa/helexa
dnf install helexa-neuron
systemctl enable --now neuron
dnf install cortex # on the gateway host
dnf install helexa-neuron # on each GPU host
systemctl enable --now cortex # or neuron, respectively
```
## Gateway config
## Configure
```toml
# /etc/cortex/cortex.toml
@@ -100,29 +162,10 @@ name = "benjy"
endpoint = "http://benjy.internal:13131"
```
Model placement profiles live in `models.toml` — see `models.example.toml`.
Model placement profiles (VRAM requirements, quant, device minimums,
pinning) live in `models.toml` — see `models.example.toml`.
## Building
```sh
cargo build --release
```
## CI
Every push triggers format, lint, and test checks. Ensure these pass
locally before pushing:
```sh
cargo fmt --check --all # must be clean
cargo clippy --workspace -- -D warnings # warnings are errors
cargo test --workspace # all tests must pass
```
Tagged releases (`v*`) additionally build SRPMs for both `cortex` and
`helexa-neuron` and publish to COPR.
## Running
## Run
```sh
# start the gateway
@@ -131,10 +174,37 @@ cortex serve --config /etc/cortex/cortex.toml
# check fleet status
cortex status
# list all models across nodes
# one catalogue across every node
curl http://localhost:31313/v1/models
```
## Build from source
```sh
cargo build --release
```
CI runs on every push; keep it green locally:
```sh
cargo fmt --check --all # must be clean
cargo clippy --workspace -- -D warnings # warnings are errors
cargo test --workspace # all tests must pass
```
Tagged releases (`v*`) build SRPMs for `cortex` and `helexa-neuron`
and publish to COPR.
## Status
Pre-1.0 and moving fast. The gateway path (routing, eviction,
translation, metrics) is stable and tested; the candle-native engine
is under active development — expect the supported-model list to track
the open-weight frontier, deliberately narrowly.
Development happens at <https://git.lair.cafe/helexa/helexa>;
<https://github.com/helexa-ai/helexa> is a read-only mirror.
## License
GPL-3.0


@@ -0,0 +1,38 @@
# helexa-bench config for bob.hanzalova.internal.
#
# Synced to /etc/helexa-bench/helexa-bench.toml by script/infra-setup.sh
# (the helexa-bench RPM ships helexa-bench.example.toml as a
# %config(noreplace) default; this per-host file overrides it).
#
# bob is a client host (it also runs Agent Zero); helexa-bench here hits
# every neuron on the fleet directly and records build-stamped results
# into the local SQLite store.
[bench]
sweep_interval_secs = 1800
samples_per_version = 5
iteration_pause_secs = 2
request_timeout_secs = 600
db_path = "/var/lib/helexa-bench/bench.sqlite"
[scenarios]
prompt_sizes = [128, 4096]
max_tokens = 256
# Read-only JSON API consumed by the bench UI (hosted separately) and for
# programmatic access. Served alongside the sweep loop.
[api]
enabled = true
listen = "0.0.0.0:13132"
[[targets]]
name = "beast"
endpoint = "http://beast.hanzalova.internal:13131"
[[targets]]
name = "benjy"
endpoint = "http://benjy.hanzalova.internal:13131"
[[targets]]
name = "quadbrat"
endpoint = "http://quadbrat.hanzalova.internal:13131"


@@ -1,30 +0,0 @@
# Helexa fleet manifest.
#
# Drives rolling deploys via script/deploy.sh and serves as the source
# of truth for which hosts run cortex vs neuron, and which CUDA
# compute-capability flavour each neuron host needs.
#
# Flavour ↔ NVIDIA generation ↔ compute cap:
# ampere sm_86 (RTX 30 series — e.g. 3060)
# ada sm_89 (RTX 40 series — e.g. 4090)
# blackwell sm_120 (RTX 50 series — e.g. 5090)
#
# The flavour determines which RPM is installed on a given neuron host:
# helexa-neuron-<flavour>. Only one flavour may be installed at a time
# (the packages Conflict: with each other).
cortex:
host: hanzalova.internal
neurons:
- host: beast.hanzalova.internal
flavour: blackwell
gpu: "2x RTX 5090"
- host: benjy.hanzalova.internal
flavour: ada
gpu: "RTX 4090"
- host: quadbrat.hanzalova.internal
flavour: ampere
gpu: "RTX 3060"


@@ -5,9 +5,9 @@
# invocation: `validate-neuron.sh beast.hanzalova.internal
# Qwen/Qwen3.6-27B q5k 2`.
#
# Synced by script/deploy.sh from asset/neuron/<short-host>.toml. Edits
# take effect on the next deploy.sh run (which stops + restarts the
# service so default_models is re-read at activation).
# Synced to /etc/neuron/neuron.toml by script/infra-setup.sh. Edits
# take effect after the next deploy workflow run restarts the service
# (default_models is read at activation).
port = 13131


@@ -4,7 +4,7 @@
# Qwen3-8B (bf16, ~18 GB), leaving ~6 GB for KV cache + activations on
# moderate-length contexts.
#
# Synced by script/deploy.sh from asset/neuron/<short-host>.toml.
# Synced to /etc/neuron/neuron.toml by script/infra-setup.sh.
port = 13131


@@ -4,7 +4,7 @@
# (bf16, ~4 GB), leaving ~7 GB for KV cache so long contexts on a small
# model still have plenty of room.
#
# Synced by script/deploy.sh from asset/neuron/<short-host>.toml.
# Synced to /etc/neuron/neuron.toml by script/infra-setup.sh.
port = 13131


@@ -0,0 +1,15 @@
# Bootstrap vhost for bench.helexa.ai — http-only, used ONLY to obtain
# the initial Let's Encrypt cert via the webroot challenge (the full TLS
# vhost can't load before the cert file exists). script/infra-setup.sh
# installs this, runs certbot, then swaps in bench.helexa.ai.conf.
server {
listen 80;
server_name bench.helexa.ai;
location /.well-known/acme-challenge/ {
root /var/www/bench.helexa.ai;
}
location / {
try_files $uri $uri/ =404;
}
}


@@ -0,0 +1,56 @@
# Public, auth-less bench UI at https://bench.helexa.ai.
#
# Serves the static SPA from /var/www/bench.helexa.ai (rsynced by
# .gitea/workflows/deploy.yml's deploy-bench-ui job) and reverse-proxies
# /api to the helexa-bench read API on bob over the WireGuard mesh — so
# the browser stays same-origin (no CORS) and the internal API never
# needs to be exposed publicly.
#
# TLS via Let's Encrypt; the cert is obtained/renewed by certbot
# (bootstrapped one-time in script/infra-setup.sh). Mirrors the
# dev.swym.hanzalova.internal vhost convention on this host.
server {
listen 80;
server_name bench.helexa.ai;
# Keep serving the ACME webroot so certbot can renew.
location /.well-known/acme-challenge/ {
root /var/www/bench.helexa.ai;
}
location / {
return 301 https://$host$request_uri;
}
}
server {
listen 443 ssl;
http2 on;
server_name bench.helexa.ai;
ssl_certificate /etc/letsencrypt/live/bench.helexa.ai/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/bench.helexa.ai/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:10m;
root /var/www/bench.helexa.ai;
index index.html;
# Bench read API on bob (internal WireGuard); browser stays same-origin.
location /api/ {
proxy_pass http://bob.hanzalova.internal:13132;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 60s;
}
# SPA fallback — client-side routes (/trends, /runs) resolve to index.html.
location / {
try_files $uri $uri/ /index.html;
}
}


@@ -0,0 +1,34 @@
# Internal bench UI vhost — https://bench.internal, reachable from inside
# the WireGuard mesh (the public bench.helexa.ai dead-ends at the OPNsense
# LAN interface, which only port-forwards :443 from the WAN). Same SPA +
# /api→bob proxy as bench.helexa.ai, but with an internal-CA cert
# (smallstep "lair", renewed by step@bench.timer). Mirrors the
# *.internal vhost convention on oolon.kosherinata.internal.
server {
server_name bench.internal;
listen 443 ssl;
http2 on;
ssl_certificate /etc/nginx/tls/cert/bench.internal.pem;
ssl_certificate_key /etc/nginx/tls/key/bench.internal.pem;
ssl_trusted_certificate /etc/pki/ca-trust/source/anchors/root-internal.pem;
ssl_protocols TLSv1.3;
# Shared webroot with the public vhost — same built SPA.
root /var/www/bench.helexa.ai;
index index.html;
location /api/ {
proxy_pass http://bob.hanzalova.internal:13132;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 60s;
}
location / {
try_files $uri $uri/ /index.html;
}
}


@@ -0,0 +1,25 @@
# Install on the bench host (bob) as /etc/sudoers.d/helexa_gitea_ci
# (owner root:root, mode 0440). Required by .gitea/workflows/deploy.yml,
# which SSHes as gitea_ci@bob to roll out helexa-bench package upgrades
# and config changes.
#
# Filename convention `helexa_gitea_ci` (vs bare `gitea_ci`) so other
# helexa-org apps can drop their own sudoers files on the same host
# without overwriting this one.
#
# helexa-bench polls the neuron fleet (outbound) and serves a read-only
# JSON API on tcp/13132 for the bench UI — hence the firewall-cmd grants.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/helexa-bench/helexa-bench.toml
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl start helexa-bench.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl stop helexa-bench.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl enable --now helexa-bench.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-bench
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-bench
# sudoers reserves `:` and `=` and requires `\` escaping inside command
# arguments — without it visudo errors at the first `:` in `https://`.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo --from-repofile\=https\://rpm.lair.cafe/lair-cafe-unstable.repo
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager setopt lair-cafe-unstable.enabled\=1
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service=helexa-bench --permanent
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --reload


@@ -0,0 +1,23 @@
# Install on the cortex gateway host as /etc/sudoers.d/helexa_gitea_ci
# (owner root:root, mode 0440). Required by .gitea/workflows/deploy.yml,
# which SSHes as gitea_ci@<gateway> to roll out cortex package upgrades
# and config changes.
#
# Filename convention `helexa_gitea_ci` (vs bare `gitea_ci`) so other
# helexa-org apps can drop their own sudoers files on the same host
# without overwriting this one.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/cortex/cortex.toml
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/cortex/models.toml
# deploy-bench-ui rsyncs the built bench SPA into the nginx webroot.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /var/www/bench.helexa.ai/
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl start cortex.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl stop cortex.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl enable --now cortex.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y cortex
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y cortex
# sudoers reserves `:` and `=` and requires `\` escaping inside command
# arguments — without it visudo errors at the first `:` in `https://`.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo --from-repofile\=https\://rpm.lair.cafe/lair-cafe-unstable.repo
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager setopt lair-cafe-unstable.enabled\=1


@@ -0,0 +1,43 @@
# Install on every neuron host as /etc/sudoers.d/helexa_gitea_ci
# (owner root:root, mode 0440). Required by .gitea/workflows/deploy.yml,
# which SSHes as gitea_ci@<neuron-host> to roll out helexa-neuron-<flavour>
# package upgrades and config changes.
#
# Filename convention `helexa_gitea_ci` (vs bare `gitea_ci`) so other
# helexa-org apps can drop their own sudoers files on the same host
# without overwriting this one.
#
# All three CUDA flavours are listed because a host's flavour can change
# (e.g. GPU swap) and we don't want the sudoers file to need to change
# in lockstep. Only one flavour can be installed at a time (the packages
# Conflict: with each other), so the attack surface is bounded to "wrong
# flavour installed" — vandalism, not privilege escalation.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/neuron/neuron.toml
# deploy.yml writes the per-model systemd drop-in carrying
# NEURON_MAX_PROMPT_TOKENS: gitea_ci stages it in its own dir, then
# installs it root-owned. Exact source/dest paths; see doc/context-limits.md.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/install -o root -g root -m 0644 -D /var/lib/gitea_ci/model.conf /etc/systemd/system/neuron.service.d/model.conf
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl start neuron.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl stop neuron.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl enable --now neuron.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-neuron-ampere
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-neuron-ampere
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-neuron-ada
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-neuron-ada
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-neuron-blackwell
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-neuron-blackwell
# sudoers reserves `:` and `=` and requires `\` escaping inside command
# arguments — without it visudo errors at the first `:` in `https://`.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo --from-repofile\=https\://rpm.lair.cafe/lair-cafe-unstable.repo
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager setopt lair-cafe-unstable.enabled\=1
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo --from-repofile\=https\://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install -y libcudnn9-cuda-13
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service=helexa-neuron --permanent
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --reload
# deploy-dev.yml fast path: install a freshly-built dev binary over the
# packaged one. Exact source path + args; the workflow must use this
# command form verbatim. The next deploy.yml run reconciles the host
# back to the RPM-owned binary.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/install -o root -g root -m 0755 /var/lib/gitea_ci/neuron-dev /usr/bin/neuron


@@ -0,0 +1,20 @@
# Internal-CA cert renewal for %i.internal, driven by step@%i.timer.
# Replicated from oolon.kosherinata.internal (the kosherinata DC proxy).
# Renews an EXISTING cert via mTLS (step ca renew) — the initial cert
# must be issued once with a provisioner (see script/infra-setup.sh).
# Installed to /etc/systemd/system/step@.service.
[Unit]
Description=step cert renew for %i.internal
Documentation=https://smallstep.com/docs/step-ca/renewal
[Service]
Type=oneshot
ExecCondition=/usr/bin/step certificate needs-renewal \
/etc/nginx/tls/cert/%i.internal.pem
ExecStart=/usr/bin/step ca renew \
--force \
--ca-url https://ca.internal \
--root /etc/pki/ca-trust/source/anchors/root-internal.pem \
/etc/nginx/tls/cert/%i.internal.pem \
/etc/nginx/tls/key/%i.internal.pem
ExecStartPost=/usr/bin/systemctl reload nginx.service

asset/systemd/step@.timer Normal file

@@ -0,0 +1,15 @@
# Periodic internal-cert renewal for %i.internal (every 15 min, jittered).
# Replicated from oolon.kosherinata.internal. Installed to
# /etc/systemd/system/step@.timer; enable per-cert with
# `systemctl enable --now step@bench.timer`.
[Unit]
Description=step cert renew timer for %i.internal
[Timer]
Persistent=true
OnCalendar=*:1/15
AccuracySec=1us
RandomizedDelaySec=5m
[Install]
WantedBy=timers.target

bench/.gitignore vendored Normal file

@@ -0,0 +1,3 @@
node_modules
dist
*.local

bench/README.md Normal file

@@ -0,0 +1,45 @@
# helexa bench UI
A Vite + React (SWC, TypeScript) app that visualises the fleet benchmark
data collected by `helexa-bench`. It reads the read-only JSON API the
bench daemon serves (`crates/helexa-bench/src/api.rs`, default
`:13132` on bob).
Stack: React Router, react-bootstrap, Recharts.
## Pages
- **Overview** — latest median results per (host, model, scenario) cell.
- **Trends** — decode-tok/s and TTFT plotted across neuron build SHAs as
releases roll out (the headline view). Pick model / scenario; the host
is resolved server-side.
- **Runs** — filterable raw-run explorer.
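
To make "median per cell" concrete, here is a small TypeScript sketch of grouping samples by (host, model, scenario) and taking the median of each group. It is an illustration only: the real aggregation happens inside `helexa-bench`, and the `Sample` type, `median`, and `medianPerCell` are hypothetical names.

```typescript
// Illustrative sample record; only the fields needed for grouping.
interface Sample {
  host: string;
  model: string;
  scenario: string;
  decode_tps: number;
}

// Median of a list of numbers; null when there are no samples.
function median(values: number[]): number | null {
  if (values.length === 0) return null;
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Group samples into cells keyed by "host|model|scenario" and reduce
// each cell to the median decode rate.
function medianPerCell(samples: Sample[]): Map<string, number | null> {
  const cells = new Map<string, number[]>();
  for (const s of samples) {
    const key = `${s.host}|${s.model}|${s.scenario}`;
    const bucket = cells.get(key) ?? [];
    bucket.push(s.decode_tps);
    cells.set(key, bucket);
  }
  const out = new Map<string, number | null>();
  cells.forEach((values, key) => out.set(key, median(values)));
  return out;
}
```

A median is less sensitive than a mean to a single slow sample, which matters when each cell holds only a handful of runs.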
## Develop
```sh
cd bench
npm install
npm run dev # http://localhost:5173
```
`vite.config.ts` proxies `/api` to `http://bob.hanzalova.internal:13132`,
so the dev server talks to the live bench API with no CORS fuss. Point
the proxy elsewhere (or run a local `helexa-bench serve`) to develop
against other data.
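
For orientation, the proxy entry might look roughly like the object below. This is a sketch under assumptions rather than the repository's actual `vite.config.ts`: the `devProxy` and `BENCH_API` names are made up, and `changeOrigin` is an assumed setting. In a real config the object would be passed as `server: { proxy: devProxy }` inside `defineConfig`.

```typescript
// Upstream bench read API, as named in this README.
const BENCH_API = "http://bob.hanzalova.internal:13132";

// Hypothetical `server.proxy` value for the Vite dev server.
const devProxy = {
  "/api": {
    target: BENCH_API, // forward /api/* to the bench API
    changeOrigin: true, // assumption: send the target's host in the Host header
  },
};
```

To develop against other data, change `BENCH_API` to the origin of a locally running bench API.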
## Production hosting
Public at **https://bench.helexa.ai** — nginx on the gateway
(`hanzalova.internal`) serves the static `dist/` and reverse-proxies
`/api` to the bench API on bob over WireGuard, so the SPA is same-origin
(no CORS) and the internal API stays off the public internet.
- `npm run build` is run with **no** `VITE_API_BASE` (the app calls
`/api/...` on its own origin; nginx proxies it to bob).
- `.gitea/workflows/deploy.yml` (`deploy-bench-ui`) builds and rsyncs
`dist/` to `/var/www/bench.helexa.ai` on every deploy.
- The nginx vhost (`asset/nginx/bench.helexa.ai.conf`) and the
Let's Encrypt cert are one-time host setup in `script/infra-setup.sh`.
To host elsewhere instead, build with
`VITE_API_BASE=<bob-api-origin>` and serve the static `dist/`.
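
The effect of `VITE_API_BASE` on request URLs can be sketched as follows. `resolveApiUrl` is a hypothetical helper for illustration; the trailing-slash trimming is an addition of this sketch and is not claimed to match the app's behaviour.

```typescript
// Hypothetical helper: an unset or empty base yields a same-origin path
// (served via the nginx or dev proxy); a configured base yields an
// absolute URL pointing at the separately hosted API.
function resolveApiUrl(base: string | undefined, path: string): string {
  const prefix = (base ?? "").replace(/\/+$/, ""); // tolerate a trailing slash
  return `${prefix}${path}`;
}

// Same-origin build (no VITE_API_BASE):
const sameOrigin = resolveApiUrl(undefined, "/api/summary");

// Separately hosted build:
const absolute = resolveApiUrl(
  "http://bob.hanzalova.internal:13132/",
  "/api/runs?limit=200",
);
```

With an absolute base the browser makes cross-origin requests, so the API host would need to allow them; the same-origin setup avoids that.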

bench/index.html Normal file

@@ -0,0 +1,12 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>helexa bench</title>
</head>
<body>
<div id="root"></div>
<script type="module" src="/src/main.tsx"></script>
</body>
</html>

bench/package-lock.json generated Normal file

File diff suppressed because it is too large

bench/package.json Normal file

@@ -0,0 +1,28 @@
{
"name": "helexa-bench-ui",
"private": true,
"version": "0.1.0",
"type": "module",
"description": "Visualisation app for helexa-bench fleet benchmark data.",
"scripts": {
"dev": "vite",
"build": "tsc && vite build",
"preview": "vite preview"
},
"dependencies": {
"bootstrap": "^5.3.3",
"react": "^18.3.1",
"react-bootstrap": "^2.10.5",
"react-dom": "^18.3.1",
"react-router-dom": "^6.26.2",
"recharts": "^2.12.7"
},
"devDependencies": {
"@types/node": "^20.14.0",
"@types/react": "^18.3.5",
"@types/react-dom": "^18.3.0",
"@vitejs/plugin-react-swc": "^3.7.0",
"typescript": "^5.5.4",
"vite": "^5.4.0"
}
}

bench/src/App.tsx Normal file

@@ -0,0 +1,30 @@
import { Container, Nav, Navbar } from "react-bootstrap";
import { NavLink, Outlet } from "react-router-dom";
export default function App() {
return (
<>
<Navbar bg="dark" variant="dark" expand="md">
<Container>
<Navbar.Brand as={NavLink} to="/">
helexa&nbsp;bench
</Navbar.Brand>
<Nav className="me-auto">
<Nav.Link as={NavLink} to="/" end>
Overview
</Nav.Link>
<Nav.Link as={NavLink} to="/trends">
Trends
</Nav.Link>
<Nav.Link as={NavLink} to="/runs">
Runs
</Nav.Link>
</Nav>
</Container>
</Navbar>
<Container className="py-4">
<Outlet />
</Container>
</>
);
}

bench/src/api.ts Normal file

@@ -0,0 +1,45 @@
import type { Dimensions, ReportRow, RunRow, SeriesPoint } from "./types";
// Empty default → `fetch('/api/...')` hits the dev proxy (vite.config.ts)
// or the same origin. For a separately-hosted build, set VITE_API_BASE to
// the bob API origin (e.g. http://bob.hanzalova.internal:13132).
const BASE = import.meta.env.VITE_API_BASE ?? "";
async function getJson<T>(path: string): Promise<T> {
const res = await fetch(`${BASE}${path}`);
if (!res.ok) {
throw new Error(`${res.status} ${res.statusText}: ${await res.text()}`);
}
return res.json() as Promise<T>;
}
export const getDimensions = () => getJson<Dimensions>("/api/dimensions");
export const getSummary = () => getJson<ReportRow[]>("/api/summary");
// host is resolved server-side (each model maps to one host today), so the
// public UI selects by model + scenario alone.
export const getSeries = (model: string, scenario: string) =>
getJson<SeriesPoint[]>(
`/api/series?model=${encodeURIComponent(model)}&scenario=${encodeURIComponent(scenario)}`,
);
export interface RunsParams {
host?: string;
model?: string;
scenario?: string;
sha?: string;
ok?: boolean;
limit?: number;
}
export const getRuns = (p: RunsParams = {}) => {
const q = new URLSearchParams();
if (p.host) q.set("host", p.host);
if (p.model) q.set("model", p.model);
if (p.scenario) q.set("scenario", p.scenario);
if (p.sha) q.set("sha", p.sha);
if (p.ok !== undefined) q.set("ok", String(p.ok));
if (p.limit) q.set("limit", String(p.limit));
const qs = q.toString();
return getJson<RunRow[]>(`/api/runs${qs ? `?${qs}` : ""}`);
};

bench/src/baseline.ts Normal file

@@ -0,0 +1,52 @@
// Pre-helexa-bench baseline, transcribed verbatim from doc/benchmarks.md.
//
// IMPORTANT — different measurement regime. These were measured by
// script/bench.py *through the cortex gateway* (so TTFT/total include a
// proxy hop), reported as medians only, before helexa-bench existed.
// helexa-bench measures each neuron *directly*. So these points are an
// honest historical anchor, NOT apples-to-apples with the live series —
// the Trends view renders them dashed + labelled, never merged into the
// live line.
//
// Host is inferred from the model via the doc's Fleet table
// (beast=27B, benjy=8B, quadbrat=1.7B). Timestamps are the two 2026-06-12
// snapshots in the doc, ordered (08:00 = pre-#11, 16:00 = post-#11) so
// they sort before the bench era on the shared time axis.
export interface BaselinePoint {
host: string;
model: string;
scenario: string;
git_sha: string;
build_timestamp: string;
ttft_s: number;
decode_tps: number;
total_s: number;
}
/** Source: bench.py via cortex gateway — see doc/benchmarks.md. */
export const BASELINE_SOURCE = "bench.py · via cortex gateway";
export const BASELINE: BaselinePoint[] = [
// ── 8f6f1d3 — baseline (2026-06-12) ────────────────────────────────
{ host: "beast", model: "Qwen/Qwen3.6-27B", scenario: "chat:128", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 1.658, decode_tps: 35.0, total_s: 8.981 },
{ host: "beast", model: "Qwen/Qwen3.6-27B", scenario: "chat:4096", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 7.067, decode_tps: 33.7, total_s: 14.63 },
{ host: "benjy", model: "Qwen/Qwen3-8B", scenario: "chat:128", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 0.884, decode_tps: 62.4, total_s: 4.938 },
{ host: "benjy", model: "Qwen/Qwen3-8B", scenario: "chat:4096", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 1.818, decode_tps: 46.5, total_s: 7.27 },
{ host: "quadbrat", model: "Qwen/Qwen3-1.7B", scenario: "chat:128", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 0.685, decode_tps: 81.3, total_s: 3.741 },
{ host: "quadbrat", model: "Qwen/Qwen3-1.7B", scenario: "chat:4096", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 2.743, decode_tps: 35.4, total_s: 9.884 },
// ── a1952a4 — post prefix-KV-cache (#11, 2026-06-12) ───────────────
{ host: "beast", model: "Qwen/Qwen3.6-27B", scenario: "chat:128", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 1.355, decode_tps: 45.8, total_s: 4.147 },
{ host: "beast", model: "Qwen/Qwen3.6-27B", scenario: "chat:4096", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 1.431, decode_tps: 43.3, total_s: 4.387 },
{ host: "benjy", model: "Qwen/Qwen3-8B", scenario: "chat:128", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 0.886, decode_tps: 78.6, total_s: 2.478 },
{ host: "benjy", model: "Qwen/Qwen3-8B", scenario: "chat:4096", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 1.824, decode_tps: 58.3, total_s: 3.969 },
{ host: "quadbrat", model: "Qwen/Qwen3-1.7B", scenario: "chat:128", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 0.702, decode_tps: 104.8, total_s: 1.895 },
{ host: "quadbrat", model: "Qwen/Qwen3-1.7B", scenario: "chat:4096", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 2.749, decode_tps: 44.9, total_s: 5.534 },
];
/** Baseline points for one (model, scenario) cell, oldest first. */
export function baselineFor(model: string, scenario: string): BaselinePoint[] {
return BASELINE.filter(
(b) => b.model === model && b.scenario === scenario,
).sort((a, b) => a.build_timestamp.localeCompare(b.build_timestamp));
}

bench/src/main.tsx Normal file

@@ -0,0 +1,22 @@
import React from "react";
import ReactDOM from "react-dom/client";
import { BrowserRouter, Route, Routes } from "react-router-dom";
import "bootstrap/dist/css/bootstrap.min.css";
import App from "./App";
import Overview from "./pages/Overview";
import Trends from "./pages/Trends";
import Runs from "./pages/Runs";
ReactDOM.createRoot(document.getElementById("root")!).render(
<React.StrictMode>
<BrowserRouter>
<Routes>
<Route path="/" element={<App />}>
<Route index element={<Overview />} />
<Route path="trends" element={<Trends />} />
<Route path="runs" element={<Runs />} />
</Route>
</Routes>
</BrowserRouter>
</React.StrictMode>,
);


@@ -0,0 +1,64 @@
import { useEffect, useState } from "react";
import { Alert, Spinner, Table } from "react-bootstrap";
import { getSummary } from "../api";
import type { ReportRow } from "../types";
const f = (n: number | null, p = 2) => (n == null ? "—" : n.toFixed(p));
export default function Overview() {
const [rows, setRows] = useState<ReportRow[]>([]);
const [err, setErr] = useState<string | null>(null);
const [loading, setLoading] = useState(true);
useEffect(() => {
getSummary()
.then(setRows)
.catch((e) => setErr(String(e)))
.finally(() => setLoading(false));
}, []);
if (loading) return <Spinner animation="border" />;
if (err) return <Alert variant="danger">{err}</Alert>;
return (
<>
<h3 className="mb-3">Latest results per cell</h3>
<p className="text-muted">
Median of each cell's samples on the most recent build seen for that
(host, model, scenario).
</p>
<Table striped bordered hover responsive size="sm">
<thead>
<tr>
<th>GPU</th>
<th>model</th>
<th className="text-end">prompt tok</th>
<th className="text-end">TTFT (s)</th>
<th className="text-end">decode tok/s</th>
<th className="text-end">total (s)</th>
<th>build</th>
<th className="text-end">n</th>
</tr>
</thead>
<tbody>
{rows.map((r, i) => (
<tr key={i}>
<td>{r.gpu ?? r.target_name}</td>
<td>{r.model_id}</td>
<td className="text-end">
{r.prompt_tokens ?? `~${r.prompt_size_approx}`}
</td>
<td className="text-end">{f(r.ttft_s_median, 3)}</td>
<td className="text-end">{f(r.decode_tps_median, 1)}</td>
<td className="text-end">{f(r.total_s_median, 3)}</td>
<td>
<code>{r.git_sha}</code>
</td>
<td className="text-end">{r.samples}</td>
</tr>
))}
</tbody>
</Table>
</>
);
}

bench/src/pages/Runs.tsx Normal file

@@ -0,0 +1,141 @@
import { useEffect, useState } from "react";
import { Alert, Badge, Col, Form, Row, Spinner, Table } from "react-bootstrap";
import { getDimensions, getRuns } from "../api";
import type { Dimensions, RunRow } from "../types";
const f = (n: number | null, p = 2) => (n == null ? "—" : n.toFixed(p));
function Picker({
label,
value,
set,
options,
}: {
label: string;
value: string;
set: (v: string) => void;
options: string[];
}) {
return (
<Form.Group as={Col}>
<Form.Label>{label}</Form.Label>
<Form.Select value={value} onChange={(e) => set(e.target.value)}>
<option value="">(all)</option>
{options.map((o) => (
<option key={o} value={o}>
{o}
</option>
))}
</Form.Select>
</Form.Group>
);
}
export default function Runs() {
const [dims, setDims] = useState<Dimensions | null>(null);
const [host, setHost] = useState("");
const [model, setModel] = useState("");
const [scenario, setScenario] = useState("");
const [rows, setRows] = useState<RunRow[]>([]);
const [err, setErr] = useState<string | null>(null);
const [loading, setLoading] = useState(false);
useEffect(() => {
getDimensions()
.then(setDims)
.catch((e) => setErr(String(e)));
}, []);
useEffect(() => {
setLoading(true);
getRuns({
host: host || undefined,
model: model || undefined,
scenario: scenario || undefined,
limit: 200,
})
.then(setRows)
.catch((e) => setErr(String(e)))
.finally(() => setLoading(false));
}, [host, model, scenario]);
if (err) return <Alert variant="danger">{err}</Alert>;
return (
<>
<h3 className="mb-3">Runs</h3>
{dims && (
<Row className="g-3 mb-3">
{/* GPU filter — labelled by GPU, but filters by the underlying host. */}
<Form.Group as={Col}>
<Form.Label>GPU</Form.Label>
<Form.Select value={host} onChange={(e) => setHost(e.target.value)}>
<option value="">(all)</option>
{dims.hosts.map((h) => (
<option key={h} value={h}>
{dims.host_gpus[h] ?? h}
</option>
))}
</Form.Select>
</Form.Group>
<Picker
label="Model"
value={model}
set={setModel}
options={dims.models}
/>
<Picker
label="Scenario"
value={scenario}
set={setScenario}
options={dims.scenarios}
/>
</Row>
)}
{loading ? (
<Spinner animation="border" />
) : (
<Table striped bordered hover responsive size="sm">
<thead>
<tr>
<th>ts</th>
<th>GPU</th>
<th>model</th>
<th>scenario</th>
<th>build</th>
<th className="text-end">TTFT</th>
<th className="text-end">tok/s</th>
<th className="text-end">total</th>
<th>ok</th>
</tr>
</thead>
<tbody>
{rows.map((r) => (
<tr key={r.id}>
<td>{r.ts}</td>
<td>{r.gpu ?? r.host}</td>
<td>{r.model_id}</td>
<td>{r.scenario_id}</td>
<td>
<code>{r.git_sha}</code>
</td>
<td className="text-end">{f(r.ttft_s, 3)}</td>
<td className="text-end">{f(r.decode_tps, 1)}</td>
<td className="text-end">{f(r.total_s, 3)}</td>
<td>
{r.ok ? (
<Badge bg="success">ok</Badge>
) : (
<Badge bg="danger" title={r.error ?? ""}>
fail
</Badge>
)}
</td>
</tr>
))}
</tbody>
</Table>
)}
</>
);
}

bench/src/pages/Trends.tsx Normal file

@@ -0,0 +1,221 @@
import { useEffect, useMemo, useState } from "react";
import { Alert, Col, Form, Row, Spinner } from "react-bootstrap";
import {
CartesianGrid,
Legend,
Line,
LineChart,
ReferenceLine,
ResponsiveContainer,
Tooltip,
XAxis,
YAxis,
} from "recharts";
import { getDimensions, getSeries } from "../api";
import type { Dimensions, SeriesPoint } from "../types";
import { BASELINE_SOURCE, baselineFor } from "../baseline";
function Picker({
label,
value,
set,
options,
}: {
label: string;
value: string;
set: (v: string) => void;
options: string[];
}) {
return (
<Form.Group as={Col}>
<Form.Label>{label}</Form.Label>
<Form.Select value={value} onChange={(e) => set(e.target.value)}>
{options.map((o) => (
<option key={o} value={o}>
{o}
</option>
))}
</Form.Select>
</Form.Group>
);
}
export default function Trends() {
const [dims, setDims] = useState<Dimensions | null>(null);
const [model, setModel] = useState("");
const [scenario, setScenario] = useState("");
const [series, setSeries] = useState<SeriesPoint[]>([]);
const [err, setErr] = useState<string | null>(null);
useEffect(() => {
getDimensions()
.then((d) => {
setDims(d);
if (d.models[0]) setModel(d.models[0]);
if (d.scenarios[0]) setScenario(d.scenarios[0]);
})
.catch((e) => setErr(String(e)));
}, []);
useEffect(() => {
if (model && scenario) {
getSeries(model, scenario)
.then(setSeries)
.catch((e) => setErr(String(e)));
}
}, [model, scenario]);
// Prepend the pre-helexa-bench baseline (dashed, separate keys) so it
// anchors the timeline without being merged into the live line. Different
// measurement regime — see baseline.ts / doc/benchmarks.md.
const base = useMemo(
() => baselineFor(model, scenario),
[model, scenario],
);
const data = useMemo(
() => [
...base.map((p) => ({
label: p.git_sha,
baseTtft: p.ttft_s,
baseDecode: p.decode_tps,
baseTotal: p.total_s,
})),
...series.map((p) => ({
label: p.git_sha,
ttft: p.ttft_s_median,
decode: p.decode_tps_median,
total: p.total_s_median,
})),
],
[series, base],
);
// Divider marking the boundary between the two regimes (drawn at the
// first live build, with baseline points to its left).
const firstLive = series[0]?.git_sha;
const showDivider = base.length > 0 && series.length > 0;
if (err) return <Alert variant="danger">{err}</Alert>;
if (!dims) return <Spinner animation="border" />;
return (
<>
<h3 className="mb-3">Trends over builds</h3>
<Row className="g-3 mb-4">
<Picker
label="Model"
value={model}
set={setModel}
options={dims.models}
/>
<Picker
label="Scenario"
value={scenario}
set={setScenario}
options={dims.scenarios}
/>
</Row>
{dims.model_gpus[model] && (
<p className="text-muted mb-3">
Measured on <strong>{dims.model_gpus[model]}</strong>.
</p>
)}
{data.length === 0 ? (
<Alert variant="info">No data for this selection yet.</Alert>
) : (
<>
{base.length > 0 && (
<p className="text-muted small mb-3">
Dashed = pre-helexa-bench baseline ({BASELINE_SOURCE}); solid =
helexa-bench (direct to neuron). Different measurement regimes;
see <code>doc/benchmarks.md</code>.
</p>
)}
<h5 className="mt-3">decode tok/s (higher is better)</h5>
<ResponsiveContainer width="100%" height={280}>
<LineChart data={data} margin={{ top: 8, right: 24, bottom: 8, left: 0 }}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="label" />
<YAxis />
<Tooltip />
<Legend />
{showDivider && firstLive && (
<ReferenceLine
x={firstLive}
stroke="#bbb"
strokeDasharray="3 3"
label={{
value: "bench.py → helexa-bench",
position: "top",
fill: "#999",
fontSize: 11,
}}
/>
)}
<Line
type="monotone"
dataKey="decode"
name="decode tok/s"
stroke="#0d6efd"
connectNulls
/>
{base.length > 0 && (
<Line
type="monotone"
dataKey="baseDecode"
name="baseline (bench.py · gateway)"
stroke="#888"
strokeDasharray="5 5"
connectNulls
/>
)}
</LineChart>
</ResponsiveContainer>
<h5 className="mt-4">TTFT seconds (lower is better)</h5>
<ResponsiveContainer width="100%" height={280}>
<LineChart data={data} margin={{ top: 8, right: 24, bottom: 8, left: 0 }}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="label" />
<YAxis />
<Tooltip />
<Legend />
{showDivider && firstLive && (
<ReferenceLine
x={firstLive}
stroke="#bbb"
strokeDasharray="3 3"
label={{
value: "bench.py → helexa-bench",
position: "top",
fill: "#999",
fontSize: 11,
}}
/>
)}
<Line
type="monotone"
dataKey="ttft"
name="TTFT (s)"
stroke="#dc3545"
connectNulls
/>
{base.length > 0 && (
<Line
type="monotone"
dataKey="baseTtft"
name="baseline (bench.py · gateway)"
stroke="#888"
strokeDasharray="5 5"
connectNulls
/>
)}
</LineChart>
</ResponsiveContainer>
</>
)}
</>
);
}

69
bench/src/types.ts Normal file

@@ -0,0 +1,69 @@
// Mirrors the JSON served by helexa-bench's read API (crates/helexa-bench/src/api.rs).
export interface BuildRef {
git_sha: string;
build_timestamp: string | null;
package_version: string | null;
}
export interface Dimensions {
hosts: string[];
models: string[];
scenarios: string[];
builds: BuildRef[];
/** host → GPU label, e.g. "2× RTX 5090". */
host_gpus: Record<string, string>;
/** model → GPU label (model maps to one host today). */
model_gpus: Record<string, string>;
}
/** Latest-SHA-per-cell medians (the report table). */
export interface ReportRow {
target_name: string;
model_id: string;
scenario_id: string;
prompt_size_approx: number;
git_sha: string;
prompt_tokens: number | null;
ttft_s_median: number | null;
decode_tps_median: number | null;
total_s_median: number | null;
samples: number;
/** Public-facing resource name (the host's GPU(s)). */
gpu: string | null;
}
/** One point in a per-build time-series for a (host, model, scenario) cell. */
export interface SeriesPoint {
git_sha: string;
build_timestamp: string | null;
package_version: string | null;
ttft_s_median: number | null;
decode_tps_median: number | null;
total_s_median: number | null;
samples: number;
}
export interface RunRow {
id: number;
ts: string;
host: string;
/** Public-facing resource name (the host's GPU(s)). */
gpu: string | null;
hostname: string | null;
git_sha: string;
build_timestamp: string | null;
package_version: string;
model_id: string;
harness: string;
scenario_id: string;
prompt_size_approx: number;
prompt_tokens_actual: number | null;
max_tokens: number;
ttft_s: number | null;
decode_tps: number | null;
total_s: number | null;
completion_tokens: number | null;
ok: boolean;
error: string | null;
}

9
bench/src/vite-env.d.ts vendored Normal file

@@ -0,0 +1,9 @@
/// <reference types="vite/client" />
interface ImportMetaEnv {
/** Base origin of the bench API. Empty → use the dev proxy / same origin. */
readonly VITE_API_BASE?: string;
}
interface ImportMeta {
readonly env: ImportMetaEnv;
}

22
bench/tsconfig.json Normal file

@@ -0,0 +1,22 @@
{
"compilerOptions": {
"target": "ES2022",
"useDefineForClassFields": true,
"lib": ["ES2022", "DOM", "DOM.Iterable"],
"module": "ESNext",
"skipLibCheck": true,
"moduleResolution": "bundler",
"allowImportingTsExtensions": true,
"resolveJsonModule": true,
"isolatedModules": true,
"moduleDetection": "force",
"noEmit": true,
"jsx": "react-jsx",
"strict": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"noFallthroughCasesInSwitch": true,
"types": ["node", "vite/client"]
},
"include": ["src", "vite.config.ts"]
}

18
bench/vite.config.ts Normal file

@@ -0,0 +1,18 @@
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react-swc";
// Dev server proxies /api to the bench API on bob so `fetch('/api/...')`
// works without CORS/mixed-origin fuss during local development.
// For a production build hosted elsewhere, set VITE_API_BASE to the bob
// API origin (e.g. http://bob.hanzalova.internal:13132) instead.
export default defineConfig({
plugins: [react()],
server: {
proxy: {
"/api": {
target: "http://bob.hanzalova.internal:13132",
changeOrigin: true,
},
},
},
});


@@ -5,6 +5,11 @@
# Environment variable overrides use CORTEX_ prefix with __ separators:
# CORTEX_GATEWAY__LISTEN=0.0.0.0:31313
# Path to the model catalogue (limits, cost, pinning, aliases, feasibility).
# Defaults to the packaged location below; uncomment to override for a
# non-packaged / local run.
# models_config = "/etc/cortex/models.toml"
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"


@@ -4,7 +4,7 @@ Release: 1%{?dist}
Summary: Inference gateway for multi-node GPU clusters
License: GPL-3.0-or-later
-URL: https://git.lair.cafe/helexa/cortex
+URL: https://git.lair.cafe/helexa/helexa
Source0: %{name}-%{version}.tar.gz
Source1: %{name}-%{version}-vendor.tar.gz


@@ -0,0 +1,119 @@
//! Build/version metadata shared between cortex and neuron.
//!
//! neuron captures these facts at compile time in its `build.rs`
//! (git SHA, enabled cargo features, rustc/candle versions, …) and
//! serves them from `GET /version`. cortex and `helexa-bench`
//! deserialize the same struct so a benchmark run can be attributed to
//! the exact daemon build that produced it — not just the host's CUDA
//! and driver versions that `/discovery` already reports.
//!
//! Every field beyond the always-present package version is
//! `#[serde(default)]` so a newer reader stays compatible with an
//! older neuron that omits a field (and vice versa) — the same
//! forward/backward-compat discipline as
//! [`crate::discovery::ActivationStatus`].
use serde::{Deserialize, Serialize};
/// Build-time identity of a neuron daemon.
///
/// Returned by `GET /version`. The `git_sha` is the canonical "which
/// build is live" key — benchmark records are bucketed by it, so a
/// regression can be pinned to a daemon change rather than a host
/// change. When neuron is built from a source tarball with no git
/// metadata available (and no `HELEXA_BUILD_SHA` injected by CI/RPM),
/// `git_sha` is the string `"unknown"`.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct BuildInfo {
/// Crate version from `CARGO_PKG_VERSION` (e.g. `"0.1.16"`).
pub package_version: String,
/// Short git SHA, or `"unknown"` when unavailable at build time.
#[serde(default = "unknown")]
pub git_sha: String,
/// Full 40-char git SHA when available.
#[serde(default)]
pub git_sha_long: Option<String>,
/// Whether the working tree had uncommitted changes at build time.
/// `false` when the SHA is unknown (tarball build).
#[serde(default)]
pub git_dirty: bool,
/// RFC3339 build timestamp.
#[serde(default)]
pub build_timestamp: Option<String>,
/// `rustc --version` output of the compiler used.
#[serde(default)]
pub rustc_version: Option<String>,
/// Cargo build profile: `"release"` or `"debug"`.
#[serde(default)]
pub profile: Option<String>,
/// Target triple the binary was compiled for.
#[serde(default)]
pub target: Option<String>,
/// Enabled cargo features (e.g. `["cuda", "cudnn"]`). These define
/// the performance envelope, so they are recorded against every
/// benchmark run.
#[serde(default)]
pub features: Vec<String>,
/// Locked `candle-core` version, best-effort from `Cargo.lock`.
#[serde(default)]
pub candle_version: Option<String>,
}
fn unknown() -> String {
"unknown".to_string()
}
impl BuildInfo {
/// A placeholder used by non-neuron benchmark targets (and tests)
/// that have no build metadata to report.
pub fn unknown() -> Self {
BuildInfo {
package_version: env!("CARGO_PKG_VERSION").to_string(),
git_sha: unknown(),
git_sha_long: None,
git_dirty: false,
build_timestamp: None,
rustc_version: None,
profile: None,
target: None,
features: Vec::new(),
candle_version: None,
}
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn round_trips_full() {
let info = BuildInfo {
package_version: "0.1.16".into(),
git_sha: "30d50d6".into(),
git_sha_long: Some("30d50d6abc123".into()),
git_dirty: true,
build_timestamp: Some("2026-06-13T10:00:00+00:00".into()),
rustc_version: Some("rustc 1.85.0".into()),
profile: Some("release".into()),
target: Some("x86_64-unknown-linux-gnu".into()),
features: vec!["cuda".into(), "cudnn".into()],
candle_version: Some("0.10.2".into()),
};
let json = serde_json::to_string(&info).unwrap();
let back: BuildInfo = serde_json::from_str(&json).unwrap();
assert_eq!(info, back);
}
#[test]
fn deserializes_minimal_payload() {
// An older neuron might send only the package version; every
// other field must default rather than fail.
let back: BuildInfo = serde_json::from_str(r#"{"package_version":"0.1.0"}"#).unwrap();
assert_eq!(back.package_version, "0.1.0");
assert_eq!(back.git_sha, "unknown");
assert!(!back.git_dirty);
assert!(back.features.is_empty());
assert!(back.candle_version.is_none());
}
}
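
The doc comments above say benchmark records are bucketed by `git_sha`. A minimal sketch of that bucketing follows, assuming a trimmed-down `Sample` record and a per-bucket median; both the struct and the function names are hypothetical, and helexa-bench's real aggregation may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down benchmark sample (the real record has more fields).
#[derive(Debug, Clone)]
pub struct Sample {
    pub git_sha: String,
    pub decode_tps: f64,
}

/// Median of a slice; averages the two middle values for even lengths.
/// Returns `None` for an empty slice.
fn median(values: &mut [f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_by(|a, b| a.total_cmp(b));
    let mid = values.len() / 2;
    Some(if values.len() % 2 == 0 {
        (values[mid - 1] + values[mid]) / 2.0
    } else {
        values[mid]
    })
}

/// Bucket samples by build SHA and reduce each bucket to its median decode rate.
/// Builds without git metadata all land in the shared "unknown" bucket.
pub fn median_decode_by_sha(samples: &[Sample]) -> BTreeMap<String, f64> {
    let mut buckets: BTreeMap<String, Vec<f64>> = BTreeMap::new();
    for s in samples {
        buckets.entry(s.git_sha.clone()).or_default().push(s.decode_tps);
    }
    buckets
        .into_iter()
        .filter_map(|(sha, mut v)| median(&mut v).map(|m| (sha, m)))
        .collect()
}
```

One consequence worth noting: every tarball build reports `"unknown"`, so those runs collapse into a single bucket and cannot be told apart by SHA alone.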


@@ -1,6 +1,7 @@
//! Model catalogue — profiles describing how to serve each model.
use crate::discovery::DeviceInfo;
use crate::harness::{ModelCost, ModelLimit};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::path::Path;
@@ -24,6 +25,32 @@ pub struct ModelProfile {
/// Neurons where this model should never be evicted.
#[serde(default)]
pub pinned_on: Vec<String>,
/// Source scheme this profile's weights come from. When set, the
/// router prefixes `id` with `scheme:` before forwarding the load
/// request to neuron, ensuring the daemon fetches from the right
/// registry regardless of which entry happens to match `id`.
///
/// `None` lets neuron substitute its own `default_source` (typically
/// `huggingface`). Set to `"helexa"` when the model is hosted in
/// the helexa registry — operator-procurement-grade audit relies
/// on this being explicit per model rather than implicit.
#[serde(default)]
pub source: Option<String>,
// ── Enrichment (issue #62) ────────────────────────────────
/// Per-model token budget. When present, advertised in `/v1/models`
/// so clients can size and compact their context automatically.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub limit: Option<ModelLimit>,
/// Operator-set pricing (USD per 1M tokens). `0.0` for self-hosted.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cost: Option<ModelCost>,
/// Static capability flags the operator wants to advertise even
/// before the model is loaded on any neuron (e.g. `"reasoning"`,
/// `"tool_call"`). Runtime-detected capabilities from the harness
/// are unioned with this set in the gateway's `/v1/models` response.
#[serde(default)]
pub capabilities: Vec<String>,
}
fn default_min_devices() -> u32 {
@@ -140,6 +167,10 @@ mod tests {
min_devices: 2,
min_device_vram_mb: Some(24_000),
pinned_on: vec![],
source: None,
limit: None,
cost: None,
capabilities: vec![],
}
}
@@ -197,6 +228,29 @@ mod tests {
assert_eq!(cat.resolve_alias("Qwen/Qwen3-8B"), "Qwen/Qwen3-8B");
}
#[test]
fn source_defaults_to_none_when_absent_from_toml() {
let src = r#"
[[models]]
id = "Qwen/Qwen3-30B"
harness = "candle"
"#;
let cat: ModelCatalogue = toml::from_str(src).expect("parse models table");
assert!(cat.models[0].source.is_none());
}
#[test]
fn source_round_trips_through_toml() {
let src = r#"
[[models]]
id = "Helexa/Qwen3.6-27B-Uncensored"
harness = "candle"
source = "helexa"
"#;
let cat: ModelCatalogue = toml::from_str(src).expect("parse models table");
assert_eq!(cat.models[0].source.as_deref(), Some("helexa"));
}
#[test]
fn aliases_table_round_trips_through_toml() {
let src = r#"


@@ -11,13 +11,21 @@ pub struct GatewayConfig {
pub eviction: EvictionSettings,
/// Neuron endpoints (replaces old NodeConfig with static vram_mb/pinned).
pub neurons: Vec<NeuronEndpoint>,
-/// Path to the model catalogue file (default: "models.toml").
+/// Path to the model catalogue file. Defaults to the packaged
+/// location (`/etc/cortex/models.toml`); set explicitly for
+/// non-packaged / local runs.
#[serde(default = "default_models_path")]
pub models_config: String,
}
fn default_models_path() -> String {
-"models.toml".into()
+// Absolute, so the systemd-launched binary finds the catalogue
+// regardless of its working directory. The RPM installs the catalogue
+// here (`cortex.spec`); a relative "models.toml" silently resolved to
+// the service cwd and left the catalogue empty in production
+// (pinning / aliases / limits all no-ops). Override via `models_config`
+// in cortex.toml for local runs.
+"/etc/cortex/models.toml".into()
}
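
The comment above explains why the default became absolute: a relative path is resolved against the process working directory. A small sketch of that behaviour, using only `std::path`; the helper name `resolve_against` and the example working directory `/` are illustrative assumptions, not code from this change.

```rust
use std::path::{Path, PathBuf};

/// Resolve a configured path the way the OS would for a process whose
/// working directory is `cwd`. `Path::join` keeps an absolute argument
/// as-is and appends a relative one to `cwd`.
pub fn resolve_against(cwd: &Path, configured: &str) -> PathBuf {
    cwd.join(configured)
}
```

With a service working directory of `/`, the old relative default would point at `/models.toml`, while the new absolute default resolves to the same file from any directory.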
#[derive(Debug, Clone, Serialize, Deserialize)]


@@ -22,6 +22,23 @@ pub struct DiscoveryResponse {
pub driver_version: Option<String>,
pub devices: Vec<DeviceInfo>,
pub harnesses: Vec<String>,
/// Set when the host has an NVIDIA stack that is currently
/// unusable — specifically the userspace↔kernel-module version
/// skew after an un-rebooted driver update ("Driver/library
/// version mismatch"), where every CUDA call including nvidia-smi
/// fails (#19). `None` on healthy hosts AND on hosts with no
/// NVIDIA stack at all (CPU-only is not an error). Carries an
/// operator-actionable description; cortex can read it to route
/// around the node instead of cold-loading into a guaranteed
/// failure.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cuda_unavailable_reason: Option<String>,
/// The neuron's effective maximum prompt size in tokens
/// (`NEURON_MAX_PROMPT_TOKENS`) — the enforced prompt cap on this
/// host. `#[serde(default)]` (→ 0) for forward-compat with neurons
/// that predate this field; cortex treats 0 as "unknown".
#[serde(default)]
pub max_prompt_tokens: u64,
}
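
The `max_prompt_tokens` comment says `0` means "unknown" on the cortex side. One way a reader of this field might express that is sketched below; `prompt_cap` and `prompt_fits` are hypothetical helpers, and letting the request through when the cap is unknown is an assumption rather than documented gateway behaviour.

```rust
/// Map the wire value to an explicit option: 0 (field absent on older
/// neurons) becomes `None`, anything else is the enforced cap.
pub fn prompt_cap(max_prompt_tokens: u64) -> Option<u64> {
    (max_prompt_tokens != 0).then_some(max_prompt_tokens)
}

/// Pre-check a prompt against the advertised cap. With an unknown cap this
/// sketch defers to the neuron (assumption) and reports that the prompt fits.
pub fn prompt_fits(prompt_tokens: u64, max_prompt_tokens: u64) -> bool {
    match prompt_cap(max_prompt_tokens) {
        Some(cap) => prompt_tokens <= cap,
        None => true,
    }
}
```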
/// Runtime health metrics for a single GPU device.


@@ -0,0 +1,257 @@
//! The OpenAI-standard error envelope (#60) and the rejection contract
//! that rides on it (#63).
//!
//! Every non-2xx response cortex and neuron emit uses the shape
//!
//! ```json
//! { "error": { "message": "...", "type": "...", "code": "...", "param": null } }
//! ```
//!
//! because OpenAI-compatible clients (opencode, the AI SDK, litellm, the
//! OpenAI SDKs) read `error.type` / `error.code` to decide what to do —
//! most importantly `code == "context_length_exceeded"` triggers
//! auto-compaction, and a `429` with `Retry-After` makes them back off and
//! retry rather than surfacing an opaque failure. A flat `{"error":"..."}`
//! string is invisible to that logic.
//!
//! This module is the single source of truth for that envelope. It is
//! deliberately **axum-agnostic** — cortex-core is a pure types crate — so
//! it carries the response as data (`status`, `body()`, `retry_after_secs`)
//! and each HTTP crate (cortex-gateway, neuron) owns a tiny adapter that
//! turns an [`OpenAiError`] into its framework's response type, setting the
//! `Retry-After` header when present.
//!
//! Retryable conditions **must** carry `Retry-After` (per #63). The named
//! constructors below encode that: [`OpenAiError::rate_limit_exceeded`] and
//! [`OpenAiError::service_unavailable`] take a retry hint;
//! [`OpenAiError::insufficient_quota`] (hard balance, no reset) and
//! [`OpenAiError::context_length_exceeded`] / [`OpenAiError::invalid_api_key`]
//! (permanent) do not. `402 Payment Required` is banned by the contract — use
//! `429 insufficient_quota` for hard budget exhaustion.
use serde_json::{Map, Value, json};
/// A rejection rendered in the OpenAI error envelope.
///
/// Build with [`OpenAiError::new`] (or a named constructor), refine with the
/// `with_*` builders, then hand to the consuming crate's adapter to turn into
/// an HTTP response.
#[derive(Debug, Clone)]
pub struct OpenAiError {
/// HTTP status code (e.g. `401`, `429`, `503`).
pub status: u16,
/// Broad OpenAI category — `"invalid_request_error"`, `"api_error"`,
/// `"rate_limit_error"`, …
pub error_type: String,
/// Specific machine-readable code clients key on (`"invalid_api_key"`,
/// `"rate_limit_exceeded"`, `"context_length_exceeded"`, …). `None`
/// renders as JSON `null`.
pub code: Option<String>,
/// Human-readable, actionable message.
pub message: String,
/// OpenAI's `param` field — the offending request parameter, if any.
pub param: Option<String>,
/// Seconds to advertise in the `Retry-After` header. Set only on
/// retryable conditions; `None` means no header.
pub retry_after_secs: Option<u64>,
/// Diagnostic fields merged *inside* the `error` object (e.g.
/// `prompt_len`, `max`, `free_mb`) so they don't break the envelope
/// shape. Clients ignore unknown keys.
pub extra: Map<String, Value>,
}
impl OpenAiError {
/// Construct an envelope with an explicit code. For a `null` code use
/// [`OpenAiError::without_code`].
pub fn new(
status: u16,
error_type: impl Into<String>,
code: impl Into<String>,
message: impl Into<String>,
) -> Self {
Self {
status,
error_type: error_type.into(),
code: Some(code.into()),
message: message.into(),
param: None,
retry_after_secs: None,
extra: Map::new(),
}
}
/// Construct an envelope whose `code` is `null` (e.g. an unclassified
/// internal error).
pub fn without_code(
status: u16,
error_type: impl Into<String>,
message: impl Into<String>,
) -> Self {
Self {
status,
error_type: error_type.into(),
code: None,
message: message.into(),
param: None,
retry_after_secs: None,
extra: Map::new(),
}
}
/// Advertise a `Retry-After` (seconds). Use on retryable rejections.
pub fn with_retry_after(mut self, secs: u64) -> Self {
self.retry_after_secs = Some(secs);
self
}
/// Set the OpenAI `param` field.
pub fn with_param(mut self, param: impl Into<String>) -> Self {
self.param = Some(param.into());
self
}
/// Merge one diagnostic field into the error object.
pub fn with_extra(mut self, key: impl Into<String>, value: Value) -> Self {
self.extra.insert(key.into(), value);
self
}
/// Merge a bag of diagnostic fields into the error object.
pub fn with_extras(mut self, extras: Map<String, Value>) -> Self {
for (k, v) in extras {
self.extra.insert(k, v);
}
self
}
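/// Render the `{ "error": { … } }` body. Field order is irrelevant to
/// clients (they parse JSON). The standard keys are inserted first, then
/// the diagnostic extras, so an extra that reuses a standard key
/// (`message`, `type`, `code`, `param`) overwrites it.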
pub fn body(&self) -> Value {
let mut error = Map::new();
error.insert("message".into(), Value::String(self.message.clone()));
error.insert("type".into(), Value::String(self.error_type.clone()));
error.insert(
"code".into(),
self.code.clone().map(Value::String).unwrap_or(Value::Null),
);
error.insert(
"param".into(),
self.param.clone().map(Value::String).unwrap_or(Value::Null),
);
for (k, v) in &self.extra {
error.insert(k.clone(), v.clone());
}
json!({ "error": Value::Object(error) })
}
// ── Named constructors for the #63 standard codes ──────────────────
/// `401 invalid_api_key` — missing/invalid bearer token (#49). Permanent.
pub fn invalid_api_key(message: impl Into<String>) -> Self {
Self::new(401, "invalid_request_error", "invalid_api_key", message)
}
/// `429 rate_limit_exceeded` + `Retry-After` — transient overload,
/// fair-share/in-flight cap, admission rejection, or a rolling budget
/// window that resets (#52/#53/#54/#55). Clients back off and retry.
pub fn rate_limit_exceeded(message: impl Into<String>, retry_after_secs: u64) -> Self {
Self::new(429, "rate_limit_error", "rate_limit_exceeded", message)
.with_retry_after(retry_after_secs)
}
/// `429 insufficient_quota` — hard balance exhausted, no reset (#52).
/// No `Retry-After`; the client surfaces and stops. (Never `402`.)
pub fn insufficient_quota(message: impl Into<String>) -> Self {
Self::new(429, "insufficient_quota", "insufficient_quota", message)
}
/// `400 context_length_exceeded` — prompt exceeds the model's context
/// window (#56/#60). Permanent for this request; opencode auto-compacts.
pub fn context_length_exceeded(message: impl Into<String>) -> Self {
Self::new(
400,
"invalid_request_error",
"context_length_exceeded",
message,
)
}
/// `503 service_unavailable` + optional `Retry-After` — transient
/// backend unavailability (no healthy nodes, recovery, fail-closed
/// upstream). Retryable when a hint is given.
pub fn service_unavailable(message: impl Into<String>, retry_after_secs: Option<u64>) -> Self {
let mut err = Self::new(503, "api_error", "service_unavailable", message);
err.retry_after_secs = retry_after_secs;
err
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn body_has_standard_envelope_shape() {
let env = OpenAiError::new(429, "rate_limit_error", "rate_limit_exceeded", "slow down");
let body = env.body();
let error = body.get("error").and_then(Value::as_object).unwrap();
assert_eq!(error["message"], "slow down");
assert_eq!(error["type"], "rate_limit_error");
assert_eq!(error["code"], "rate_limit_exceeded");
assert_eq!(error["param"], Value::Null);
}
#[test]
fn without_code_renders_null_code() {
let env = OpenAiError::without_code(500, "api_error", "kaboom");
assert_eq!(env.body()["error"]["code"], Value::Null);
}
#[test]
fn extras_ride_inside_the_error_object() {
let env = OpenAiError::context_length_exceeded("too long")
.with_extra("prompt_len", json!(60_000))
.with_extra("max", json!(49_152));
let error = &env.body()["error"];
assert_eq!(error["prompt_len"], 60_000);
assert_eq!(error["max"], 49_152);
assert_eq!(error["code"], "context_length_exceeded");
}
#[test]
fn rolling_window_rejection_carries_retry_after() {
let env = OpenAiError::rate_limit_exceeded("budget window", 30);
assert_eq!(env.status, 429);
assert_eq!(env.retry_after_secs, Some(30));
}
#[test]
fn hard_balance_rejection_has_no_retry_after() {
let env = OpenAiError::insufficient_quota("out of credit");
assert_eq!(env.status, 429);
assert_eq!(env.code.as_deref(), Some("insufficient_quota"));
assert_eq!(env.retry_after_secs, None);
}
#[test]
fn permanent_rejections_have_no_retry_after() {
assert_eq!(OpenAiError::invalid_api_key("nope").retry_after_secs, None);
assert_eq!(
OpenAiError::context_length_exceeded("too long").retry_after_secs,
None
);
}
#[test]
fn service_unavailable_retry_after_is_optional() {
assert_eq!(
OpenAiError::service_unavailable("recovering", Some(5)).retry_after_secs,
Some(5)
);
assert_eq!(
OpenAiError::service_unavailable("gone", None).retry_after_secs,
None
);
}
}
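
The module docs describe a per-framework adapter that sets `Retry-After` only when the envelope carries a hint. A framework-free sketch of that decision follows, assuming a trimmed stand-in struct `Rejection` and a hypothetical `response_headers` helper; the real adapters in cortex-gateway and neuron are not shown in this diff.

```rust
/// Trimmed stand-in for the envelope type above: only what an adapter
/// needs to pick the status line and headers.
pub struct Rejection {
    pub status: u16,
    pub retry_after_secs: Option<u64>,
}

/// Header pairs a hypothetical adapter would attach to the HTTP response.
/// `Retry-After` is emitted in its delta-seconds form, and only when the
/// rejection carries a retry hint.
pub fn response_headers(r: &Rejection) -> Vec<(&'static str, String)> {
    let mut headers = vec![("content-type", "application/json".to_string())];
    if let Some(secs) = r.retry_after_secs {
        headers.push(("retry-after", secs.to_string()));
    }
    headers
}
```

Keeping this logic in one helper per crate means a permanent rejection such as `401 invalid_api_key` can never pick up a `Retry-After` header by accident.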


@@ -36,6 +36,44 @@ pub struct ModelSpec {
pub devices: Option<Vec<u32>>,
}
/// Per-model token budget advertised by the catalogue or neuron.
///
/// `context` is the hard wall (the served max-seq-len). `input` is the
/// compaction trigger: when set, opencode treats it as "usable context =
/// input - reserved". When omitted, clients fall back to `context - output`.
/// `output` is the maximum number of generation tokens.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelLimit {
/// Hard wall — served max-seq-len in tokens.
pub context: usize,
/// Compaction trigger / usable input budget. When absent clients fall
/// back to `context - output`.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub input: Option<usize>,
/// Maximum number of generation tokens.
pub output: usize,
}
/// Operator-set pricing in USD per 1M tokens.
///
/// Self-hosted deployments typically leave both at `0.0`. Cache fields are
/// optional — set when the backend supports a prefix-cache discount tier.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelCost {
/// USD per 1M input (prompt) tokens.
#[serde(default)]
pub input: f64,
/// USD per 1M output (completion) tokens.
#[serde(default)]
pub output: f64,
/// USD per 1M cache-hit tokens (optional).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cache_read: Option<f64>,
/// USD per 1M cache-write tokens (optional).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cache_write: Option<f64>,
}
/// A model as reported by a harness.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelInfo {
@@ -44,6 +82,33 @@ pub struct ModelInfo {
pub status: String,
pub devices: Vec<u32>,
pub vram_used_mb: Option<u64>,
/// Modalities this loaded model supports. Today: `["text"]` for
/// text-only checkpoints, `["text", "vision"]` for vision-capable
/// ones (Stage B7). Clients like litellm / agent0 can gate
/// `image_url` submission on the advertised set.
///
/// Optional in the wire format so older clients that don't read
/// it stay compatible. Default-empty for absent/older data, which
/// callers can interpret as "text".
#[serde(default, skip_serializing_if = "Vec::is_empty")]
pub capabilities: Vec<String>,
// ── Enrichment (issue #62) ────────────────────────────────
/// Token budget advertised by the catalogue or discovered at load time.
/// `None` when neither the catalogue nor the loaded model can provide it.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub limit: Option<ModelLimit>,
/// Operator-set pricing in USD per 1M tokens (0.0 = free/self-hosted).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cost: Option<ModelCost>,
/// `true` when the model's tokenizer contains recognised tool-call
/// marker tokens (`<tool_call>` / `</tool_call>` convention).
#[serde(default)]
pub tool_call: bool,
/// `true` when the model's tokenizer contains recognised reasoning
/// marker tokens (`<think>` / `</think>` or similar).
#[serde(default)]
pub reasoning: bool,
}
/// What an inference harness must do, from neuron's perspective.
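
The `ModelLimit` docs describe how a client derives its usable input budget: take `input` when present, otherwise fall back to the context window minus the output reservation. A minimal sketch of that fallback, with a local copy of the struct (serde derives omitted) and a hypothetical helper name `usable_input_budget`:

```rust
/// Local mirror of the `ModelLimit` fields shown above.
pub struct ModelLimit {
    pub context: usize,
    pub input: Option<usize>,
    pub output: usize,
}

/// Usable input budget: the explicit `input` value when set, otherwise
/// `context - output`. Saturating subtraction guards against a
/// misconfigured `output` larger than `context`.
pub fn usable_input_budget(limit: &ModelLimit) -> usize {
    limit
        .input
        .unwrap_or_else(|| limit.context.saturating_sub(limit.output))
}
```

The saturating subtraction is a choice made for this sketch; a real client might instead reject such a configuration outright.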


@@ -1,10 +1,13 @@
pub mod anthropic;
pub mod build_info;
pub mod catalogue;
pub mod config;
pub mod discovery;
pub mod error_envelope;
pub mod harness;
pub mod metrics;
pub mod node;
pub mod openai;
pub mod responses;
pub mod source;
pub mod translate;


@@ -1,4 +1,5 @@
use crate::discovery::{ActivationStatus, DiscoveryResponse};
use crate::harness::{ModelCost, ModelLimit};
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
@@ -37,6 +38,27 @@ pub struct ModelEntry {
pub last_accessed: Option<DateTime<Utc>>,
/// Estimated VRAM usage in MB when loaded.
pub vram_estimate_mb: Option<u64>,
/// Modalities the loaded model advertises (e.g. `["text", "vision"]`),
/// copied verbatim from the neuron's `ModelInfo.capabilities` at poll
/// time. Empty when the neuron reports none. `#[serde(default)]` keeps
/// older persisted/serialised entries deserialisable.
#[serde(default)]
pub capabilities: Vec<String>,
/// Runtime-detected capability flags from the neuron's `/models`
/// response (`ModelInfo`). `false` when the neuron predates these
/// fields or hasn't reported them yet.
#[serde(default)]
pub tool_call: bool,
#[serde(default)]
pub reasoning: bool,
/// Self-derived token budget the neuron computed for this loaded
/// model (#67), copied from `ModelInfo.limit` at poll time. `None`
/// when the neuron doesn't compute one (arch without a context
/// profile, or derivation disabled). This is the authoritative
/// source the gateway advertises — operator-declared catalogue
/// limits are no longer consulted.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub limit: Option<ModelLimit>,
}
/// Model lifecycle status.
@@ -55,6 +77,12 @@ pub enum ModelStatus {
Unloaded,
Reloading,
Loading,
/// Reported by neuron while a poisoned model auto-recovers via
/// unload→reload (#17/#20). Temporarily unservable but NOT
/// evicted: the gateway holds the route, answers with a transient
/// retry error instead of 404, and must not race a second
/// placement elsewhere.
Recovering,
}
/// Unified model entry as exposed by the gateway's `/v1/models` endpoint.
@@ -85,6 +113,27 @@ pub struct CortexModelEntry {
/// disjoint from) `feasible_on` depending on whether the catalogue
/// covers this model.
pub locations: Vec<ModelLocation>,
/// Union of the modalities advertised by every neuron that has this
/// model loaded (e.g. `["text", "vision"]`). Empty for catalogue-only
/// entries with no loaded location — filled from catalogue profile
/// capabilities when available, then unioned with runtime-detected
/// values from loaded neurons.
#[serde(default)]
pub capabilities: Vec<String>,
// ── Enrichment (issue #62) ────────────────────────────────
/// Per-model token budget from the catalogue profile or discovered
/// at load time. `None` when neither source provides it.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub limit: Option<ModelLimit>,
/// Operator-set pricing in USD per 1M tokens (0.0 = free/self-hosted).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cost: Option<ModelCost>,
/// `true` when any neuron reports this model supports tool calls.
#[serde(default)]
pub tool_call: bool,
/// `true` when any neuron reports this model supports reasoning tokens.
#[serde(default)]
pub reasoning: bool,
}
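
The `capabilities` comment on `CortexModelEntry` describes a union of the catalogue profile's static flags with whatever each loaded neuron reports. A sketch of that merge under the assumption that order does not matter and duplicates are dropped; `union_capabilities` is a hypothetical name, and the gateway's actual merge code is not part of this hunk.

```rust
use std::collections::BTreeSet;

/// Union the operator-declared capabilities with the sets reported by
/// every neuron that has the model loaded. The result is de-duplicated
/// and sorted, which keeps the `/v1/models` output stable across polls.
pub fn union_capabilities(catalogue: &[String], per_neuron: &[Vec<String>]) -> Vec<String> {
    let mut set: BTreeSet<String> = catalogue.iter().cloned().collect();
    for caps in per_neuron {
        set.extend(caps.iter().cloned());
    }
    set.into_iter().collect()
}
```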
#[derive(Debug, Clone, Serialize, Deserialize)]


@@ -71,10 +71,18 @@ pub struct ChatCompletionChoice {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ChatCompletionChunk {
#[serde(default)]
pub id: String,
#[serde(default)]
pub object: String,
#[serde(default)]
pub created: u64,
// Lenient deserialization throughout: the gateway parses chunks
// from arbitrary OpenAI-compatible upstreams, and some engines
// omit fields on special frames (e.g. usage-only final chunks).
#[serde(default)]
pub model: String,
#[serde(default)]
pub choices: Vec<ChunkChoice>,
#[serde(skip_serializing_if = "Option::is_none")]
pub usage: Option<Usage>,
@@ -98,6 +106,31 @@ pub struct Usage {
pub prompt_tokens: u64,
pub completion_tokens: u64,
pub total_tokens: u64,
/// OpenAI-standard breakdown of `completion_tokens`. Optional and
/// additive — clients that don't read it are unaffected. Carries
/// `reasoning_tokens` for reasoning models (a sub-count of
/// `completion_tokens`, never added into `total_tokens`).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub completion_tokens_details: Option<CompletionTokensDetails>,
/// OpenAI-standard breakdown of `prompt_tokens`. Populated once
/// prompt caching lands (#11); `None` until then.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub prompt_tokens_details: Option<PromptTokensDetails>,
}
/// Sub-counts of `Usage::completion_tokens`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CompletionTokensDetails {
/// Tokens generated inside the model's reasoning span.
pub reasoning_tokens: u64,
}
/// Sub-counts of `Usage::prompt_tokens`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PromptTokensDetails {
/// Prompt tokens served from cache (cache-read rate). Populated
/// once prompt caching lands (#11).
pub cached_tokens: u64,
}
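
The `Usage` docs state that `reasoning_tokens` is a sub-count of `completion_tokens` and is never added into `total_tokens`. The arithmetic can be sketched as follows; the flattened `UsageSketch` struct and both helper names are illustrative assumptions rather than the types in this file.

```rust
/// Flattened stand-in for `Usage` plus its completion breakdown.
pub struct UsageSketch {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub total_tokens: u64,
    pub reasoning_tokens: Option<u64>,
}

/// Build a usage record. `reasoning` is already counted inside
/// `completion`, so the total is prompt + completion only.
pub fn build_usage(prompt: u64, completion: u64, reasoning: Option<u64>) -> UsageSketch {
    UsageSketch {
        prompt_tokens: prompt,
        completion_tokens: completion,
        total_tokens: prompt + completion,
        reasoning_tokens: reasoning,
    }
}

/// Completion tokens outside the reasoning span (the visible answer).
pub fn visible_completion_tokens(u: &UsageSketch) -> u64 {
    u.completion_tokens
        .saturating_sub(u.reasoning_tokens.unwrap_or(0))
}
```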
// ── Models list response ─────────────────────────────────────────────


@@ -202,6 +202,30 @@ pub struct ResponsesUsage {
pub input_tokens: u64,
pub output_tokens: u64,
pub total_tokens: u64,
/// OpenAI-standard breakdown of `output_tokens`. Optional and
/// additive. Carries `reasoning_tokens` for reasoning models (a
/// sub-count of `output_tokens`, never added into `total_tokens`).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub output_tokens_details: Option<OutputTokensDetails>,
/// OpenAI-standard breakdown of `input_tokens`. Populated once
/// prompt caching lands (#11); `None` until then.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub input_tokens_details: Option<InputTokensDetails>,
}
/// Sub-counts of `ResponsesUsage::output_tokens`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OutputTokensDetails {
/// Tokens generated inside the model's reasoning span.
pub reasoning_tokens: u64,
}
/// Sub-counts of `ResponsesUsage::input_tokens`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct InputTokensDetails {
/// Input tokens served from cache (cache-read rate). Populated
/// once prompt caching lands (#11).
pub cached_tokens: u64,
}
// ── Streaming event names ────────────────────────────────────────────
@@ -336,6 +360,8 @@ mod tests {
input_tokens: 5,
output_tokens: 3,
total_tokens: 8,
output_tokens_details: None,
input_tokens_details: None,
}),
};
let json = serde_json::to_string(&r).unwrap();


@@ -0,0 +1,267 @@
//! Scheme-qualified model identifiers.
//!
//! cortex/neuron historically resolves every model id through hf-hub
//! against `https://huggingface.co`. Helexa is adding an EU-hosted
//! registry (`registry.helexa.ai`) alongside HF — both speak the same
//! HF-compatible wire format, but the bytes, jurisdiction, and trust
//! root differ. Model ids therefore need a scheme:
//!
//! - `huggingface:Qwen/Qwen3.6-27B` — HF-hosted bytes
//! - `helexa:Qwen/Qwen3.6-27B-Uncensored` — helexa registry bytes
//! - `helexa:SomeOperator/CustomFinetune` — operator publishing
//! under the helexa namespace; same scheme handles all `org/name`
//! pairs hosted in that registry.
//!
//! Bare `org/name` parses with an empty scheme; the caller (typically
//! a harness) substitutes its configured default scheme so existing
//! configs keep working through the transition.
use serde::{Deserialize, Serialize};
use std::fmt;
use std::str::FromStr;
/// Parsed `scheme:org/name`. Bare `org/name` produces an empty scheme
/// — call `with_default_scheme` (or check `is_scheme_unset`) to
/// resolve before using.
#[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct ModelSourceId {
pub scheme: String,
pub org: String,
pub name: String,
}
/// Errors from `ModelSourceId::from_str`. Carries the offending input
/// so log lines / API errors can echo what the operator typed.
#[derive(Debug, Clone, PartialEq, Eq, thiserror::Error)]
pub enum ParseError {
#[error("empty model id")]
Empty,
#[error("model id '{0}' is missing the '/' between org and name")]
MissingSlash(String),
#[error("model id '{0}' has an empty scheme before ':'")]
EmptyScheme(String),
#[error("model id '{0}' has an empty org")]
EmptyOrg(String),
#[error("model id '{0}' has an empty name")]
EmptyName(String),
#[error("model id '{0}' has a scheme containing '/' which is reserved for org/name")]
SchemeContainsSlash(String),
#[error("model id '{0}' has a name containing ':' which is reserved for the scheme prefix")]
NameContainsColon(String),
}
impl ModelSourceId {
/// Construct directly from already-validated parts. Used by tests
/// and call sites that have the fields separately; the public API
/// for parsing user input is `FromStr`.
pub fn new(scheme: impl Into<String>, org: impl Into<String>, name: impl Into<String>) -> Self {
Self {
scheme: scheme.into(),
org: org.into(),
name: name.into(),
}
}
/// True when this id parsed from a bare `org/name` (no scheme
/// prefix). The harness substitutes its configured default in
/// `with_default_scheme` before resolving against a registry.
pub fn is_scheme_unset(&self) -> bool {
self.scheme.is_empty()
}
/// Substitute `default` for an empty scheme. No-op when the scheme
/// is already set. Returns self by value so it composes neatly:
/// `id.parse::<ModelSourceId>()?.with_default_scheme("huggingface")`.
pub fn with_default_scheme(mut self, default: &str) -> Self {
if self.scheme.is_empty() {
self.scheme = default.to_string();
}
self
}
/// The `org/name` half — what an hf-hub `Api::model(...)` call
/// expects regardless of which scheme/endpoint we're hitting.
pub fn repo_path(&self) -> String {
format!("{}/{}", self.org, self.name)
}
}
impl fmt::Display for ModelSourceId {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
if self.scheme.is_empty() {
write!(f, "{}/{}", self.org, self.name)
} else {
write!(f, "{}:{}/{}", self.scheme, self.org, self.name)
}
}
}
impl FromStr for ModelSourceId {
type Err = ParseError;
fn from_str(s: &str) -> Result<Self, Self::Err> {
if s.is_empty() {
return Err(ParseError::Empty);
}
// Scheme split. Only the *first* colon counts; anything after it
// belongs to org/name. A later `:` in the name is rejected below;
// the org is not checked for `:`.
let (scheme, rest) = match s.split_once(':') {
Some((scheme, rest)) => {
if scheme.is_empty() {
return Err(ParseError::EmptyScheme(s.to_string()));
}
if scheme.contains('/') {
return Err(ParseError::SchemeContainsSlash(s.to_string()));
}
(scheme.to_string(), rest)
}
None => (String::new(), s),
};
let (org, name) = rest
.split_once('/')
.ok_or_else(|| ParseError::MissingSlash(s.to_string()))?;
if org.is_empty() {
return Err(ParseError::EmptyOrg(s.to_string()));
}
if name.is_empty() {
return Err(ParseError::EmptyName(s.to_string()));
}
if name.contains(':') {
return Err(ParseError::NameContainsColon(s.to_string()));
}
Ok(Self {
scheme,
org: org.to_string(),
name: name.to_string(),
})
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn parses_qualified() {
let id: ModelSourceId = "huggingface:Qwen/Qwen3.6-27B".parse().unwrap();
assert_eq!(id.scheme, "huggingface");
assert_eq!(id.org, "Qwen");
assert_eq!(id.name, "Qwen3.6-27B");
assert_eq!(id.repo_path(), "Qwen/Qwen3.6-27B");
assert!(!id.is_scheme_unset());
}
#[test]
fn parses_helexa_scheme() {
let id: ModelSourceId = "helexa:SomeOperator/Qwen3.6-27B-Uncensored"
.parse()
.unwrap();
assert_eq!(id.scheme, "helexa");
assert_eq!(id.org, "SomeOperator");
assert_eq!(id.name, "Qwen3.6-27B-Uncensored");
}
#[test]
fn parses_bare_id_with_empty_scheme() {
let id: ModelSourceId = "Qwen/Qwen3-30B-A3B-Instruct".parse().unwrap();
assert_eq!(id.scheme, "");
assert_eq!(id.org, "Qwen");
assert_eq!(id.name, "Qwen3-30B-A3B-Instruct");
assert!(id.is_scheme_unset());
}
#[test]
fn substitutes_default_scheme_only_when_unset() {
let id: ModelSourceId = "Qwen/Q3".parse().unwrap();
assert_eq!(id.with_default_scheme("huggingface").scheme, "huggingface");
let id: ModelSourceId = "helexa:Qwen/Q3".parse().unwrap();
assert_eq!(
id.with_default_scheme("huggingface").scheme,
"helexa",
"default substitution must not override an explicit scheme"
);
}
#[test]
fn display_roundtrips_qualified_id() {
let s = "helexa:Helexa/Qwen3.6-27B";
let id: ModelSourceId = s.parse().unwrap();
assert_eq!(id.to_string(), s);
}
#[test]
fn display_roundtrips_bare_id() {
let s = "Qwen/Q3";
let id: ModelSourceId = s.parse().unwrap();
assert_eq!(id.to_string(), s);
}
#[test]
fn rejects_empty() {
assert_eq!("".parse::<ModelSourceId>().unwrap_err(), ParseError::Empty);
}
#[test]
fn rejects_missing_slash() {
match "Qwen".parse::<ModelSourceId>().unwrap_err() {
ParseError::MissingSlash(s) => assert_eq!(s, "Qwen"),
other => panic!("expected MissingSlash, got {other:?}"),
}
match "huggingface:Qwen".parse::<ModelSourceId>().unwrap_err() {
ParseError::MissingSlash(s) => assert_eq!(s, "huggingface:Qwen"),
other => panic!("expected MissingSlash, got {other:?}"),
}
}
#[test]
fn rejects_empty_scheme() {
match ":Qwen/Q3".parse::<ModelSourceId>().unwrap_err() {
ParseError::EmptyScheme(s) => assert_eq!(s, ":Qwen/Q3"),
other => panic!("expected EmptyScheme, got {other:?}"),
}
}
#[test]
fn rejects_scheme_with_slash() {
match "hugg/ingface:Q/N".parse::<ModelSourceId>().unwrap_err() {
ParseError::SchemeContainsSlash(s) => assert_eq!(s, "hugg/ingface:Q/N"),
other => panic!("expected SchemeContainsSlash, got {other:?}"),
}
}
#[test]
fn rejects_empty_org_or_name() {
match "huggingface:/N".parse::<ModelSourceId>().unwrap_err() {
ParseError::EmptyOrg(_) => {}
other => panic!("expected EmptyOrg, got {other:?}"),
}
match "huggingface:Q/".parse::<ModelSourceId>().unwrap_err() {
ParseError::EmptyName(_) => {}
other => panic!("expected EmptyName, got {other:?}"),
}
}
#[test]
fn rejects_name_with_colon() {
match "huggingface:Q/N:weird"
.parse::<ModelSourceId>()
.unwrap_err()
{
ParseError::NameContainsColon(s) => assert_eq!(s, "huggingface:Q/N:weird"),
other => panic!("expected NameContainsColon, got {other:?}"),
}
}
#[test]
fn serde_roundtrips_via_struct() {
// We serialize as a struct (scheme/org/name fields) so the
// shape is self-describing in API payloads. Callers that want
// the compact `scheme:org/name` string use `Display`/`FromStr`.
let id = ModelSourceId::new("helexa", "Helexa", "Qwen3.6-27B");
let json = serde_json::to_string(&id).unwrap();
let back: ModelSourceId = serde_json::from_str(&json).unwrap();
assert_eq!(back, id);
}
}
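How a harness might use these ids can be sketched as follows. This is a std-only approximation, not the code in the diff: `parse_model_id` mirrors the `FromStr` rules above but returns a tuple and `String` errors, and `endpoint_for` and `resolve` are hypothetical helpers. The `https://` prefix on `registry.helexa.ai` is an assumption; the source only names the host.

```rust
/// Minimal stand-in for `ModelSourceId::from_str`: returns
/// (scheme, org, name), with an empty scheme for a bare `org/name`.
pub fn parse_model_id(s: &str) -> Result<(String, String, String), String> {
    if s.is_empty() {
        return Err("empty model id".to_string());
    }
    // Only the first ':' separates the scheme from `org/name`.
    let (scheme, rest) = match s.split_once(':') {
        Some((scheme, rest)) => {
            if scheme.is_empty() {
                return Err(format!("model id '{s}' has an empty scheme"));
            }
            if scheme.contains('/') {
                return Err(format!("model id '{s}' has a '/' in the scheme"));
            }
            (scheme.to_string(), rest)
        }
        None => (String::new(), s),
    };
    let (org, name) = rest
        .split_once('/')
        .ok_or_else(|| format!("model id '{s}' is missing '/'"))?;
    if org.is_empty() || name.is_empty() {
        return Err(format!("model id '{s}' has an empty org or name"));
    }
    if name.contains(':') {
        return Err(format!("model id '{s}' has a ':' in the name"));
    }
    Ok((scheme, org.to_string(), name.to_string()))
}

/// Hypothetical scheme-to-endpoint table (the URL forms are assumptions).
pub fn endpoint_for(scheme: &str) -> Option<&'static str> {
    match scheme {
        "huggingface" => Some("https://huggingface.co"),
        "helexa" => Some("https://registry.helexa.ai"),
        _ => None,
    }
}

/// Resolve an id to (endpoint, repo_path), substituting `default_scheme`
/// when the id is a bare `org/name`.
pub fn resolve(id: &str, default_scheme: &str) -> Result<(&'static str, String), String> {
    let (scheme, org, name) = parse_model_id(id)?;
    let scheme = if scheme.is_empty() {
        default_scheme.to_string()
    } else {
        scheme
    };
    let endpoint =
        endpoint_for(&scheme).ok_or_else(|| format!("unknown scheme '{scheme}'"))?;
    Ok((endpoint, format!("{org}/{name}")))
}
```

With this shape, an existing config entry such as `Qwen/Q3` keeps resolving against the default registry, while `helexa:Qwen/Q3` is routed to the helexa endpoint without touching the `org/name` half.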

File diff suppressed because it is too large.

@@ -0,0 +1,211 @@
//! Streaming Anthropic SSE translation (#24).
//!
//! The `/v1/messages` handler translates the request envelope to
//! OpenAI before proxying (see `cortex_core::translate`); this module
//! completes the round trip for `stream: true` — the upstream OpenAI
//! SSE stream is re-framed, event by event, into Anthropic's
//! `message_start` / `content_block_*` / `message_delta` /
//! `message_stop` sequence as it arrives. True streaming: each
//! upstream chunk is translated and forwarded immediately; nothing is
//! buffered beyond the current SSE event's bytes.
//!
//! The translation state machine itself is pure and lives in
//! [`cortex_core::translate::AnthropicStreamTranslator`]; this module
//! owns the wire concerns — splitting the upstream byte stream into
//! SSE events, parsing `data:` payloads, and framing the translated
//! events as `event: <name>\ndata: <json>\n\n`.
use axum::body::Body;
use axum::http::StatusCode;
use axum::response::Response;
use bytes::Bytes;
use cortex_core::openai::ChatCompletionChunk;
use cortex_core::translate::AnthropicStreamTranslator;
use futures::StreamExt;
use tokio_stream::wrappers::ReceiverStream;
/// Forward the translated OpenAI request to the upstream node and
/// return the response translated to Anthropic SSE framing.
pub async fn stream_translated(
client: &reqwest::Client,
endpoint: &str,
openai_body: axum::body::Bytes,
model_id: &str,
node_name: &str,
) -> Response {
let url = format!("{endpoint}/v1/chat/completions");
tracing::info!(
handler = "anthropic_messages",
model = %model_id,
node = %node_name,
url = %url,
"proxying streaming request (anthropic SSE translation)"
);
let upstream = match client
.post(&url)
.header("content-type", "application/json")
.body(openai_body)
.send()
.await
{
Ok(r) => r,
Err(e) => {
tracing::warn!(
handler = "anthropic_messages",
node = %node_name,
url = %url,
error = %e,
"anthropic stream: upstream request failed"
);
return anthropic_error(StatusCode::BAD_GATEWAY, "upstream request failed");
}
};
let status = upstream.status();
if !status.is_success() {
tracing::warn!(
handler = "anthropic_messages",
node = %node_name,
url = %url,
status = status.as_u16(),
"anthropic stream: upstream returned non-2xx"
);
return anthropic_error(
StatusCode::from_u16(status.as_u16()).unwrap_or(StatusCode::BAD_GATEWAY),
"upstream returned an error",
);
}
// Bounded channel: a slow client back-pressures the pump task,
// which back-pressures the upstream read — same propagation
// discipline as neuron's own projectors.
let (tx, rx) = tokio::sync::mpsc::channel::<Result<Bytes, std::convert::Infallible>>(32);
let node = node_name.to_string();
let model = model_id.to_string();
tokio::spawn(async move {
let mut upstream = upstream.bytes_stream();
let mut translator = AnthropicStreamTranslator::new();
let mut buf: Vec<u8> = Vec::new();
let mut done = false;
// Wire-debug accounting for the stream summary emitted at the
// end: did the model emit a structured tool call, what was the
// final finish_reason, and how many upstream frames did we see.
let mut saw_tool_call = false;
let mut last_finish: Option<String> = None;
let mut frames = 0u64;
'outer: while let Some(block) = upstream.next().await {
let block = match block {
Ok(b) => b,
Err(e) => {
tracing::warn!(node = %node, error = %e, "anthropic stream: upstream read failed mid-stream");
break;
}
};
buf.extend_from_slice(&block);
// SSE events are separated by a blank line.
while let Some(pos) = find_event_boundary(&buf) {
let event: Vec<u8> = buf.drain(..pos + 2).collect();
let text = String::from_utf8_lossy(&event);
for line in text.lines() {
let Some(data) = line.strip_prefix("data:") else {
continue;
};
let data = data.trim();
if data == "[DONE]" {
done = true;
if !send_frames(&tx, translator.finish()).await {
break 'outer;
}
continue;
}
tracing::trace!(node = %node, frame = %data, "anthropic stream: upstream frame");
let Ok(chunk) = serde_json::from_str::<ChatCompletionChunk>(data) else {
tracing::debug!(node = %node, "anthropic stream: unparsable upstream frame skipped");
continue;
};
frames += 1;
if chunk
.choices
.iter()
.any(|c| c.delta.get("tool_calls").is_some())
{
saw_tool_call = true;
}
if let Some(fr) = chunk.choices.iter().find_map(|c| c.finish_reason.clone()) {
last_finish = Some(fr);
}
if !send_frames(&tx, translator.on_chunk(&chunk)).await {
break 'outer;
}
}
}
}
// Upstream ended without [DONE] (error or truncation): still
// close the Anthropic event sequence so clients aren't left
// with an unterminated message.
if !done {
let _ = send_frames(&tx, translator.finish()).await;
}
// Stream summary: the streaming counterpart to the non-streaming
// handler's "upstream response" line. `upstream_tool_calls =
// false` on a tools-bearing request is the fingerprint of the
// model improvising an unparsed tool-call format.
tracing::debug!(
wire = "anthropic",
model = %model,
node = %node,
frames,
upstream_tool_calls = saw_tool_call,
finish_reason = ?last_finish,
terminated = done,
"anthropic stream complete"
);
});
Response::builder()
.status(StatusCode::OK)
.header("content-type", "text/event-stream")
.header("cache-control", "no-cache")
.body(Body::from_stream(ReceiverStream::new(rx)))
.unwrap_or_else(|_| {
anthropic_error(
StatusCode::INTERNAL_SERVER_ERROR,
"failed to build response",
)
})
}
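/// Byte offset of the `\n\n` that ends the first complete SSE event in
/// `buf`, if any. Only LF-delimited events are detected; a `\r\n\r\n`
/// separator is not matched.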
fn find_event_boundary(buf: &[u8]) -> Option<usize> {
buf.windows(2).position(|w| w == b"\n\n")
}
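The buffering loop in `stream_translated` relies on this boundary search to reassemble events that arrive split across network chunks. A minimal std-only sketch of the same idea is shown here. `SseSplitter`, `find_boundary`, and `data_lines` are hypothetical names, and the extra tolerance for `\r\n\r\n` separators is an addition for illustration that the code in the diff does not implement.

```rust
/// Hypothetical incremental SSE splitter: feed arbitrary byte chunks,
/// get back every complete event seen so far.
#[derive(Default)]
pub struct SseSplitter {
    buf: Vec<u8>,
}

impl SseSplitter {
    /// Append `chunk`, then drain each complete event from the buffer.
    /// A trailing partial event stays buffered for the next call.
    pub fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.buf.extend_from_slice(chunk);
        let mut events = Vec::new();
        while let Some((end, sep_len)) = find_boundary(&self.buf) {
            let raw: Vec<u8> = self.buf.drain(..end + sep_len).collect();
            events.push(String::from_utf8_lossy(&raw[..end]).into_owned());
        }
        events
    }
}

/// Offset and length of the first blank-line separator. Accepts both
/// `\n\n` and `\r\n\r\n` (the latter is an assumption beyond the diff).
fn find_boundary(buf: &[u8]) -> Option<(usize, usize)> {
    let lf = buf.windows(2).position(|w| w == b"\n\n").map(|p| (p, 2));
    let crlf = buf.windows(4).position(|w| w == b"\r\n\r\n").map(|p| (p, 4));
    match (lf, crlf) {
        (Some(a), Some(b)) => Some(if a.0 <= b.0 { a } else { b }),
        (a, b) => a.or(b),
    }
}

/// Payloads of the `data:` lines in one event, trimmed of whitespace.
pub fn data_lines(event: &str) -> Vec<&str> {
    event
        .lines()
        .filter_map(|l| l.strip_prefix("data:"))
        .map(str::trim)
        .collect()
}
```

Feeding `data: {"a"` and then `:1}\n\n` yields nothing on the first call and one event on the second, which is the property the pump task depends on.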
/// Render translated events as SSE frames and send them. Returns
/// `false` when the client has gone away (receiver dropped).
async fn send_frames(
tx: &tokio::sync::mpsc::Sender<Result<Bytes, std::convert::Infallible>>,
events: Vec<(String, serde_json::Value)>,
) -> bool {
for (name, payload) in events {
let frame = format!("event: {name}\ndata: {payload}\n\n");
if tx.send(Ok(Bytes::from(frame))).await.is_err() {
return false;
}
}
true
}
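The wire form produced by `send_frames` can be illustrated without the channel. The sketch below is std-only; `render_frame` and `render_sequence` are hypothetical helpers, and the payloads are placeholder JSON strings rather than real Anthropic events.

```rust
/// Render one translated event in SSE wire form:
/// `event: <name>\ndata: <json>\n\n`.
/// Assumes `payload_json` is single-line (compact) JSON.
pub fn render_frame(name: &str, payload_json: &str) -> String {
    format!("event: {name}\ndata: {payload_json}\n\n")
}

/// Concatenate a sequence of (event name, payload) pairs into one
/// byte-for-byte SSE stream body.
pub fn render_sequence(events: &[(&str, &str)]) -> String {
    events
        .iter()
        .map(|&(name, payload)| render_frame(name, payload))
        .collect()
}
```

The single `data:` line works because compact JSON contains no raw newlines; a payload with embedded line breaks would need one `data:` line per line of content.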
/// Anthropic-shaped error body (`{"type":"error","error":{...}}`).
fn anthropic_error(status: StatusCode, message: &str) -> Response {
let body = serde_json::json!({
"type": "error",
"error": { "type": "api_error", "message": message }
});
Response::builder()
.status(status)
.header("content-type", "application/json")
.body(Body::from(body.to_string()))
.expect("static error response must build")
}

@@ -0,0 +1,24 @@
//! Gateway adapter that turns the shared, axum-agnostic
//! [`cortex_core::error_envelope::OpenAiError`] into an axum [`Response`],
//! setting the `Retry-After` header when the envelope carries one.
//!
//! cortex-core owns the envelope shape and the rejection contract (#60/#63);
//! this is the only place the gateway crosses from that data into axum.
use axum::http::{HeaderValue, StatusCode, header};
use axum::response::{IntoResponse, Json, Response};
use cortex_core::error_envelope::OpenAiError;
/// Render an [`OpenAiError`] as an axum response (status + JSON envelope +
/// optional `Retry-After`).
pub fn envelope_response(err: OpenAiError) -> Response {
let status = StatusCode::from_u16(err.status).unwrap_or(StatusCode::INTERNAL_SERVER_ERROR);
let retry_after = err.retry_after_secs;
let mut response = (status, Json(err.body())).into_response();
if let Some(secs) = retry_after
&& let Ok(value) = HeaderValue::from_str(&secs.to_string())
{
response.headers_mut().insert(header::RETRY_AFTER, value);
}
response
}
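The header logic in `envelope_response` can be sketched without axum. The struct below is a std-only stand-in for `cortex_core::error_envelope::OpenAiError`: its field names follow what the adapter reads (`status`, `retry_after_secs`) and the constructor arguments seen at the call sites, but the exact field types and the `response_headers` helper are assumptions for illustration. The codes in the usage are placeholders.

```rust
/// Stand-in for the shared envelope type (shape partly assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct OpenAiError {
    pub status: u16,
    pub typ: String,
    pub code: String,
    pub message: String,
    pub retry_after_secs: Option<u64>,
}

impl OpenAiError {
    pub fn new(
        status: u16,
        typ: impl Into<String>,
        code: impl Into<String>,
        message: impl Into<String>,
    ) -> Self {
        Self {
            status,
            typ: typ.into(),
            code: code.into(),
            message: message.into(),
            retry_after_secs: None,
        }
    }

    /// Builder-style setter, mirroring the `with_retry_after` call in the diff.
    pub fn with_retry_after(mut self, secs: u64) -> Self {
        self.retry_after_secs = Some(secs);
        self
    }
}

/// Hypothetical helper: the headers an adapter would attach.
pub fn response_headers(err: &OpenAiError) -> Vec<(&'static str, String)> {
    let mut headers = vec![("content-type", "application/json".to_string())];
    if let Some(secs) = err.retry_after_secs {
        // Retry-After in its delay-seconds form: a decimal integer.
        headers.push(("retry-after", secs.to_string()));
    }
    headers
}
```

Keeping the envelope free of any HTTP framework type is what lets both the gateway and neuron share it; only the final header insertion needs to know about axum.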

@@ -11,6 +11,8 @@ use axum::http::HeaderMap;
use axum::response::{IntoResponse, Json, Response};
use axum::routing::{get, post};
use chrono::Utc;
use cortex_core::error_envelope::OpenAiError;
use cortex_core::harness::ModelLimit;
use cortex_core::node::{CortexModelEntry, ModelLocation};
use serde_json::{Value, json};
use std::sync::Arc;
@@ -33,6 +35,7 @@ async fn chat_completions(
headers: HeaderMap,
body: Bytes,
) -> Response {
log_inbound("openai-chat", "/v1/chat/completions", &body);
let model_id = match extract_model(&body) {
Some(m) => m,
None => {
@@ -40,7 +43,12 @@ async fn chat_completions(
handler = "chat_completions",
"rejected: missing 'model' field in request body"
);
return error_response(400, "missing 'model' field in request body");
return error_response(
400,
"invalid_request_error",
"missing_model_field",
"missing 'model' field in request body",
);
}
};
@@ -53,11 +61,7 @@ async fn chat_completions(
error = %e,
"route resolve failed"
);
// RouteError's Display strings are short and informative
// ("model 'X' not found...", "no healthy nodes available")
// — fine to surface to the caller. The warn above carries
// any extra context for operators.
return error_response(404, &e.to_string());
return route_error_response(&e);
}
};
@@ -89,6 +93,7 @@ async fn responses(
headers: HeaderMap,
body: Bytes,
) -> Response {
log_inbound("openai-responses", "/v1/responses", &body);
let model_id = match extract_model(&body) {
Some(m) => m,
None => {
@@ -96,7 +101,12 @@ async fn responses(
handler = "responses",
"rejected: missing 'model' field in request body"
);
return error_response(400, "missing 'model' field in request body");
return error_response(
400,
"invalid_request_error",
"missing_model_field",
"missing 'model' field in request body",
);
}
};
@@ -109,7 +119,7 @@ async fn responses(
error = %e,
"route resolve failed"
);
return error_response(404, &e.to_string());
return route_error_response(&e);
}
};
@@ -133,6 +143,7 @@ async fn completions(
headers: HeaderMap,
body: Bytes,
) -> Response {
log_inbound("openai-completions", "/v1/completions", &body);
let model_id = match extract_model(&body) {
Some(m) => m,
None => {
@@ -140,7 +151,12 @@ async fn completions(
handler = "completions",
"rejected: missing 'model' field in request body"
);
return error_response(400, "missing 'model' field in request body");
return error_response(
400,
"invalid_request_error",
"missing_model_field",
"missing 'model' field in request body",
);
}
};
@@ -153,11 +169,7 @@ async fn completions(
error = %e,
"route resolve failed"
);
// RouteError's Display strings are short and informative
// ("model 'X' not found...", "no healthy nodes available")
// — fine to surface to the caller. The warn above carries
// any extra context for operators.
return error_response(404, &e.to_string());
return route_error_response(&e);
}
};
@@ -178,7 +190,7 @@ async fn completions(
/// `POST /v1/messages` — accept Anthropic format, translate, proxy, translate back.
async fn anthropic_messages(
State(fleet): State<Arc<CortexState>>,
headers: HeaderMap,
_headers: HeaderMap,
body: Bytes,
) -> Response {
// Parse as Anthropic request.
@@ -190,13 +202,48 @@ async fn anthropic_messages(
error = %e,
"rejected: invalid Anthropic request body"
);
return error_response(400, "invalid Anthropic request body");
return error_response(
400,
"invalid_request_error",
"invalid_anthropic_body",
"invalid Anthropic request body",
);
}
};
let model_id = anth_req.model.clone();
let is_streaming = anth_req.stream.unwrap_or(false);
// Wire-debug: make the exercised path and request shape concrete
// rather than guesswork. `tool_history` flags whether the client is
// continuing a tool conversation (tool_use/tool_result blocks in the
// message history) vs. opening a fresh one. Full bodies ride at
// trace! (cortex/neuron ship at info; operator infra runs at debug).
if tracing::enabled!(tracing::Level::DEBUG) {
let n_tools = anth_req
.extra
.get("tools")
.and_then(Value::as_array)
.map(|a| a.len())
.unwrap_or(0);
let tool_history = anth_req
.messages
.iter()
.any(|m| anthropic_message_has_tool_blocks(&m.content));
tracing::debug!(
wire = "anthropic",
endpoint = "/v1/messages",
model = %model_id,
stream = is_streaming,
messages = anth_req.messages.len(),
tools = n_tools,
tool_history,
system = anth_req.system.is_some(),
"inbound request"
);
}
tracing::trace!(wire = "anthropic", body = %body_preview(&body), "inbound anthropic body");
// Translate to OpenAI format.
let openai_req = cortex_core::translate::anthropic_to_openai(anth_req);
let openai_body = match serde_json::to_vec(&openai_req) {
@@ -208,7 +255,12 @@ async fn anthropic_messages(
error = %e,
"internal: failed to serialise translated OpenAI request"
);
return error_response(500, "internal translation error");
return error_response(
500,
"api_error",
"internal_translation_error",
"internal translation error",
);
}
};
@@ -225,7 +277,7 @@ async fn anthropic_messages(
// ("model 'X' not found...", "no healthy nodes available")
// — fine to surface to the caller. The warn above carries
// any extra context for operators.
return error_response(404, &e.to_string());
return route_error_response(&e);
}
};
@@ -235,6 +287,14 @@ async fn anthropic_messages(
// neuron's harness sees a model name that matches what it has
// loaded.
let openai_body = rewrite_model_in_body(openai_body, &route.resolved_model_id);
// The translated body is what neuron actually sees — the reshaped
// OpenAI-form tools live here. Tracing it makes "did the tool
// definitions survive translation?" a log line, not a guess.
tracing::trace!(
wire = "anthropic",
body = %body_preview(&openai_body),
"translated openai body (sent upstream)"
);
let labels = [
("model", route.resolved_model_id.clone()),
@@ -247,28 +307,23 @@ async fn anthropic_messages(
let start = Instant::now();
if is_streaming {
// TODO: streaming Anthropic translation requires converting SSE format.
// For now, proxy the OpenAI SSE stream directly (clients that can handle
// OpenAI SSE will work; full Anthropic SSE translation is a follow-up).
let result = proxy::forward_request(
// Anthropic SSE translation (#24): upstream speaks OpenAI SSE;
// re-frame it event-by-event into Anthropic's message_start /
// content_block_* / message_delta / message_stop sequence.
let resp = crate::anthropic_sse::stream_translated(
&fleet.http_client,
&route,
"/v1/chat/completions",
headers,
&route.endpoint,
openai_body,
&model_id,
&route.node_name,
)
.await;
metrics::histogram!("cortex_request_duration_seconds", &labels)
.record(start.elapsed().as_secs_f64());
match result {
Ok(resp) => resp,
Err(e) => {
metrics::counter!("cortex_request_errors_total", &labels).increment(1);
// forward_request already warn'd with the wire-level
// detail; no need to log again here.
e.into_response()
}
if !resp.status().is_success() {
metrics::counter!("cortex_request_errors_total", &labels).increment(1);
}
resp
} else {
// Non-streaming: proxy, buffer full response, translate back to Anthropic.
let target_url = format!("{}/v1/chat/completions", route.endpoint);
@@ -300,7 +355,12 @@ async fn anthropic_messages(
error = %e,
"upstream request failed (network)"
);
return error_response(502, "upstream request failed");
return error_response(
502,
"api_error",
"upstream_connection_error",
"upstream request failed",
);
}
};
@@ -319,7 +379,12 @@ async fn anthropic_messages(
body = %body_snippet,
"upstream returned non-2xx"
);
return error_response(status, &format!("upstream returned {status}"));
return error_response(
status,
"api_error",
"upstream_error",
&format!("upstream returned {status}"),
);
}
let body_bytes = match upstream_resp.bytes().await {
@@ -334,7 +399,12 @@ async fn anthropic_messages(
error = %e,
"failed to read upstream response body"
);
return error_response(502, "failed to read upstream response");
return error_response(
502,
"api_error",
"upstream_connection_error",
"failed to read upstream response",
);
}
};
@@ -356,17 +426,59 @@ async fn anthropic_messages(
body = %body_snippet,
"failed to parse upstream response as OpenAI ChatCompletionResponse"
);
return error_response(502, "malformed upstream response");
return error_response(
502,
"api_error",
"upstream_malformed_response",
"malformed upstream response",
);
}
};
metrics::histogram!("cortex_request_duration_seconds", &labels)
.record(start.elapsed().as_secs_f64());
// Did the model actually produce a structured tool call, or just
// text? This is the single most useful signal for "is tool
// calling working end-to-end" — a `false` here alongside a
// request that carried tools means the model improvised an
// unparsed format (the original failure mode).
let upstream_tool_calls = openai_resp.choices.iter().any(|c| {
c.message
.extra
.get("tool_calls")
.and_then(Value::as_array)
.map(|a| !a.is_empty())
.unwrap_or(false)
});
let finish_reason = openai_resp
.choices
.first()
.and_then(|c| c.finish_reason.clone());
tracing::debug!(
wire = "anthropic",
model = %model_id,
node = %route.node_name,
upstream_tool_calls,
finish_reason = ?finish_reason,
"upstream non-streaming response"
);
let anthropic_resp = cortex_core::translate::openai_to_anthropic(openai_resp);
Json(json!(anthropic_resp)).into_response()
}
}
/// Combine two self-derived limits for the same model loaded on
/// different neurons (#67): keep the tightest (smallest `context`) so a
/// client sized against the advertised limit never overflows the
/// most-constrained deployment that might serve the request. `None`
/// means "that neuron reported no limit"; the present one wins.
fn tightest_limit(a: Option<ModelLimit>, b: Option<ModelLimit>) -> Option<ModelLimit> {
match (a, b) {
(None, x) | (x, None) => x,
(Some(a), Some(b)) => Some(if b.context < a.context { b } else { a }),
}
}
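Applied across every neuron that has a model loaded, `tightest_limit` behaves like a fold. The sketch below shows that with a std-only stand-in for `ModelLimit` that carries only the `context` field the function compares (the real type may have more fields). `advertised_limit` is a hypothetical wrapper, not part of the diff.

```rust
/// Stand-in for `cortex_core::harness::ModelLimit`, reduced to the one
/// field `tightest_limit` inspects.
#[derive(Debug, Clone, PartialEq)]
pub struct ModelLimit {
    pub context: u64,
}

/// Same logic as the function in the diff: keep the smaller `context`;
/// a missing limit defers to whichever side reported one.
pub fn tightest_limit(a: Option<ModelLimit>, b: Option<ModelLimit>) -> Option<ModelLimit> {
    match (a, b) {
        (None, x) | (x, None) => x,
        (Some(a), Some(b)) => Some(if b.context < a.context { b } else { a }),
    }
}

/// Hypothetical wrapper: fold the per-neuron limits into the single
/// value a gateway would advertise.
pub fn advertised_limit(per_neuron: Vec<Option<ModelLimit>>) -> Option<ModelLimit> {
    per_neuron.into_iter().fold(None, tightest_limit)
}
```

Because `None` acts as the identity element, the fold can start from `None` and the order in which neurons are visited does not change the advertised `context`.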
/// `GET /v1/models` — union of (catalogue × topology feasibility) and
/// (currently loaded somewhere). The result is what the fleet *could*
/// serve, not just what's already loaded — so OpenAI-compatible tools
@@ -414,6 +526,20 @@ async fn list_models(State(fleet): State<Arc<CortexState>>) -> Json<Value> {
loaded: false,
feasible_on,
locations: Vec::new(),
// Start with catalogue-declared capabilities; Pass 2 unions
// runtime-detected ones from loaded neurons.
capabilities: profile.capabilities.clone(),
// `limit` is no longer operator-declared (#67): the neuron
// self-derives it from live VRAM + throughput and reports it
// per loaded model — Pass 2 fills it from the neuron's
// ModelEntry. A catalogue `limit`, if present, is ignored
// (it can't track hot-swapped models or live capacity).
// `cost` stays operator-set and flows from the catalogue.
limit: None,
cost: profile.cost.clone(),
// Runtime-detected — will be OR-ed in Pass 2 from neuron data.
tool_call: false,
reasoning: false,
},
);
}
@@ -438,6 +564,23 @@ async fn list_models(State(fleet): State<Arc<CortexState>>) -> Json<Value> {
if was_loaded {
e.loaded = true;
}
// Union the per-node capabilities so a model loaded
// on several neurons reports every modality any of
// them advertises.
for cap in &entry.capabilities {
if !e.capabilities.contains(cap) {
e.capabilities.push(cap.clone());
}
}
// OR-in runtime-detected capability flags from the neuron.
e.tool_call = e.tool_call || entry.tool_call;
e.reasoning = e.reasoning || entry.reasoning;
// Adopt the neuron's self-derived limit (#67). When a
// model is loaded on several neurons with different
// headroom, advertise the tightest (smallest context)
// so a client never overflows the most-constrained
// deployment that might serve it.
e.limit = tightest_limit(e.limit.take(), entry.limit.clone());
})
.or_insert_with(|| CortexModelEntry {
id: model_id.clone(),
@@ -449,6 +592,11 @@ async fn list_models(State(fleet): State<Arc<CortexState>>) -> Json<Value> {
// feasibility; leave empty.
feasible_on: Vec::new(),
locations: vec![location],
capabilities: entry.capabilities.clone(),
limit: entry.limit.clone(),
cost: None,
tool_call: entry.tool_call,
reasoning: entry.reasoning,
});
}
}
@@ -498,6 +646,13 @@ async fn list_models(State(fleet): State<Arc<CortexState>>) -> Json<Value> {
loaded: false,
feasible_on: Vec::new(),
locations: vec![location],
// A model that's only mid-prewarm has no loaded
// location to read capabilities from yet.
capabilities: Vec::new(),
limit: None,
cost: None,
tool_call: false,
reasoning: false,
});
}
}
@@ -527,6 +682,11 @@ async fn list_models(State(fleet): State<Arc<CortexState>>) -> Json<Value> {
loaded: target_entry.loaded,
feasible_on: target_entry.feasible_on,
locations: target_entry.locations,
capabilities: target_entry.capabilities,
limit: target_entry.limit.clone(),
cost: target_entry.cost.clone(),
tool_call: target_entry.tool_call,
reasoning: target_entry.reasoning,
},
);
}
@@ -575,7 +735,8 @@ async fn proxy_with_metrics(
}
let start = Instant::now();
let result = proxy::forward_request(&fleet.http_client, route, path, headers, body).await;
let result =
proxy::forward_request(&fleet.http_client, route, path, headers, body, model_id).await;
let duration = start.elapsed();
match result {
@@ -609,6 +770,57 @@ fn extract_model(body: &[u8]) -> Option<String> {
v.get("model")?.as_str().map(|s| s.to_string())
}
/// Emit a uniform wire-debug summary for an OpenAI-family inbound
/// request (chat/completions, completions, responses). Makes which
/// surface a client exercised — and whether it sent tools / asked for
/// streaming — a concrete log line. The full body rides at trace!.
///
/// Parsing is gated on the debug level being enabled so info-level
/// deployments pay nothing.
fn log_inbound(wire: &str, endpoint: &str, body: &[u8]) {
if tracing::enabled!(tracing::Level::DEBUG) {
let v: Value = match serde_json::from_slice(body) {
Ok(v) => v,
Err(_) => return,
};
let model = v.get("model").and_then(Value::as_str).unwrap_or("?");
let stream = v.get("stream").and_then(Value::as_bool).unwrap_or(false);
let tools = v
.get("tools")
.and_then(Value::as_array)
.map(|a| a.len())
.unwrap_or(0);
tracing::debug!(wire, endpoint, model, stream, tools, "inbound request");
}
tracing::trace!(wire, endpoint, body = %body_preview(body), "inbound body");
}
/// True if an Anthropic message's content carries any `tool_use` or
/// `tool_result` block — i.e. the client is mid tool-conversation.
fn anthropic_message_has_tool_blocks(content: &cortex_core::anthropic::AnthropicContent) -> bool {
use cortex_core::anthropic::AnthropicContent;
match content {
AnthropicContent::Text(_) => false,
AnthropicContent::Blocks(blocks) => blocks
.iter()
.any(|b| matches!(b.block_type.as_str(), "tool_use" | "tool_result")),
}
}
/// Render a UTF-8-safe, length-capped preview of a request/response
/// body for trace logging. Caps by characters (not bytes) so the slice
/// can never split a multi-byte codepoint.
fn body_preview(body: &[u8]) -> String {
const MAX_CHARS: usize = 8192;
let text = String::from_utf8_lossy(body);
if text.chars().count() > MAX_CHARS {
let head: String = text.chars().take(MAX_CHARS).collect();
format!("{head}…<truncated, {} bytes total>", body.len())
} else {
text.into_owned()
}
}
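The reason for capping by characters rather than bytes is easiest to see on a short multi-byte string. The sketch below repeats the logic of `body_preview` with the cap passed in as a parameter (`max_chars` is an assumption made so the example stays small; the function in the diff uses a fixed `MAX_CHARS` of 8192).

```rust
/// Same approach as `body_preview`, with a caller-supplied cap so the
/// truncation path can be exercised on a short input.
pub fn preview(body: &[u8], max_chars: usize) -> String {
    let text = String::from_utf8_lossy(body);
    if text.chars().count() > max_chars {
        // `chars().take(n)` always stops on a codepoint boundary.
        let head: String = text.chars().take(max_chars).collect();
        format!("{head}…<truncated, {} bytes total>", body.len())
    } else {
        text.into_owned()
    }
}
```

For `"héllo wörld"` (11 characters, 13 bytes) and a cap of 5, the preview is `héllo` followed by the truncation marker reporting 13 bytes. A byte-based cut at offset 2 would land inside `é`, which `str::is_char_boundary` reports as not a valid slice point.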
/// Rewrite the `model` field of an OpenAI-style JSON request body to
/// the resolved concrete id. Returns the original bytes if `new_model`
/// matches what's already there or the body fails to parse — the
@@ -641,14 +853,16 @@ fn rewrite_model_in_body(body: Bytes, new_model: &str) -> Bytes {
}
}
fn error_response(status: u16, message: &str) -> Response {
let code = axum::http::StatusCode::from_u16(status)
.unwrap_or(axum::http::StatusCode::INTERNAL_SERVER_ERROR);
let body = json!({
"error": {
"message": message,
"type": "gateway_error",
}
});
(code, Json(body)).into_response()
fn error_response(status: u16, typ: &str, code: &str, message: &str) -> Response {
crate::error::envelope_response(OpenAiError::new(status, typ, code, message))
}
/// Render a [`RouteError`] in the standard envelope, attaching `Retry-After`
/// for its transient variants (#63).
fn route_error_response(e: &router::RouteError) -> Response {
let mut env = OpenAiError::new(e.http_status(), e.broad_type(), e.code(), e.to_string());
if let Some(secs) = e.retry_after_secs() {
env = env.with_retry_after(secs);
}
crate::error::envelope_response(env)
}
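`route_error_response` depends on four accessors of `RouteError` whose bodies are not shown in this diff. One plausible shape is sketched below. The method names come from the call site above; everything else is an assumption for illustration: the two variants (inferred from the Display strings quoted in the removed comments), the status codes, the type and code strings, and the 5-second retry hint.

```rust
/// Hypothetical reduction of `router::RouteError` to two variants.
#[derive(Debug, Clone, PartialEq)]
pub enum RouteError {
    ModelNotFound(String),
    NoHealthyNodes,
}

impl RouteError {
    /// HTTP status for the rejection (values assumed).
    pub fn http_status(&self) -> u16 {
        match self {
            RouteError::ModelNotFound(_) => 404,
            RouteError::NoHealthyNodes => 503,
        }
    }

    /// Broad OpenAI-style error type (strings assumed).
    pub fn broad_type(&self) -> &'static str {
        match self {
            RouteError::ModelNotFound(_) => "invalid_request_error",
            RouteError::NoHealthyNodes => "api_error",
        }
    }

    /// Machine-readable code (strings assumed).
    pub fn code(&self) -> &'static str {
        match self {
            RouteError::ModelNotFound(_) => "model_not_found",
            RouteError::NoHealthyNodes => "no_healthy_nodes",
        }
    }

    /// Only transient conditions carry a retry hint (5 s is a placeholder).
    pub fn retry_after_secs(&self) -> Option<u64> {
        match self {
            RouteError::ModelNotFound(_) => None,
            RouteError::NoHealthyNodes => Some(5),
        }
    }
}
```

Putting the classification on the error type keeps the handlers down to a single `route_error_response(&e)` call, so a permanent failure such as an unknown model never tells the client to retry.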

@@ -1,3 +1,5 @@
pub mod anthropic_sse;
pub mod error;
pub mod evictor;
pub mod handlers;
pub mod metrics;

@@ -46,6 +46,14 @@ fn describe_metrics() {
"Generation throughput in tokens per second"
);
metrics::describe_counter!("cortex_requests_total", "Total number of proxied requests");
metrics::describe_counter!(
"cortex_prompt_tokens_total",
"Total prompt tokens reported by upstream usage objects"
);
metrics::describe_counter!(
"cortex_completion_tokens_total",
"Total completion tokens reported by upstream usage objects"
);
metrics::describe_counter!(
"cortex_request_errors_total",
"Total number of failed proxy requests"


@@ -26,14 +26,23 @@ pub async fn poll_once(fleet: &CortexState) {
}
}
/// One-shot fetch of `GET /discovery`. Cached on the NodeState forever
/// after the first success — topology is invariant for a given neuron
/// process. Skipped when the cache is already populated.
/// Fetch `GET /discovery` and cache it on the NodeState — topology is
/// invariant for a given neuron process, so a successful fetch is kept.
/// Re-polled only while `max_prompt_tokens` is still unknown (0): on a
/// rolling deploy cortex can win the race and cache a neuron's discovery
/// before that neuron reports the field (it deserialises to 0). Re-polling
/// until a real cap arrives self-heals that without periodic polling.
async fn maybe_poll_discovery(fleet: &CortexState, name: &str, endpoint: &str) {
{
let nodes = fleet.nodes.read().await;
match nodes.get(name) {
Some(n) if n.discovery.is_some() => return,
Some(n)
if n.discovery
.as_ref()
.is_some_and(|d| d.max_prompt_tokens > 0) =>
{
return;
}
_ => {}
}
}
@@ -107,12 +116,22 @@ async fn poll_neuron(fleet: &CortexState, name: &str, endpoint: &str) {
.and_modify(|e| {
e.status = status;
e.vram_estimate_mb = upstream.vram_used_mb;
e.capabilities = upstream.capabilities.clone();
e.tool_call = upstream.tool_call;
e.reasoning = upstream.reasoning;
// Neuron's self-derived limit (#67) — the
// authoritative source the gateway advertises.
e.limit = upstream.limit.clone();
})
.or_insert_with(|| ModelEntry {
id: upstream.id.clone(),
status,
last_accessed: None,
vram_estimate_mb: upstream.vram_used_mb,
capabilities: upstream.capabilities.clone(),
tool_call: upstream.tool_call,
reasoning: upstream.reasoning,
limit: upstream.limit.clone(),
});
}
@@ -195,6 +214,7 @@ fn parse_status(s: &str) -> ModelStatus {
"unloaded" => ModelStatus::Unloaded,
"reloading" => ModelStatus::Reloading,
"loading" => ModelStatus::Loading,
"recovering" => ModelStatus::Recovering,
_ => ModelStatus::Loaded,
}
}
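The re-poll gate added to `maybe_poll_discovery` above can be read as a single predicate. The sketch below isolates it under the assumption that a cached discovery record exposes `max_prompt_tokens` as an unsigned integer that deserialises to 0 when the neuron has not reported it yet; the `Discovery` struct and the `should_poll_discovery` name are hypothetical and exist only for illustration.

```rust
/// Hypothetical, trimmed-down discovery record: only the field the gate reads.
pub struct Discovery {
    pub max_prompt_tokens: u64,
}

/// Poll when nothing is cached yet, or when the cached record still
/// carries the "unknown" cap of 0. A record with a real cap is kept.
pub fn should_poll_discovery(cached: Option<&Discovery>) -> bool {
    !cached.is_some_and(|d| d.max_prompt_tokens > 0)
}
```

Keying the gate on the one field that can arrive late, rather than on a timer, means a fully populated record is never fetched again, while a record cached during a rolling deploy keeps being refreshed until the cap shows up.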


@@ -9,7 +9,12 @@ use anyhow::Result;
use axum::body::Body;
use axum::http::{HeaderMap, StatusCode};
use axum::response::{IntoResponse, Response};
use futures::Stream;
use futures::stream::BoxStream;
use reqwest::Client;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Instant;
/// Proxy a request body to the resolved backend node and stream the response.
///
@@ -25,7 +30,9 @@ pub async fn forward_request(
path: &str,
headers: HeaderMap,
body: bytes::Bytes,
model_id: &str,
) -> Result<Response, ProxyError> {
let request_start = Instant::now();
let url = format!("{}{}", route.endpoint, path);
tracing::info!(
node = %route.node_name,
@@ -73,7 +80,10 @@ pub async fn forward_request(
let status = StatusCode::from_u16(upstream_status.as_u16()).unwrap_or(StatusCode::BAD_GATEWAY);
let resp_headers = upstream_resp.headers().clone();
let stream = upstream_resp.bytes_stream();
let stream = TokenMetricsStream::new(
Box::pin(upstream_resp.bytes_stream()),
TokenMetrics::new(model_id, &route.node_name, request_start),
);
let body = Body::from_stream(stream);
@@ -103,19 +113,244 @@ pub enum ProxyError {
impl IntoResponse for ProxyError {
fn into_response(self) -> Response {
let (status, message) = match &self {
ProxyError::Upstream(_) => (StatusCode::BAD_GATEWAY, "upstream request failed"),
let (status, code, message) = match &self {
ProxyError::Upstream(_) => (
StatusCode::BAD_GATEWAY,
"upstream_connection_error",
"upstream request failed",
),
ProxyError::ResponseBuild(_) => (
StatusCode::INTERNAL_SERVER_ERROR,
"internal_server_error",
"failed to build response",
),
};
let body = serde_json::json!({
"error": {
"message": message,
"type": "proxy_error",
}
});
(status, axum::Json(body)).into_response()
crate::error::envelope_response(cortex_core::error_envelope::OpenAiError::new(
status.as_u16(),
"api_error",
code,
message,
))
}
}
// ── Per-request token metrics (#21) ─────────────────────────────────
//
// The proxy never buffers or re-serialises the upstream body — chunks
// are forwarded verbatim. For metrics it observes each chunk's arrival
// time and keeps a bounded tail of the body text, from which the final
// OpenAI `usage` object (present on the last SSE chunk and on
// non-streaming JSON bodies alike) yields engine-truth token counts.
//
// Emitted per request, labelled {model, node}:
// cortex_time_to_first_token_seconds (histogram) — first body chunk
// cortex_tokens_per_second (histogram) — completion tokens
// over the decode window (first→last chunk); falls back to the
// full request duration for single-chunk (non-streaming) bodies
// cortex_prompt_tokens_total / cortex_completion_tokens_total (counters)
/// Cap on the retained body tail. The usage object rides on the final
/// chunk, so a generous tail is plenty; the cap bounds memory on huge
/// non-streaming bodies.
const TAIL_CAP_BYTES: usize = 64 * 1024;
/// Find the value of the LAST `"key": <integer>` occurrence in `tail`.
/// Pure and chunk-boundary-safe (the tail is contiguous appended text).
/// The quoted-needle form means `completion_tokens` never matches
/// `completion_tokens_details`.
pub(crate) fn last_count_for(tail: &str, key: &str) -> Option<u64> {
let needle = format!("\"{key}\"");
let mut result = None;
for (idx, _) in tail.match_indices(&needle) {
let rest = tail[idx + needle.len()..].trim_start();
let Some(rest) = rest.strip_prefix(':') else {
continue;
};
let rest = rest.trim_start();
let digits: &str = &rest[..rest
.char_indices()
.find(|(_, c)| !c.is_ascii_digit())
.map(|(i, _)| i)
.unwrap_or(rest.len())];
if let Ok(v) = digits.parse::<u64>() {
result = Some(v);
}
}
result
}
struct TokenMetrics {
labels: [(&'static str, String); 2],
request_start: Instant,
first_chunk: Option<Instant>,
last_chunk: Option<Instant>,
tail: String,
finished: bool,
}
impl TokenMetrics {
fn new(model_id: &str, node_name: &str, request_start: Instant) -> Self {
Self {
labels: [
("model", model_id.to_string()),
("node", node_name.to_string()),
],
request_start,
first_chunk: None,
last_chunk: None,
tail: String::new(),
finished: false,
}
}
fn observe(&mut self, chunk: &[u8]) {
let now = Instant::now();
self.first_chunk.get_or_insert(now);
self.last_chunk = Some(now);
self.tail.push_str(&String::from_utf8_lossy(chunk));
if self.tail.len() > TAIL_CAP_BYTES {
// Keep the newest half; the usage object is always at the
// very end of the body. Split at a char boundary.
let mut cut = self.tail.len() - TAIL_CAP_BYTES / 2;
while !self.tail.is_char_boundary(cut) {
cut += 1;
}
self.tail.drain(..cut);
}
}
/// Emit the metrics exactly once — called on clean stream end and
/// from Drop (client disconnect mid-stream still records what we
/// saw).
fn finish(&mut self) {
if self.finished {
return;
}
self.finished = true;
let Some(first) = self.first_chunk else {
return; // no body ever arrived — nothing to record
};
let ttft = first.duration_since(self.request_start).as_secs_f64();
metrics::histogram!("cortex_time_to_first_token_seconds", &self.labels).record(ttft);
if let Some(prompt) = last_count_for(&self.tail, "prompt_tokens") {
metrics::counter!("cortex_prompt_tokens_total", &self.labels).increment(prompt);
}
let Some(completion) = last_count_for(&self.tail, "completion_tokens") else {
return;
};
if completion == 0 {
return;
}
metrics::counter!("cortex_completion_tokens_total", &self.labels).increment(completion);
let last = self.last_chunk.unwrap_or(first);
let decode_window = last.duration_since(first).as_secs_f64();
// Streaming: rate over the decode window (first→last chunk).
// Non-streaming bodies arrive as ~one chunk (window ≈ 0), where
// the only honest denominator is the full request duration.
let secs = if decode_window >= 0.1 {
decode_window
} else {
last.duration_since(self.request_start).as_secs_f64()
};
if secs > 0.0 {
metrics::histogram!("cortex_tokens_per_second", &self.labels)
.record(completion as f64 / secs);
}
}
}
/// Pass-through stream wrapper that feeds [`TokenMetrics`]. Emits on
/// clean end-of-stream; the Drop impl covers client disconnects.
struct TokenMetricsStream {
inner: BoxStream<'static, Result<bytes::Bytes, reqwest::Error>>,
metrics: TokenMetrics,
}
impl TokenMetricsStream {
fn new(
inner: BoxStream<'static, Result<bytes::Bytes, reqwest::Error>>,
metrics: TokenMetrics,
) -> Self {
Self { inner, metrics }
}
}
impl Stream for TokenMetricsStream {
type Item = Result<bytes::Bytes, reqwest::Error>;
fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
let this = self.get_mut();
match this.inner.as_mut().poll_next(cx) {
Poll::Ready(Some(Ok(chunk))) => {
this.metrics.observe(&chunk);
Poll::Ready(Some(Ok(chunk)))
}
Poll::Ready(Some(Err(e))) => Poll::Ready(Some(Err(e))),
Poll::Ready(None) => {
this.metrics.finish();
Poll::Ready(None)
}
Poll::Pending => Poll::Pending,
}
}
}
impl Drop for TokenMetricsStream {
fn drop(&mut self) {
self.metrics.finish();
}
}
#[cfg(test)]
mod tests {
use super::last_count_for;
#[test]
fn extracts_counts_from_final_sse_usage_chunk() {
let tail = concat!(
"data: {\"choices\":[{\"delta\":{\"content\":\"hi\"}}]}\n\n",
"data: {\"choices\":[],\"usage\":{\"prompt_tokens\":225,",
"\"completion_tokens\":42,\"total_tokens\":267}}\n\n",
"data: [DONE]\n\n"
);
assert_eq!(last_count_for(tail, "prompt_tokens"), Some(225));
assert_eq!(last_count_for(tail, "completion_tokens"), Some(42));
}
#[test]
fn extracts_counts_from_non_streaming_body() {
let tail = "{\"choices\":[{\"message\":{\"content\":\"hi\"}}],\
\"usage\":{\"prompt_tokens\": 12, \"completion_tokens\": 7}}";
assert_eq!(last_count_for(tail, "prompt_tokens"), Some(12));
assert_eq!(last_count_for(tail, "completion_tokens"), Some(7));
}
#[test]
fn ignores_details_variants_and_takes_last_occurrence() {
// completion_tokens_details must not shadow completion_tokens,
// and the LAST usage object wins (matters when content echoes
// a usage-shaped string earlier in the stream).
let tail = concat!(
"data: {\"usage\":{\"completion_tokens\":1}}\n\n",
"data: {\"usage\":{\"completion_tokens\":99,",
"\"completion_tokens_details\":{\"reasoning_tokens\":3}}}\n\n"
);
assert_eq!(last_count_for(tail, "completion_tokens"), Some(99));
}
#[test]
fn absent_keys_yield_none() {
assert_eq!(
last_count_for("data: [DONE]\n\n", "completion_tokens"),
None
);
assert_eq!(last_count_for("", "prompt_tokens"), None);
// key present but non-numeric value
assert_eq!(
last_count_for("\"completion_tokens\": null", "completion_tokens"),
None
);
}
}
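Two pieces of arithmetic in the token-metrics code above are easy to check in isolation: the choice of denominator for tokens per second, and the char-boundary-safe trimming of the retained tail. The sketch below restates both as free functions. The names `tokens_per_second`, `trim_tail` and `MIN_DECODE_WINDOW_SECS` are hypothetical; the 0.1 s threshold and the keep-the-newest-half rule are taken from the diff.

```rust
use std::time::Duration;

/// Windows shorter than this are treated as single-chunk (non-streaming)
/// bodies. The value matches the threshold used in the diff.
const MIN_DECODE_WINDOW_SECS: f64 = 0.1;

/// Rate over the decode window (first to last chunk) when it is long
/// enough, otherwise over the full request duration. Returns `None`
/// when there is nothing meaningful to record.
pub fn tokens_per_second(
    completion_tokens: u64,
    decode_window: Duration,
    full_request: Duration,
) -> Option<f64> {
    if completion_tokens == 0 {
        return None;
    }
    let window = decode_window.as_secs_f64();
    let secs = if window >= MIN_DECODE_WINDOW_SECS {
        window
    } else {
        full_request.as_secs_f64()
    };
    (secs > 0.0).then(|| completion_tokens as f64 / secs)
}

/// Once `tail` exceeds `cap` bytes, keep roughly the newest `cap / 2`
/// bytes, moving the cut forward to the next UTF-8 char boundary.
pub fn trim_tail(tail: &mut String, cap: usize) {
    if tail.len() <= cap {
        return;
    }
    let mut cut = tail.len() - cap / 2;
    while !tail.is_char_boundary(cut) {
        cut += 1;
    }
    tail.drain(..cut);
}
```

For example, 42 completion tokens over a 2 s decode window give 21 tokens per second, while a single-chunk body with a near-zero window falls back to the full request duration. Advancing the cut to a char boundary matters because `String::drain` panics when the range ends inside a multi-byte character.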


@@ -56,6 +56,59 @@ pub enum RouteError {
node: String,
message: String,
},
#[error(
"model '{model_id}' is recovering on node '{node}' (device context rebuild in progress) — retry shortly"
)]
ModelRecovering { model_id: String, node: String },
}
impl RouteError {
/// HTTP status the gateway should answer with. `NoHealthyNodes` and
/// `ModelRecovering` are the transient cases (503 service_unavailable,
/// safe to retry the same request); everything else is 404.
pub fn http_status(&self) -> u16 {
match self {
RouteError::NoHealthyNodes | RouteError::ModelRecovering { .. } => 503,
_ => 404,
}
}
/// Broad OpenAI error category for the JSON envelope.
pub fn broad_type(&self) -> &'static str {
match self {
RouteError::ModelNotFound(_) => "invalid_request_error",
RouteError::NoHealthyNodes
| RouteError::EndpointResolveFailed(_, _)
| RouteError::NoFeasibleNeuron { .. }
| RouteError::ColdLoadFailed { .. }
| RouteError::ModelRecovering { .. } => "api_error",
}
}
/// Specific machine-readable error code.
pub fn code(&self) -> &'static str {
match self {
RouteError::ModelNotFound(_) => "model_not_found",
RouteError::NoHealthyNodes => "service_unavailable",
RouteError::EndpointResolveFailed(_, _) => "service_unavailable",
RouteError::NoFeasibleNeuron { .. } => "service_unavailable",
RouteError::ColdLoadFailed { .. } => "service_unavailable",
RouteError::ModelRecovering { .. } => "service_unavailable",
}
}
/// Seconds to advertise in `Retry-After` for the transient variants
/// (#63). `NoHealthyNodes` may clear once the poller re-marks a node
/// healthy; `ModelRecovering` clears once the device context finishes
/// rebuilding — both are safe to retry. Everything else is permanent
/// for this request (404) and carries no hint.
pub fn retry_after_secs(&self) -> Option<u64> {
match self {
RouteError::ModelRecovering { .. } => Some(2),
RouteError::NoHealthyNodes => Some(5),
_ => None,
}
}
}
/// Resolve which node should serve a request for the given model.
@@ -76,11 +129,12 @@ pub async fn resolve(
"alias resolved"
);
}
// Snapshot loaded / unloaded state from the poller cache.
let (loaded_route, unloaded_route, any_healthy) = {
// Snapshot loaded / unloaded / recovering state from the poller cache.
let (loaded_route, unloaded_route, recovering_node, any_healthy) = {
let nodes = fleet.nodes.read().await;
let mut loaded_route = None;
let mut unloaded_route = None;
let mut recovering_node = None;
let mut any_healthy = false;
for node in nodes.values() {
if !node.healthy {
@@ -98,6 +152,17 @@ pub async fn resolve(
unloaded_route = Some((node.name.clone(), node.endpoint.clone(), true));
}
}
// Auto-recovering (#17/#20): the model is rebuilding
// its device context on this node. Hold the route —
// answer "retry shortly" rather than 404, and do NOT
// fall through to the catalogue cold-load, which
// would race a second placement (and a second copy's
// worth of VRAM) against the in-flight recovery.
ModelStatus::Recovering => {
if recovering_node.is_none() {
recovering_node = Some(node.name.clone());
}
}
// Loading is gateway-synthesised from neuron's
// activation snapshot; it never appears on the
// wire from neuron's `/models`. Skip — the model
@@ -110,7 +175,7 @@ pub async fn resolve(
}
}
}
(loaded_route, unloaded_route, any_healthy)
(loaded_route, unloaded_route, recovering_node, any_healthy)
};
if !any_healthy {
@@ -122,12 +187,20 @@ pub async fn resolve(
return finish(fleet, &node_name, &neuron_endpoint, model_id, cold_start).await;
}
// Priority 2: known to neuron but unloaded (neuron's lazy load).
// Priority 2: recovering somewhere — transient hold, not a reroute.
if let Some(node) = recovering_node {
return Err(RouteError::ModelRecovering {
model_id: model_id.to_string(),
node,
});
}
// Priority 3: known to neuron but unloaded (neuron's lazy load).
if let Some((node_name, neuron_endpoint, cold_start)) = unloaded_route {
return finish(fleet, &node_name, &neuron_endpoint, model_id, cold_start).await;
}
// Priority 3: catalogue × topology cold-load.
// Priority 4: catalogue × topology cold-load.
if let Some(profile) = fleet.catalogue.get(model_id) {
let (node_name, neuron_endpoint) = pick_feasible_neuron(fleet, profile).await?;
cold_load(fleet, &node_name, &neuron_endpoint, profile).await?;
@@ -244,6 +317,10 @@ async fn cold_load(
status: ModelStatus::Loaded,
last_accessed: Some(chrono::Utc::now()),
vram_estimate_mb: profile.vram_mb,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
@@ -292,7 +369,7 @@ async fn profile_to_spec(
};
ModelSpec {
model_id: profile.id.clone(),
model_id: qualified_model_id(profile),
harness: profile.harness.clone(),
quant: profile.quant.clone(),
tensor_parallel,
@@ -300,6 +377,22 @@ async fn profile_to_spec(
}
}
/// Prefix the catalogue id with the scheme when one is declared, so
/// neuron resolves the load against the right registry. Without this,
/// a profile pointing at the helexa registry would resolve via
/// neuron's `default_source` (typically `huggingface`) and fetch
/// bytes from the wrong place. Profiles that omit `source` continue
/// to pass the bare id through, preserving the pre-Phase-3 contract.
///
/// Stays at module scope (not nested in `profile_to_spec`) so the unit
/// tests can exercise it without spinning up CortexState topology.
fn qualified_model_id(profile: &ModelProfile) -> String {
match profile.source.as_deref() {
Some(scheme) if !scheme.is_empty() => format!("{scheme}:{}", profile.id),
_ => profile.id.clone(),
}
}
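The routing order that `resolve` follows after this change (health check first, then loaded, recovering, unloaded, catalogue) can be sketched as a pure function over a snapshot. Everything below is a hypothetical model for illustration: `Snapshot`, `Decision` and `decide` do not exist in the codebase, and the final `NotFound` fallback is an assumption, since that branch is not visible in the hunks shown.

```rust
/// Hypothetical outcome of a routing decision.
#[derive(Debug, PartialEq)]
pub enum Decision {
    NoHealthyNodes,
    ServeLoaded(String),
    HoldRecovering(String),
    LazyLoad(String),
    ColdLoad,
    NotFound,
}

/// Hypothetical flattened view of what the poller cache knows about one model.
#[derive(Debug, Default)]
pub struct Snapshot {
    pub any_healthy: bool,
    pub loaded_on: Option<String>,
    pub recovering_on: Option<String>,
    pub unloaded_on: Option<String>,
    pub in_catalogue: bool,
}

/// Apply the priorities in order; the first match wins.
pub fn decide(s: &Snapshot) -> Decision {
    if !s.any_healthy {
        return Decision::NoHealthyNodes;
    }
    // Priority 1: already loaded somewhere.
    if let Some(node) = &s.loaded_on {
        return Decision::ServeLoaded(node.clone());
    }
    // Priority 2: recovering. Hold the route instead of placing a second copy.
    if let Some(node) = &s.recovering_on {
        return Decision::HoldRecovering(node.clone());
    }
    // Priority 3: known to a neuron but unloaded (lazy load).
    if let Some(node) = &s.unloaded_on {
        return Decision::LazyLoad(node.clone());
    }
    // Priority 4: catalogue cold-load.
    if s.in_catalogue {
        return Decision::ColdLoad;
    }
    // Assumed fallback; not shown in the diff.
    Decision::NotFound
}
```

Placing the recovering check ahead of both the lazy-load and cold-load branches is what prevents a second placement from racing the in-flight recovery for VRAM.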
/// Resolve neuron's `/models/{id}/endpoint` to its inference URL and
/// build the final `RouteDecision`. Shared by all three priority
/// branches above.
@@ -375,7 +468,46 @@ fn rewrite_loopback_host(inference_url: &str, neuron_endpoint: &str) -> Option<S
#[cfg(test)]
mod tests {
use super::rewrite_loopback_host;
use super::{ModelProfile, qualified_model_id, rewrite_loopback_host};
fn bare_profile(id: &str, source: Option<&str>) -> ModelProfile {
ModelProfile {
id: id.into(),
harness: "candle".into(),
quant: None,
vram_mb: None,
min_devices: 1,
min_device_vram_mb: None,
pinned_on: vec![],
source: source.map(String::from),
limit: None,
cost: None,
capabilities: vec![],
}
}
#[test]
fn qualified_id_passes_through_when_source_absent() {
let p = bare_profile("Qwen/Qwen3-30B", None);
assert_eq!(qualified_model_id(&p), "Qwen/Qwen3-30B");
}
#[test]
fn qualified_id_prefixes_when_source_set() {
let p = bare_profile("Helexa/Qwen3.6-27B-Uncensored", Some("helexa"));
assert_eq!(
qualified_model_id(&p),
"helexa:Helexa/Qwen3.6-27B-Uncensored"
);
}
#[test]
fn qualified_id_passes_through_when_source_is_empty_string() {
// An empty scheme is treated as absent — neuron's default_source
// substitution kicks in.
let p = bare_profile("Qwen/Qwen3-30B", Some(""));
assert_eq!(qualified_model_id(&p), "Qwen/Qwen3-30B");
}
#[test]
fn rewrites_localhost_keeps_port_and_path() {


@@ -74,6 +74,10 @@ async fn test_alias_resolves_in_chat_completions() {
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
@@ -154,6 +158,10 @@ async fn test_aliases_surface_in_v1_models() {
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(2000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
@@ -235,6 +243,10 @@ async fn test_alias_falls_through_for_unmapped_model() {
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}


@@ -123,3 +123,212 @@ async fn test_anthropic_invalid_request() {
assert_eq!(resp.status(), 400);
}
/// Tool round-trip: an Anthropic `/v1/messages` request carrying tools
/// (the Claude Code shape: `{name, description, input_schema}`) must
/// reach the upstream neuron reshaped into OpenAI function-tool form,
/// and tool history (`tool_use` / `tool_result` blocks) must become
/// `tool_calls` / `role:"tool"` messages. This is the fix for the
/// failure where the model received malformed tool defs and improvised
/// an unparseable `<tool_use_name>` format.
#[tokio::test]
async fn test_anthropic_tools_reshaped_for_upstream() {
let (mock_url, captured) = common::spawn_capturing_mock_neuron().await;
let gw_url = common::spawn_gateway(&mock_url).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/messages"))
.header("content-type", "application/json")
.json(&json!({
"model": "test-model",
"max_tokens": 100,
"tools": [{
"name": "Read",
"description": "Read a file from disk",
"input_schema": {
"type": "object",
"properties": {"path": {"type": "string"}},
"required": ["path"]
}
}],
"tool_choice": {"type": "auto"},
"messages": [
{"role": "user", "content": "read /etc/hosts"},
{"role": "assistant", "content": [
{"type": "text", "text": "Reading it."},
{"type": "tool_use", "id": "toolu_42", "name": "Read",
"input": {"path": "/etc/hosts"}}
]},
{"role": "user", "content": [
{"type": "tool_result", "tool_use_id": "toolu_42",
"content": "127.0.0.1 localhost"}
]}
]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), 200);
let forwarded = {
let guard = captured.lock().unwrap();
guard.last().cloned().expect("upstream received a request")
};
// Tool definitions reshaped to OpenAI function form.
let tools = forwarded["tools"].as_array().expect("tools array");
assert_eq!(tools[0]["type"], "function");
assert_eq!(tools[0]["function"]["name"], "Read");
assert_eq!(
tools[0]["function"]["parameters"]["properties"]["path"]["type"],
"string"
);
assert!(tools[0]["function"].get("input_schema").is_none());
// tool_choice mapped.
assert_eq!(forwarded["tool_choice"], "auto");
// Message history: user, assistant(+tool_calls), tool, user.
let msgs = forwarded["messages"].as_array().expect("messages array");
let assistant = msgs
.iter()
.find(|m| m["role"] == "assistant")
.expect("assistant turn");
assert_eq!(assistant["tool_calls"][0]["id"], "toolu_42");
assert_eq!(assistant["tool_calls"][0]["function"]["name"], "Read");
// arguments is the parsed object, not a JSON string — the Qwen3.6
// chat template iterates `tool_call.arguments | items`.
assert_eq!(
assistant["tool_calls"][0]["function"]["arguments"],
json!({"path": "/etc/hosts"})
);
let tool_msg = msgs
.iter()
.find(|m| m["role"] == "tool")
.expect("tool result turn");
assert_eq!(tool_msg["tool_call_id"], "toolu_42");
assert_eq!(tool_msg["content"], "127.0.0.1 localhost");
}
/// #24: a streaming Anthropic request gets a translated Anthropic SSE
/// stream — not raw OpenAI frames. Verifies the full event sequence,
/// text reassembly, and the content type.
#[tokio::test]
async fn test_anthropic_streaming_sse_translation() {
let mock_url =
common::spawn_streaming_mock_neuron(4, std::time::Duration::from_millis(20)).await;
let gw_url = common::spawn_gateway(&mock_url).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/messages"))
.header("content-type", "application/json")
.json(&json!({
"model": "test-model",
"max_tokens": 64,
"stream": true,
"messages": [{"role": "user", "content": "Hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), 200);
assert!(
resp.headers()
.get("content-type")
.and_then(|v| v.to_str().ok())
.unwrap_or("")
.starts_with("text/event-stream"),
"anthropic stream must be SSE"
);
let body = resp.text().await.expect("stream should complete");
assert!(
!body.contains("chat.completion.chunk"),
"raw OpenAI frames must not leak through:\n{body}"
);
let event_names: Vec<&str> = body
.lines()
.filter_map(|l| l.strip_prefix("event: "))
.collect();
assert_eq!(
event_names,
vec![
"message_start",
"content_block_start",
"content_block_delta",
"content_block_delta",
"content_block_delta",
"content_block_delta",
"content_block_stop",
"message_delta",
"message_stop",
],
"unexpected event sequence:\n{body}"
);
// Reassemble the text deltas: the mock emits token0..token3.
let text: String = body
.lines()
.filter_map(|l| l.strip_prefix("data: "))
.filter_map(|d| serde_json::from_str::<serde_json::Value>(d).ok())
.filter(|v| v["type"] == "content_block_delta")
.filter_map(|v| v["delta"]["text"].as_str().map(String::from))
.collect();
assert_eq!(text, "token0token1token2token3");
// The mock sends no finish_reason — stop_reason defaults to
// end_turn, and output_tokens falls back to the delta count.
let message_delta = body
.lines()
.filter_map(|l| l.strip_prefix("data: "))
.filter_map(|d| serde_json::from_str::<serde_json::Value>(d).ok())
.find(|v| v["type"] == "message_delta")
.expect("message_delta event present");
assert_eq!(message_delta["delta"]["stop_reason"], "end_turn");
assert_eq!(message_delta["usage"]["output_tokens"], 4);
}
/// #24: an upstream usage frame (stream_options include_usage shape)
/// rides into message_delta as input/output token counts.
#[tokio::test]
async fn test_anthropic_streaming_usage_propagation() {
let mock_url = common::spawn_streaming_mock_neuron_with_usage(
3,
std::time::Duration::from_millis(10),
225,
42,
)
.await;
let gw_url = common::spawn_gateway(&mock_url).await;
let client = reqwest::Client::new();
let body = client
.post(format!("{gw_url}/v1/messages"))
.header("content-type", "application/json")
.json(&json!({
"model": "test-model",
"max_tokens": 64,
"stream": true,
"messages": [{"role": "user", "content": "Hi"}]
}))
.send()
.await
.expect("request should succeed")
.text()
.await
.expect("stream should complete");
let message_delta = body
.lines()
.filter_map(|l| l.strip_prefix("data: "))
.filter_map(|d| serde_json::from_str::<serde_json::Value>(d).ok())
.find(|v| v["type"] == "message_delta")
.expect("message_delta event present");
assert_eq!(message_delta["usage"]["output_tokens"], 42);
assert_eq!(message_delta["usage"]["input_tokens"], 225);
}
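The streaming tests above pull event names and data payloads out of the raw SSE body with `strip_prefix`. A small sketch of that extraction as reusable helpers follows. The function names are hypothetical, and the helpers assume the `event: ` and `data: ` prefixes always carry a single space after the colon, as the bodies in these tests do; the SSE format also permits the space to be omitted, which these helpers would miss.

```rust
/// Collect the value of every `event: <name>` line, in order.
pub fn sse_event_names(body: &str) -> Vec<&str> {
    body.lines()
        .filter_map(|line| line.strip_prefix("event: "))
        .collect()
}

/// Collect the raw payload of every `data: <payload>` line, in order.
pub fn sse_data_payloads(body: &str) -> Vec<&str> {
    body.lines()
        .filter_map(|line| line.strip_prefix("data: "))
        .collect()
}
```

Comparing the full ordered list of event names, as the test does, catches missing, duplicated and reordered events in one assertion.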


@@ -54,9 +54,64 @@ pub async fn spawn_mock_neuron() -> String {
base_url
}
/// Like [`spawn_mock_neuron`] but captures the JSON body of every
/// `POST /v1/chat/completions` it receives into the returned handle, so
/// a test can assert what the gateway *actually forwarded upstream*
/// (e.g. that Anthropic-shaped tools were reshaped to OpenAI form).
pub async fn spawn_capturing_mock_neuron() -> (String, Arc<std::sync::Mutex<Vec<Value>>>) {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
let base_url = format!("http://{addr}");
let inference_url = base_url.clone();
let captured: Arc<std::sync::Mutex<Vec<Value>>> = Arc::new(std::sync::Mutex::new(Vec::new()));
let sink = captured.clone();
let app = Router::new()
.route("/models", get(mock_neuron_list_models))
.route(
"/models/{model_id}/endpoint",
get(move |Path(_): Path<String>| {
let url = inference_url.clone();
async move { Json(json!({"url": url})) }
}),
)
.route(
"/v1/chat/completions",
post(move |Json(body): Json<Value>| {
let sink = sink.clone();
async move {
let model = body
.get("model")
.and_then(|v| v.as_str())
.unwrap_or("unknown");
let resp = json!({
"id": "chatcmpl-capture-001",
"object": "chat.completion",
"created": 1700000000_u64,
"model": model,
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "Hello from mock backend"},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15}
});
sink.lock().unwrap().push(body);
Json(resp)
}
}),
);
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(base_url, captured)
}
async fn mock_neuron_list_models() -> Json<Value> {
Json(json!([
{"id": "test-model", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": 8000}
{"id": "test-model", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": 8000, "capabilities": ["text"], "tool_call": false, "reasoning": false}
]))
}
@@ -196,6 +251,91 @@ pub async fn spawn_streaming_mock_neuron(chunk_count: usize, chunk_delay: Durati
base_url
}
/// Like `spawn_streaming_mock_neuron`, but the stream ends with an
/// OpenAI `stream_options.include_usage`-style final chunk (empty
/// choices + usage object) before `[DONE]` — the shape the gateway's
/// token metrics (#21) extract counts from.
pub async fn spawn_streaming_mock_neuron_with_usage(
chunk_count: usize,
chunk_delay: Duration,
prompt_tokens: u64,
completion_tokens: u64,
) -> String {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
let base_url = format!("http://{addr}");
let inference_url = base_url.clone();
let app = Router::new()
.route("/models", get(mock_neuron_list_models))
.route(
"/models/{model_id}/endpoint",
get(move |Path(_model_id): Path<String>| {
let url = inference_url.clone();
async move { Json(json!({"url": url})) }
}),
)
.route(
"/v1/chat/completions",
post(move |Json(body): Json<Value>| async move {
let model = body
.get("model")
.and_then(|v| v.as_str())
.unwrap_or("unknown")
.to_string();
let mut chunks: Vec<String> = (0..chunk_count)
.map(|i| {
let chunk = json!({
"id": "chatcmpl-stream-002",
"object": "chat.completion.chunk",
"created": 1700000000_u64,
"model": model,
"choices": [{
"index": 0,
"delta": { "content": format!("token{i}") },
"finish_reason": null
}]
});
format!("data: {chunk}\n\n")
})
.collect();
let usage_chunk = json!({
"id": "chatcmpl-stream-002",
"object": "chat.completion.chunk",
"created": 1700000000_u64,
"model": model,
"choices": [],
"usage": {
"prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens
}
});
chunks.push(format!("data: {usage_chunk}\n\n"));
chunks.push("data: [DONE]\n\n".to_string());
let delay = chunk_delay;
let stream = stream::iter(chunks).then(move |chunk| async move {
tokio::time::sleep(delay).await;
Ok::<_, std::convert::Infallible>(chunk)
});
Response::builder()
.header(header::CONTENT_TYPE, "text/event-stream")
.header(header::CACHE_CONTROL, "no-cache")
.body(Body::from_stream(stream))
.unwrap()
}),
);
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
base_url
}
/// Spawns a mock neuron with a custom models list.
pub async fn spawn_mock_neuron_with_models(models_response: Value) -> String {
spawn_mock_neuron_with_models_and_health(models_response, default_health_response()).await
@@ -305,6 +445,10 @@ pub async fn spawn_gateway_with_state(mock_url: &str) -> (Arc<CortexState>, Stri
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}


@@ -0,0 +1,139 @@
mod common;
use serde_json::json;
#[tokio::test]
async fn error_response_model_not_found() {
let neuron_url = common::spawn_mock_neuron().await;
let gateway_url = common::spawn_gateway(&neuron_url).await;
let client = reqwest::Client::new();
// Request a model that isn't loaded on the mock neuron.
let resp = client
.post(format!("{gateway_url}/v1/chat/completions"))
.header("Content-Type", "application/json")
.json(&json!({
"model": "nonexistent-model",
"messages": [{"role": "user", "content": "hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), axum::http::StatusCode::NOT_FOUND);
let body: serde_json::Value = resp.json().await.expect("valid json");
let err = body.get("error").expect("response has error object");
// Broad type categorization
assert_eq!(err.get("type").unwrap(), "invalid_request_error");
// Specific machine-readable code
assert_eq!(
err.get("code").unwrap().as_str().unwrap(),
"model_not_found"
);
// param is always null
assert!(err.get("param").unwrap().is_null());
}
#[tokio::test]
async fn error_response_missing_model_field() {
let neuron_url = common::spawn_mock_neuron().await;
let gateway_url = common::spawn_gateway(&neuron_url).await;
let client = reqwest::Client::new();
// Request without the required `model` field.
let resp = client
.post(format!("{gateway_url}/v1/chat/completions"))
.header("Content-Type", "application/json")
.json(&json!({
"messages": [{"role": "user", "content": "hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), axum::http::StatusCode::BAD_REQUEST);
let body: serde_json::Value = resp.json().await.expect("valid json");
let err = body.get("error").expect("response has error object");
assert_eq!(err.get("type").unwrap(), "invalid_request_error");
assert_eq!(
err.get("code").unwrap().as_str().unwrap(),
"missing_model_field"
);
assert!(err.get("param").unwrap().is_null());
}
#[tokio::test]
async fn error_response_no_healthy_nodes() {
use cortex_core::config::{EvictionSettings, GatewayConfig, GatewaySettings, NeuronEndpoint};
use std::sync::Arc;
// Create a gateway config with a neuron pointing at an unreachable port so no node is ever healthy.
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: cortex_core::config::EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "dead-node".into(),
endpoint: "http://127.0.0.1:1".into(),
}],
models_config: "/dev/null".into(),
};
let fleet = Arc::new(cortex_gateway::state::CortexState::from_config(&config));
let app = cortex_gateway::build_app(fleet);
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
// Allow the poller a moment to mark the node unhealthy.
tokio::time::sleep(std::time::Duration::from_millis(200)).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("http://{addr}/v1/chat/completions"))
.header("Content-Type", "application/json")
.json(&json!({
"model": "any-model",
"messages": [{"role": "user", "content": "hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), axum::http::StatusCode::SERVICE_UNAVAILABLE);
// Transient 503 — the gateway advertises Retry-After so OpenAI-compatible
// clients back off and retry rather than surfacing an opaque error (#63).
let retry_after = resp
.headers()
.get(reqwest::header::RETRY_AFTER)
.expect("transient 503 must carry Retry-After")
.to_str()
.unwrap()
.to_string();
assert_eq!(retry_after, "5");
let body: serde_json::Value = resp.json().await.expect("valid json");
let err = body.get("error").expect("response has error object");
assert_eq!(err.get("type").unwrap(), "api_error");
assert_eq!(
err.get("code").unwrap().as_str().unwrap(),
"service_unavailable"
);
assert!(err.get("param").unwrap().is_null());
}
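
The tests above pin the envelope fields for each rejection. A minimal sketch of one shared mapping from rejection to envelope follows. The `Rejection` and `Envelope` names and the `render_body` helper are hypothetical and are not taken from the gateway source; only the status codes, the `type` / `code` strings, the null `param` and the `Retry-After: 5` value come from the assertions.

```rust
/// Hypothetical rejection kinds; the gateway's real type is not shown in this diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Rejection {
    MissingModelField,
    NoHealthyNodes,
}

/// The parts of the OpenAI-style error envelope that the tests assert on.
#[derive(Debug, PartialEq, Eq)]
struct Envelope {
    status: u16,
    error_type: &'static str,
    code: &'static str,
    /// Seconds to advertise in `Retry-After`; `None` for non-retryable errors.
    retry_after_secs: Option<u64>,
}

impl Rejection {
    /// Single source of truth: every handler asks here instead of
    /// hand-building the envelope inline.
    fn envelope(self) -> Envelope {
        match self {
            Rejection::MissingModelField => Envelope {
                status: 400,
                error_type: "invalid_request_error",
                code: "missing_model_field",
                retry_after_secs: None,
            },
            Rejection::NoHealthyNodes => Envelope {
                status: 503,
                error_type: "api_error",
                code: "service_unavailable",
                retry_after_secs: Some(5),
            },
        }
    }
}

/// Render the JSON body. Only `\` and `"` are escaped, which is enough for a
/// sketch; real code would serialize with serde_json instead.
fn render_body(env: &Envelope, message: &str) -> String {
    let escaped = message.replace('\\', "\\\\").replace('"', "\\\"");
    format!(
        r#"{{"error":{{"message":"{}","type":"{}","code":"{}","param":null}}}}"#,
        escaped, env.error_type, env.code
    )
}
```

Keeping the retry hint next to the status and code means a retryable rejection cannot be added without deciding its `Retry-After` value.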

@@ -91,6 +91,10 @@ async fn test_evict_lru_model() {
status: ModelStatus::Loaded,
last_accessed: Some(Utc::now() - chrono::Duration::hours(2)),
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
node.models.insert(
@@ -100,6 +104,10 @@ async fn test_evict_lru_model() {
status: ModelStatus::Loaded,
last_accessed: Some(Utc::now()),
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
@@ -163,6 +171,10 @@ async fn test_eviction_increments_lifecycle_cycles() {
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}

@@ -1,20 +1,26 @@
mod common;
use serde_json::json;
use std::sync::OnceLock;
/// The metrics recorder is a process-wide global; both tests in this
/// binary run against one shared install. Assertions must therefore be
/// order-independent (presence of names / monotonic counters, not
/// "empty before").
fn recorder() -> &'static metrics_exporter_prometheus::PrometheusHandle {
static HANDLE: OnceLock<metrics_exporter_prometheus::PrometheusHandle> = OnceLock::new();
HANDLE.get_or_init(|| {
cortex_gateway::metrics::install_test_recorder().expect("recorder should install")
})
}
#[tokio::test]
async fn test_metrics_emitted_after_proxy() {
-let handle = cortex_gateway::metrics::install_test_recorder().expect("recorder should install");
+let handle = recorder();
let mock_url = common::spawn_mock_neuron().await;
let gw_url = common::spawn_gateway(&mock_url).await;
-let before = handle.render();
-assert!(
-!before.contains("cortex_requests_total"),
-"no request metrics before any requests"
-);
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/chat/completions"))
@@ -44,3 +50,72 @@ async fn test_metrics_emitted_after_proxy() {
"no errors expected for a successful request"
);
}
#[tokio::test]
async fn test_token_metrics_emitted_for_streamed_request() {
// #21: a streamed chat completion with a final usage chunk must
// produce TTFT + tok/s histograms and prompt/completion token
// counters, labelled with model and node. The recorder is global
// per-process, and cargo builds one integration binary per file, so
// it is shared with test_metrics_emitted_after_proxy in this file.
// Whichever test initialises the recorder first, both render from
// the same one, so assert on delta-able names rather than exact
// totals.
let handle = recorder();
let mock_url = common::spawn_streaming_mock_neuron_with_usage(
5,
std::time::Duration::from_millis(40),
225,
42,
)
.await;
let gw_url = common::spawn_gateway(&mock_url).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/chat/completions"))
.header("content-type", "application/json")
.json(&json!({
"model": "test-model",
"messages": [{"role": "user", "content": "Hi"}],
"stream": true
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), 200);
let body = resp.text().await.expect("stream should complete");
assert!(body.contains("[DONE]"));
let rendered = handle.render();
for needle in [
"cortex_time_to_first_token_seconds",
"cortex_tokens_per_second",
] {
assert!(
rendered.contains(needle),
"{needle} should be present.\nMetrics:\n{rendered}"
);
}
// The recorder is shared with the sibling test (same model/node
// labels), so counters are lower bounds, not exact values: this
// request contributed prompt=225 / completion=42.
let counter_value = |name: &str| -> u64 {
rendered
.lines()
.find(|l| l.starts_with(name) && l.contains(r#"model="test-model""#))
.and_then(|l| l.rsplit(' ').next())
.and_then(|v| v.parse().ok())
.unwrap_or_else(|| panic!("{name} should be present.\nMetrics:\n{rendered}"))
};
assert!(
counter_value("cortex_prompt_tokens_total") >= 225,
"prompt token counter should include this request's 225.\nMetrics:\n{rendered}"
);
assert!(
counter_value("cortex_completion_tokens_total") >= 42,
"completion token counter should include this request's 42.\nMetrics:\n{rendered}"
);
}
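
The `counter_value` closure above reads a counter out of the rendered Prometheus text and treats it as a lower bound. A standalone sketch of the same idea, assuming the usual `name{labels} value` sample lines; the function names and the `node` label key used in the usage note are illustrative, not taken from the gateway:

```rust
/// Return the value of the first sample line for `name` that carries
/// `label_pair` (for example `model="test-model"`). Comment lines such as
/// `# TYPE ...` are skipped.
fn counter_value(rendered: &str, name: &str, label_pair: &str) -> Option<u64> {
    rendered
        .lines()
        .filter(|l| !l.starts_with('#'))
        .find(|l| l.starts_with(name) && l.contains(label_pair))
        // The sample value is the last space-separated field on the line.
        .and_then(|l| l.rsplit(' ').next())
        .and_then(|v| v.parse().ok())
}

/// Assertion that stays valid with a shared recorder: the total must be at
/// least what this request contributed, not exactly equal to it.
fn includes_contribution(total: Option<u64>, contributed: u64) -> bool {
    total.map_or(false, |t| t >= contributed)
}
```

Given a sample line such as `cortex_prompt_tokens_total{model="test-model",node="mock-node"} 450`, `counter_value` yields `Some(450)`, and `includes_contribution(Some(450), 225)` holds even though a sibling test added to the same counter.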

@@ -0,0 +1,131 @@
//! Issue #62 / #67: `GET /v1/models` advertises a per-model serving budget so
//! an OpenAI-compatible client (opencode's helexa provider) can size and
//! compact its context without hand-configuration.
//!
//! Asserts the composition sources land on the response:
//! - `limit` from the neuron's self-derived value (#67) — NOT the catalogue;
//! an operator-declared catalogue `limit` is deliberately ignored.
//! - `cost` from the catalogue profile (operator-set pricing).
//! - `tool_call` / `reasoning` from the neuron's runtime detection (OR-ed in)
//!
//! Also a regression guard for the removal of `max_model_len` — the misnamed,
//! unconsumed vLLM-ism that this contract replaces.
use cortex_core::config::{
EvictionSettings, EvictionStrategy, GatewayConfig, GatewaySettings, NeuronEndpoint,
};
use cortex_core::harness::ModelLimit;
use cortex_core::node::{ModelEntry, ModelStatus};
use cortex_gateway::state::CortexState;
use std::sync::Arc;
use tokio::net::TcpListener;
#[tokio::test]
async fn v1_models_surfaces_limit_cost_and_capability_flags() {
// Catalogue declares pricing + an operator `limit` that must be IGNORED
// (#67): the neuron's self-derived limit is authoritative.
let models_toml = r#"
[[models]]
id = "test-model"
harness = "candle"
limit.context = 999999
limit.input = 999999
limit.output = 999999
cost.input = 0.0
cost.output = 0.0
capabilities = ["text"]
"#;
let cat_path = std::env::temp_dir().join("cortex_test_issue62_models.toml");
std::fs::write(&cat_path, models_toml).unwrap();
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
// Never contacted: build_app does not spawn the poller, so the
// seeded state below is authoritative for /v1/models.
endpoint: "http://127.0.0.1:1".into(),
}],
models_config: cat_path.to_string_lossy().into_owned(),
};
let fleet = Arc::new(CortexState::from_config(&config));
// Seed the model as loaded on the node with runtime-detected flags set —
// these must OR into the catalogue entry, not be lost.
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").expect("node exists");
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: vec!["text".into()],
tool_call: true,
reasoning: true,
// Neuron's self-derived limit (#67) — the authoritative
// source. Distinct from the catalogue's (ignored) values.
limit: Some(ModelLimit {
context: 49152,
input: Some(40960),
output: 8192,
}),
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let body: serde_json::Value = reqwest::Client::new()
.get(format!("http://{addr}/v1/models"))
.send()
.await
.unwrap()
.json()
.await
.unwrap();
let entry = body["data"]
.as_array()
.expect("data is an array")
.iter()
.find(|m| m["id"] == "test-model")
.expect("test-model present in /v1/models");
// `limit` is the neuron's self-derived value (#67), NOT the catalogue's
// (which declared 999999 and must be ignored). `cost` still flows from
// the catalogue.
assert_eq!(entry["limit"]["context"], 49152);
assert_eq!(entry["limit"]["input"], 40960);
assert_eq!(entry["limit"]["output"], 8192);
assert_eq!(entry["cost"]["input"], 0.0);
assert_eq!(entry["cost"]["output"], 0.0);
// Runtime-detected capability flags OR-ed in from the neuron's ModelEntry.
assert_eq!(entry["tool_call"], true);
assert_eq!(entry["reasoning"], true);
// Regression guard: the removed, unconsumed vLLM-ism must not reappear.
assert!(
entry.get("max_model_len").is_none(),
"max_model_len was removed; /v1/models must not advertise it"
);
let _ = std::fs::remove_file(&cat_path);
}
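
The composition rule this test guards can be sketched as a single pure function. The struct names, field types and the `compose` function below are assumptions for illustration; only the rule itself (limit from the neuron, cost from the catalogue, flags OR-ed) comes from the test and its module comment.

```rust
/// Same shape as asserted above: `input` is optional, the others are required.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ModelLimit {
    context: u64,
    input: Option<u64>,
    output: u64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Cost {
    input: f64,
    output: f64,
}

/// Hypothetical view of an operator catalogue profile.
struct CatalogueProfile {
    limit: Option<ModelLimit>,
    cost: Option<Cost>,
    tool_call: bool,
    reasoning: bool,
}

/// Hypothetical view of what a neuron reports for a loaded model.
struct NeuronReport {
    limit: Option<ModelLimit>,
    tool_call: bool,
    reasoning: bool,
}

#[derive(Debug, PartialEq)]
struct Advertised {
    limit: Option<ModelLimit>,
    cost: Option<Cost>,
    tool_call: bool,
    reasoning: bool,
}

fn compose(cat: &CatalogueProfile, neuron: &NeuronReport) -> Advertised {
    // The operator-declared limit is read here only to make the point that
    // it is deliberately discarded.
    let _ignored = cat.limit;
    Advertised {
        limit: neuron.limit,
        cost: cat.cost,
        // Runtime detection can only add a capability, never remove one.
        tool_call: cat.tool_call || neuron.tool_call,
        reasoning: cat.reasoning || neuron.reasoning,
    }
}
```

Keeping the rule in one function makes the precedence testable without spinning up a gateway.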

@@ -118,6 +118,87 @@ async fn test_poller_updates_gateway_models_endpoint() {
}
}
#[tokio::test]
async fn test_models_endpoint_unions_capabilities_across_nodes() {
// C3: two neurons each have the same model loaded but advertise
// different capability sets. The gateway's /v1/models must report
// the union — a model loaded text-only on one node and
// text+vision on another is vision-capable to the fleet.
let node_a = common::spawn_mock_neuron_with_models(json!([
{"id": "shared-model", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null, "capabilities": ["text"]}
]))
.await;
let node_b = common::spawn_mock_neuron_with_models(json!([
{"id": "shared-model", "harness": "candle", "status": "loaded", "devices": [1], "vram_used_mb": null, "capabilities": ["text", "vision"]}
]))
.await;
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![
NeuronEndpoint {
name: "node-a".into(),
endpoint: node_a,
},
NeuronEndpoint {
name: "node-b".into(),
endpoint: node_b,
},
],
models_config: "/dev/null".into(),
};
let fleet = Arc::new(CortexState::from_config(&config));
cortex_gateway::poller::poll_once(&fleet).await;
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let client = reqwest::Client::new();
let body: serde_json::Value = client
.get(format!("http://{addr}/v1/models"))
.send()
.await
.expect("request should succeed")
.json()
.await
.unwrap();
let model = body["data"]
.as_array()
.expect("data array")
.iter()
.find(|m| m["id"] == "shared-model")
.expect("shared-model should be present");
let caps: Vec<&str> = model["capabilities"]
.as_array()
.expect("capabilities array")
.iter()
.filter_map(|c| c.as_str())
.collect();
assert!(caps.contains(&"text"), "union must include text: {caps:?}");
assert!(
caps.contains(&"vision"),
"union must include vision: {caps:?}"
);
assert_eq!(caps.len(), 2, "union must not duplicate text: {caps:?}");
// Both nodes hold the model, so two locations regardless of caps.
assert_eq!(model["locations"].as_array().unwrap().len(), 2);
}
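
The union semantics asserted above (no duplicates, every capability seen on any node) can be sketched with a set. The `union_capabilities` helper is hypothetical, and the sorted output order is a property of `BTreeSet`, not something the diff shows the gateway guaranteeing.

```rust
use std::collections::BTreeSet;

/// Union the capability lists reported by each node that holds the model.
/// The set deduplicates, so "text" reported by two nodes appears once.
fn union_capabilities(per_node: &[Vec<&str>]) -> Vec<String> {
    let mut set = BTreeSet::new();
    for caps in per_node {
        for cap in caps {
            set.insert((*cap).to_string());
        }
    }
    // BTreeSet iterates in sorted order, giving a stable result.
    set.into_iter().collect()
}
```

For the two mock nodes in the test, the inputs `["text"]` and `["text", "vision"]` produce a two-element result.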
#[tokio::test]
async fn test_poller_marks_unreachable_node_unhealthy() {
let config = GatewayConfig {
@@ -216,6 +297,10 @@ async fn test_poller_removes_stale_models() {
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
node.models.insert(
@@ -225,6 +310,10 @@ async fn test_poller_removes_stale_models() {
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
@@ -292,3 +381,39 @@ async fn test_poller_captures_activation_from_health() {
assert_eq!(activation.in_progress.as_deref(), Some("Qwen/model-x"));
assert_eq!(activation.pending, vec!["Qwen/model-y".to_string()]);
}
#[tokio::test]
async fn test_poller_parses_recovering_status() {
// #20: a model auto-recovering on a neuron (poisoned → unload →
// reload, #17) is reported with status "recovering" and must land
// in gateway state as the dedicated Recovering status — not fall
// through the parser's catch-all to Loaded.
let mock_url = common::spawn_mock_neuron_with_models(json!([
{"id": "model-r", "harness": "candle", "status": "recovering", "devices": [0, 1], "vram_used_mb": null}
]))
.await;
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "test-node".into(),
endpoint: mock_url,
}],
models_config: "/dev/null".into(),
};
let fleet = Arc::new(CortexState::from_config(&config));
cortex_gateway::poller::poll_once(&fleet).await;
let nodes = fleet.nodes.read().await;
let node = nodes.get("test-node").unwrap();
let model_r = node.models.get("model-r").expect("model-r should exist");
assert_eq!(model_r.status, ModelStatus::Recovering);
}
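
The failure mode this test guards against is a parser whose catch-all arm maps unknown strings to `Loaded`. A pared-down sketch follows, assuming only the two variants named in the test; the real `ModelStatus` enum presumably has more, and `parse_status` is a hypothetical name.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModelStatus {
    Loaded,
    Recovering,
}

fn parse_status(raw: &str) -> ModelStatus {
    match raw {
        // Needs its own arm; otherwise it is swallowed by the catch-all.
        "recovering" => ModelStatus::Recovering,
        // Catch-all as described in the test comment: anything else reads as Loaded.
        _ => ModelStatus::Loaded,
    }
}
```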

@@ -139,7 +139,7 @@ async fn test_no_healthy_nodes() {
.await
.expect("request should succeed");
-assert_eq!(resp.status(), 404);
+assert_eq!(resp.status(), 503);
let body: serde_json::Value = resp.json().await.unwrap();
assert!(
@@ -171,3 +171,67 @@ async fn test_missing_model_field() {
let body: serde_json::Value = resp.json().await.unwrap();
assert!(body["error"]["message"].as_str().unwrap().contains("model"));
}
#[tokio::test]
async fn test_recovering_model_returns_503_and_stays_listed() {
// #20: while a model auto-recovers on a neuron, the gateway must
// hold the route — transient 503 ("retry shortly"), not the 404
// "not found on any node" that makes a recovering model look
// evicted — and keep listing it on /v1/models.
let mock_url = common::spawn_mock_neuron().await;
let (fleet, gw_url) = common::spawn_gateway_with_state(&mock_url).await;
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").expect("node must exist");
node.models.insert(
"recovering-model".into(),
cortex_core::node::ModelEntry {
id: "recovering-model".into(),
status: cortex_core::node::ModelStatus::Recovering,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/chat/completions"))
.header("content-type", "application/json")
.json(&json!({
"model": "recovering-model",
"messages": [{"role": "user", "content": "Hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), 503);
let body: serde_json::Value = resp.json().await.unwrap();
let message = body["error"]["message"].as_str().unwrap();
assert!(
message.contains("recovering") && message.contains("retry"),
"503 body must say recovering/retry, got: {message}"
);
// The model must still be visible on the unified models endpoint.
let models: serde_json::Value = client
.get(format!("{gw_url}/v1/models"))
.send()
.await
.expect("models request should succeed")
.json()
.await
.unwrap();
let listed = models["data"]
.as_array()
.unwrap()
.iter()
.any(|m| m["id"] == "recovering-model");
assert!(listed, "recovering model must stay listed on /v1/models");
}
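
The routing precedence these tests describe (no healthy node gives 503, a recovering model gives 503, a model unknown everywhere gives 404) can be sketched as below. `Node`, `Route` and `route` are hypothetical names, and the real router certainly weighs more than this (load, eviction, lazy loading); this is only an illustration of the ordering.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Loaded,
    Recovering,
}

/// Hypothetical, pared-down node record.
struct Node {
    name: String,
    healthy: bool,
    models: HashMap<String, Status>,
}

#[derive(Debug, PartialEq, Eq)]
enum Route {
    /// Forward the request to the named node.
    Proxy(String),
    /// 503 with a retryable reason.
    Unavailable(&'static str),
    /// 404: no node knows the model.
    NotFound,
}

fn route(nodes: &[Node], model: &str) -> Route {
    if !nodes.iter().any(|n| n.healthy) {
        return Route::Unavailable("no healthy nodes");
    }
    let mut recovering = false;
    for node in nodes.iter().filter(|n| n.healthy) {
        match node.models.get(model) {
            Some(Status::Loaded) => return Route::Proxy(node.name.clone()),
            // Remember it, but keep looking for a node that can serve now.
            Some(Status::Recovering) => recovering = true,
            None => {}
        }
    }
    if recovering {
        Route::Unavailable("model is recovering, retry shortly")
    } else {
        Route::NotFound
    }
}
```

Checking for a recovering copy before falling back to "not found" is what keeps a model that is mid-recovery from looking evicted.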

@@ -3,7 +3,7 @@ name = "helexa-acp"
version = "0.1.16"
edition = "2024"
license = "Apache-2.0"
-repository = "https://git.lair.cafe/helexa/cortex"
+repository = "https://git.lair.cafe/helexa/helexa"
description = """
Agent Client Protocol bridge for the helexa self-hosted LLM stack.
Speaks ACP to ACP-compatible editor clients (Zed, etc.) and forwards

@@ -58,8 +58,8 @@ one vendor's agent client.
### From source
```sh
-git clone https://git.lair.cafe/helexa/cortex.git
-cd cortex
+git clone https://git.lair.cafe/helexa/helexa.git
+cd helexa
cargo install --path crates/helexa-acp
# Binary lands at ~/.cargo/bin/helexa-acp
```
@@ -536,7 +536,7 @@ Cargo.toml-only.
## Contributing
-Repository: https://git.lair.cafe/helexa/cortex (`crates/helexa-acp/`).
+Repository: https://git.lair.cafe/helexa/helexa (`crates/helexa-acp/`).
Issues / PRs welcome. The canonical staged plan is in
`~/.claude/plans/plan-the-per-device-worker-abstract-micali.md` on
the maintainer's machine; the substages 3a–3e and 6a/6b that the

@@ -0,0 +1,41 @@
[package]
name = "helexa-bench"
version.workspace = true
edition.workspace = true
license.workspace = true
repository.workspace = true
[[bin]]
name = "helexa-bench"
path = "src/main.rs"
[dependencies]
cortex-core = { workspace = true }
tokio = { workspace = true }
reqwest = { workspace = true }
serde = { workspace = true }
serde_json = { workspace = true }
figment = { workspace = true }
anyhow = { workspace = true }
async-trait = { workspace = true }
clap = { workspace = true }
tracing = { workspace = true }
tracing-subscriber = { workspace = true }
chrono = { workspace = true }
futures = { workspace = true }
tokio-stream = { workspace = true }
eventsource-stream = { workspace = true }
# read-only JSON API (api.rs)
axum = { workspace = true }
tower-http = { workspace = true }
# SQLite system-of-record. `bundled` compiles SQLite from source so the
# binary has no libsqlite3 runtime dependency — matches the project's
# single-static-binary packaging.
rusqlite = { version = "0.32", features = ["bundled"] }
[dev-dependencies]
# Jail (isolated cwd + env) for config tests.
figment = { workspace = true, features = ["test"] }

@@ -0,0 +1,119 @@
//! Read-only JSON API over the bench SQLite store.
//!
//! Consumed by the `bench/` visualisation app and for programmatic
//! access. Served by the `run` daemon (alongside the sweep loop) and by
//! the standalone `serve` subcommand. CORS is permissive because the UI
//! is hosted separately (different origin); the API is internal-only
//! (WireGuard + firewalld) and read-only, so this predates the auth epic.
use crate::store::{RunFilter, Store};
use anyhow::Result;
use axum::Router;
use axum::extract::{Query, State};
use axum::http::StatusCode;
use axum::response::Json;
use axum::routing::get;
use serde::Deserialize;
use serde_json::json;
use std::sync::Arc;
use tokio::sync::Mutex;
use tower_http::cors::CorsLayer;
/// Shared API state: a dedicated read connection to the store, guarded
/// (rusqlite `Connection` isn't `Sync`). Separate from the sweep's
/// writer connection — WAL lets them run concurrently.
pub type ApiState = Arc<Mutex<Store>>;
/// Open an API state over the store at `db_path`.
pub fn open_state(db_path: &str) -> Result<ApiState> {
Ok(Arc::new(Mutex::new(Store::open(db_path)?)))
}
/// Build the API router.
pub fn api_routes(state: ApiState) -> Router {
Router::new()
.route("/api/health", get(health))
.route("/api/dimensions", get(dimensions))
.route("/api/summary", get(summary))
.route("/api/series", get(series))
.route("/api/runs", get(runs))
.layer(CorsLayer::permissive())
.with_state(state)
}
/// Bind `listen` and serve the API until the process exits.
pub async fn serve(listen: &str, state: ApiState) -> Result<()> {
let listener = tokio::net::TcpListener::bind(listen).await?;
tracing::info!(%listen, "bench API listening");
axum::serve(listener, api_routes(state)).await?;
Ok(())
}
type ApiError = (StatusCode, String);
fn err500(e: anyhow::Error) -> ApiError {
(StatusCode::INTERNAL_SERVER_ERROR, format!("{e:#}"))
}
async fn health(State(s): State<ApiState>) -> Result<Json<serde_json::Value>, ApiError> {
let store = s.lock().await;
let count = store.run_count().map_err(err500)?;
Ok(Json(json!({ "status": "ok", "run_count": count })))
}
async fn dimensions(State(s): State<ApiState>) -> Result<Json<crate::store::Dimensions>, ApiError> {
let store = s.lock().await;
store.dimensions().map(Json).map_err(err500)
}
async fn summary(
State(s): State<ApiState>,
) -> Result<Json<Vec<crate::store::ReportRow>>, ApiError> {
let store = s.lock().await;
store.summary().map(Json).map_err(err500)
}
#[derive(Debug, Deserialize)]
struct SeriesQuery {
/// Optional — when omitted the store resolves the host serving this model.
host: Option<String>,
model: String,
scenario: String,
}
async fn series(
State(s): State<ApiState>,
Query(q): Query<SeriesQuery>,
) -> Result<Json<Vec<crate::store::SeriesPoint>>, ApiError> {
let store = s.lock().await;
store
.series(q.host.as_deref(), &q.model, &q.scenario)
.map(Json)
.map_err(err500)
}
#[derive(Debug, Deserialize)]
struct RunsQuery {
host: Option<String>,
model: Option<String>,
scenario: Option<String>,
sha: Option<String>,
ok: Option<bool>,
limit: Option<u32>,
}
async fn runs(
State(s): State<ApiState>,
Query(q): Query<RunsQuery>,
) -> Result<Json<Vec<crate::store::RunRow>>, ApiError> {
let filter = RunFilter {
host: q.host,
model: q.model,
scenario: q.scenario,
sha: q.sha,
ok: q.ok,
limit: q.limit,
};
let store = s.lock().await;
store.runs(&filter).map(Json).map_err(err500)
}
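
For consumers of this API, `/api/series` takes `model` and `scenario` as required query parameters and `host` as an optional one. A small sketch of building such a URL with the standard library only; `encode` and `series_url` are hypothetical helpers, and a real client would more likely let its HTTP library encode the query.

```rust
/// Percent-encode everything except RFC 3986 unreserved characters.
fn encode(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for b in value.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{b:02X}")),
        }
    }
    out
}

/// Build a `/api/series` URL. `host` is appended only when given, matching
/// the optional `host` field of `SeriesQuery`.
fn series_url(base: &str, host: Option<&str>, model: &str, scenario: &str) -> String {
    let mut url = format!(
        "{}/api/series?model={}&scenario={}",
        base.trim_end_matches('/'),
        encode(model),
        encode(scenario)
    );
    if let Some(h) = host {
        url.push_str("&host=");
        url.push_str(&encode(h));
    }
    url
}
```

Model ids such as `Qwen/model-x` and scenario names such as `chat:128` contain characters that are safest encoded, which is why the sketch encodes every value rather than concatenating it raw.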

@@ -0,0 +1,163 @@
//! Outbound calls to a benchmark target: build identity, host discovery,
//! and warm-model enumeration. Neuron targets use the native neuron API;
//! `openai` targets use the OpenAI-compatible surface (preliminary).
use crate::config::{TargetConfig, TargetKind};
use anyhow::{Context, Result};
use cortex_core::build_info::BuildInfo;
use cortex_core::discovery::DiscoveryResponse;
use cortex_core::harness::ModelInfo;
use cortex_core::openai::ModelsResponse;
use std::time::Duration;
/// How long to wait on the cheap metadata polls (version/discovery/models).
const META_TIMEOUT: Duration = Duration::from_secs(10);
pub struct TargetClient {
http: reqwest::Client,
}
impl TargetClient {
pub fn new(request_timeout: Duration) -> Result<Self> {
let http = reqwest::Client::builder()
.timeout(request_timeout)
.build()
.context("building HTTP client")?;
Ok(TargetClient { http })
}
pub fn http(&self) -> &reqwest::Client {
&self.http
}
/// Chat-completions URL for the target.
pub fn chat_url(&self, target: &TargetConfig) -> String {
let base = target.endpoint.trim_end_matches('/');
match target.kind {
// neuron exposes OpenAI routes under /v1.
TargetKind::Neuron => format!("{base}/v1/chat/completions"),
// openai endpoint is the /v1 base already (bench.py convention).
TargetKind::Openai => format!("{base}/chat/completions"),
}
}
/// Build identity. Neuron: `GET /version`. Openai: a synthetic
/// placeholder keyed by `"external"` so the version-aware skip logic
/// treats it as one stable build (comparison runs are manual anyway).
pub async fn fetch_version(&self, target: &TargetConfig) -> Result<BuildInfo> {
match target.kind {
TargetKind::Neuron => {
let base = target.endpoint.trim_end_matches('/');
let info = self
.http
.get(format!("{base}/version"))
.timeout(META_TIMEOUT)
.send()
.await
.context("GET /version")?
.error_for_status()
.context("GET /version status")?
.json::<BuildInfo>()
.await
.context("decoding /version")?;
Ok(info)
}
TargetKind::Openai => {
let mut info = BuildInfo::unknown();
info.git_sha = "external".to_string();
Ok(info)
}
}
}
/// Host discovery (neuron only).
pub async fn fetch_discovery(
&self,
target: &TargetConfig,
) -> Result<Option<DiscoveryResponse>> {
if target.kind != TargetKind::Neuron {
return Ok(None);
}
let base = target.endpoint.trim_end_matches('/');
let disco = self
.http
.get(format!("{base}/discovery"))
.timeout(META_TIMEOUT)
.send()
.await
.context("GET /discovery")?
.error_for_status()
.context("GET /discovery status")?
.json::<DiscoveryResponse>()
.await
.context("decoding /discovery")?;
Ok(Some(disco))
}
/// Warm models — those ready to serve without a cold load.
///
/// Neuron: `GET /models` filtered to `status == "loaded"` (skips
/// `recovering`/`poisoned`). Openai: `GET /models`, honouring the
/// helexa `loaded` extension when present, else treating all listed
/// models as warm.
pub async fn warm_models(&self, target: &TargetConfig) -> Result<Vec<ModelInfo>> {
let base = target.endpoint.trim_end_matches('/');
match target.kind {
TargetKind::Neuron => {
let models = self
.http
.get(format!("{base}/models"))
.timeout(META_TIMEOUT)
.send()
.await
.context("GET /models")?
.error_for_status()
.context("GET /models status")?
.json::<Vec<ModelInfo>>()
.await
.context("decoding /models")?;
Ok(models
.into_iter()
.filter(|m| m.status == "loaded")
.collect())
}
TargetKind::Openai => {
let resp = self
.http
.get(format!("{base}/models"))
.timeout(META_TIMEOUT)
.send()
.await
.context("GET /models")?
.error_for_status()
.context("GET /models status")?
.json::<ModelsResponse>()
.await
.context("decoding /models")?;
Ok(resp
.data
.into_iter()
.filter(|m| {
// honour the helexa `loaded` extension if present
m.extra
.get("loaded")
.and_then(|v| v.as_bool())
.unwrap_or(true)
})
.map(|m| ModelInfo {
id: m.id,
harness: "openai".to_string(),
status: "loaded".to_string(),
devices: Vec::new(),
vram_used_mb: None,
capabilities: Vec::new(),
limit: None,
cost: None,
tool_call: false,
reasoning: false,
})
.collect())
}
}
}
}
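
Two conventions in this client are easy to get wrong when writing a target config: which base URL each target kind expects, and how the `loaded` extension defaults. A standalone sketch of both follows. It restates the logic of `chat_url` and the `warm_models` filter as free functions (the free-function signatures and `is_warm` are illustrative) so the resulting values are visible.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TargetKind {
    Neuron,
    Openai,
}

/// Neuron endpoints are the daemon root, so `/v1` is added here.
/// Openai endpoints already end in `/v1`, so only the route is appended.
fn chat_url(kind: TargetKind, endpoint: &str) -> String {
    let base = endpoint.trim_end_matches('/');
    match kind {
        TargetKind::Neuron => format!("{base}/v1/chat/completions"),
        TargetKind::Openai => format!("{base}/chat/completions"),
    }
}

/// A model with no `loaded` extension is treated as warm; only an explicit
/// `false` excludes it.
fn is_warm(loaded_extension: Option<bool>) -> bool {
    loaded_extension.unwrap_or(true)
}
```

A trailing slash on either endpoint is harmless, but pointing an `openai` target at the server root instead of its `/v1` base would yield a URL without the `/v1` segment.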

@@ -0,0 +1,240 @@
//! Bench configuration: loaded from `helexa-bench.toml` with figment,
//! `BENCH_`-prefixed env overrides (mirrors `NeuronConfig::load`).
use figment::{
Figment,
providers::{Env, Format, Toml},
};
use serde::{Deserialize, Serialize};
use std::path::Path;
use std::time::Duration;
/// Top-level bench config.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BenchConfig {
#[serde(default)]
pub bench: BenchSettings,
#[serde(default)]
pub scenarios: ScenarioConfig,
/// Read-only JSON API (consumed by the bench UI + programmatic access).
#[serde(default)]
pub api: ApiSettings,
/// Endpoints to benchmark. At least one is required for `run`/`once`.
#[serde(default)]
pub targets: Vec<TargetConfig>,
}
/// The read-only HTTP API the `run` daemon (and the `serve` subcommand)
/// exposes over the SQLite store.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ApiSettings {
/// Whether to bind the API at all.
#[serde(default = "default_api_enabled")]
pub enabled: bool,
/// Listen address for the API.
#[serde(default = "default_api_listen")]
pub listen: String,
}
impl Default for ApiSettings {
fn default() -> Self {
ApiSettings {
enabled: default_api_enabled(),
listen: default_api_listen(),
}
}
}
/// Loop/timing knobs.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BenchSettings {
/// Pause between full sweeps of all targets.
#[serde(default = "default_sweep_interval")]
pub sweep_interval_secs: u64,
/// Target number of measured samples to record for a given
/// (target, build SHA, model, scenario). Once met, later sweeps skip
/// that cell — so a fully-sampled build costs only cheap version
/// polls until a new SHA ships.
#[serde(default = "default_samples")]
pub samples_per_version: u32,
/// Pause between successive measured iterations against one model.
#[serde(default = "default_iter_pause")]
pub iteration_pause_secs: u64,
/// Per-request timeout (cold lazy-loads can be slow; generous like
/// bench.py's 600s default).
#[serde(default = "default_timeout")]
pub request_timeout_secs: u64,
/// SQLite system-of-record path.
#[serde(default = "default_db_path")]
pub db_path: String,
}
impl Default for BenchSettings {
fn default() -> Self {
BenchSettings {
sweep_interval_secs: default_sweep_interval(),
samples_per_version: default_samples(),
iteration_pause_secs: default_iter_pause(),
request_timeout_secs: default_timeout(),
db_path: default_db_path(),
}
}
}
impl BenchSettings {
pub fn iteration_pause(&self) -> Duration {
Duration::from_secs(self.iteration_pause_secs)
}
pub fn request_timeout(&self) -> Duration {
Duration::from_secs(self.request_timeout_secs)
}
pub fn sweep_interval(&self) -> Duration {
Duration::from_secs(self.sweep_interval_secs)
}
}
/// Which scenarios to run and their shared parameters.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ScenarioConfig {
/// Approximate prompt sizes (in tokens) — one chat-latency scenario
/// is generated per size, e.g. `chat:128`, `chat:4096`. This is the
/// per-cell dimension that the version-aware skip logic keys on.
#[serde(default = "default_prompt_sizes")]
pub prompt_sizes: Vec<u32>,
/// Max generated tokens per request.
#[serde(default = "default_max_tokens")]
pub max_tokens: u64,
}
impl Default for ScenarioConfig {
fn default() -> Self {
ScenarioConfig {
prompt_sizes: default_prompt_sizes(),
max_tokens: default_max_tokens(),
}
}
}
/// One endpoint to benchmark.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TargetConfig {
/// Stable label used as the engine column and in the DB.
pub name: String,
/// Which protocol/metadata surface the target exposes.
#[serde(default)]
pub kind: TargetKind,
/// Base URL. For `neuron`: the daemon root (e.g.
/// `http://beast.internal:13131`). For `openai`: the OpenAI `/v1`
/// base (e.g. `http://host:8080/v1`).
pub endpoint: String,
/// Optional display label override for reports (defaults to `name`).
#[serde(default)]
pub label: Option<String>,
}
impl TargetConfig {
pub fn display_label(&self) -> &str {
self.label.as_deref().unwrap_or(&self.name)
}
}
/// The two target surfaces. `neuron` gets rich build metadata and warm
/// model discovery via the native neuron API; `openai` is the seam for
/// later comparison against mistral.rs / llama.cpp / vLLM (phase 1
/// implements `neuron` fully; `openai` is preliminary plumbing).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, Default)]
#[serde(rename_all = "snake_case")]
pub enum TargetKind {
#[default]
Neuron,
Openai,
}
impl BenchConfig {
pub fn load(path: impl AsRef<Path>) -> Result<Self, Box<figment::Error>> {
Figment::new()
.merge(Toml::file(path))
.merge(Env::prefixed("BENCH_").split("__"))
.extract()
.map_err(Box::new)
}
}
fn default_sweep_interval() -> u64 {
1800
}
fn default_samples() -> u32 {
5
}
fn default_iter_pause() -> u64 {
2
}
fn default_timeout() -> u64 {
600
}
fn default_db_path() -> String {
"/var/lib/helexa-bench/bench.sqlite".to_string()
}
fn default_api_enabled() -> bool {
true
}
fn default_api_listen() -> String {
"0.0.0.0:13132".to_string()
}
fn default_prompt_sizes() -> Vec<u32> {
vec![128, 4096]
}
fn default_max_tokens() -> u64 {
256
}
#[cfg(test)]
// Jail's closure must return figment::Result; the large-Err type is
// figment's, not ours, so suppress the lint here.
#[allow(clippy::result_large_err)]
mod tests {
use super::*;
use figment::Jail;
#[test]
fn loads_minimal_with_defaults() {
Jail::expect_with(|jail| {
jail.create_file(
"helexa-bench.toml",
r#"
[[targets]]
name = "beast"
endpoint = "http://beast.internal:13131"
"#,
)?;
let cfg = BenchConfig::load("helexa-bench.toml").unwrap();
assert_eq!(cfg.targets.len(), 1);
assert_eq!(cfg.targets[0].kind, TargetKind::Neuron);
assert_eq!(cfg.bench.samples_per_version, 5);
assert_eq!(cfg.scenarios.prompt_sizes, vec![128, 4096]);
Ok(())
});
}
#[test]
fn env_overrides_apply() {
Jail::expect_with(|jail| {
jail.create_file(
"helexa-bench.toml",
r#"
[bench]
samples_per_version = 3
[[targets]]
name = "benjy"
kind = "openai"
endpoint = "http://benjy:8080/v1"
"#,
)?;
jail.set_env("BENCH_BENCH__SAMPLES_PER_VERSION", "9");
let cfg = BenchConfig::load("helexa-bench.toml").unwrap();
assert_eq!(cfg.bench.samples_per_version, 9);
assert_eq!(cfg.targets[0].kind, TargetKind::Openai);
Ok(())
});
}
}
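
The `env_overrides_apply` test relies on how the `BENCH_` prefix and the `__` separator map an environment variable onto a nested config key. A rough sketch of that mapping, written without figment; `env_key_path` is a hypothetical helper that approximates the behaviour the test demonstrates and, as a simplification, matches the prefix case-sensitively.

```rust
/// Map an environment variable name to the nested config key it overrides:
/// strip the `BENCH_` prefix, split on `__`, lowercase each segment.
/// Returns `None` for variables that do not carry the prefix.
fn env_key_path(var: &str) -> Option<Vec<String>> {
    let rest = var.strip_prefix("BENCH_")?;
    Some(
        rest.split("__")
            .map(|segment| segment.to_ascii_lowercase())
            .collect(),
    )
}
```

The doubled `BENCH_BENCH__` in the test is therefore not a typo: the first `BENCH_` is the prefix and the second `BENCH` is the `[bench]` table.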

@@ -0,0 +1,13 @@
//! helexa-bench — a continuous, version-aware benchmark harness for the
//! neuron fleet. It hits each neuron directly, exercises an extensible
//! scenario suite against every warm model, and records each run with
//! full build/version provenance into SQLite so improvements can be
//! tracked automatically across neuron implementation updates.
pub mod api;
pub mod client;
pub mod config;
pub mod report;
pub mod scenario;
pub mod store;
pub mod sweep;

@@ -0,0 +1,153 @@
//! helexa-bench CLI.
//!
//! - `run`: continuous daemon (the systemd default) that sweeps, sleeps, and repeats.
//! - `once`: a single sweep, then exit (manual / CI).
//! - `serve`: the read-only JSON API only, with no sweeping.
//! - `report`: render the SQLite store as a results table.
//!
//! Runs on a single-threaded runtime: the workload is batch-1 sequential
//! (one request at a time, the regime we measure), and it lets the
//! SQLite connection live across awaits without `Sync` gymnastics.
use anyhow::{Context, Result};
use clap::{Parser, Subcommand};
use helexa_bench::api;
use helexa_bench::config::BenchConfig;
use helexa_bench::report;
use helexa_bench::store::Store;
use helexa_bench::sweep::Sweeper;
use tracing_subscriber::EnvFilter;
#[derive(Parser)]
#[command(name = "helexa-bench")]
#[command(about = "Continuous version-aware benchmark harness for the neuron fleet")]
#[command(version)]
struct Cli {
#[command(subcommand)]
command: Command,
}
#[derive(Subcommand)]
enum Command {
/// Run sweeps continuously, pausing `sweep_interval_secs` between them.
Run {
#[arg(short, long, default_value = "helexa-bench.toml")]
config: String,
},
/// Run a single sweep over all targets, then exit.
Once {
#[arg(short, long, default_value = "helexa-bench.toml")]
config: String,
},
/// Serve the read-only JSON API only (no sweeping).
Serve {
#[arg(short, long, default_value = "helexa-bench.toml")]
config: String,
},
/// Render recorded results. Uses `--db` if given, else the db_path
/// from `--config`.
Report {
#[arg(short, long, default_value = "helexa-bench.toml")]
config: String,
/// Override the SQLite path (skips reading the config file).
#[arg(long)]
db: Option<String>,
/// Output format.
#[arg(long, default_value = "md")]
format: Format,
},
}
#[derive(Clone, Copy, clap::ValueEnum)]
enum Format {
Md,
Json,
}
fn main() -> Result<()> {
tracing_subscriber::fmt()
.with_env_filter(
EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")),
)
.init();
let cli = Cli::parse();
let rt = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()
.context("building tokio runtime")?;
rt.block_on(run(cli))
}
async fn run(cli: Cli) -> Result<()> {
match cli.command {
Command::Run { config } => {
let cfg = load_config(&config)?;
require_targets(&cfg)?;
// Bind the read API alongside the sweep loop (one bob service
// does both). Its own store connection; WAL keeps the sweep
// writer and the API readers from blocking each other.
if cfg.api.enabled {
let state = api::open_state(&cfg.bench.db_path)?;
let listen = cfg.api.listen.clone();
tokio::spawn(async move {
if let Err(e) = api::serve(&listen, state).await {
tracing::error!(error = %format!("{e:#}"), "bench API server exited");
}
});
}
let sweeper = Sweeper::new(cfg)?;
tracing::info!("helexa-bench started; entering continuous sweep loop");
sweeper.run_forever().await
}
Command::Serve { config } => {
let cfg = load_config(&config)?;
if !cfg.api.enabled {
anyhow::bail!("[api] enabled = false — nothing to serve");
}
let state = api::open_state(&cfg.bench.db_path)?;
tracing::info!("helexa-bench serving API only");
api::serve(&cfg.api.listen, state).await
}
Command::Once { config } => {
let cfg = load_config(&config)?;
require_targets(&cfg)?;
let sweeper = Sweeper::new(cfg)?;
let summary = sweeper.run_once().await?;
tracing::info!(
measured = summary.measured,
skipped = summary.skipped,
failed = summary.failed,
unreachable = summary.targets_unreachable,
"single sweep complete"
);
Ok(())
}
Command::Report { config, db, format } => {
let db_path = match db {
Some(p) => p,
None => load_config(&config)?.bench.db_path,
};
let store = Store::open(&db_path)?;
let rows = store.report_rows()?;
let rendered = match format {
Format::Md => report::render_markdown(&rows),
Format::Json => report::render_json(&rows)?,
};
println!("{rendered}");
Ok(())
}
}
}
fn load_config(path: &str) -> Result<BenchConfig> {
BenchConfig::load(path)
.map_err(|e| anyhow::anyhow!("{e}"))
.with_context(|| format!("loading config {path}"))
}
fn require_targets(cfg: &BenchConfig) -> Result<()> {
if cfg.targets.is_empty() {
anyhow::bail!("no targets configured — add at least one [[targets]] entry");
}
Ok(())
}

@@ -0,0 +1,109 @@
//! Render the SQLite store as a results table — the automated
//! replacement for hand-editing `doc/benchmarks.md`. Columns match that
//! doc: engine, model, prompt tok, TTFT (s), decode tok/s, total (s),
//! plus the build SHA each cell was measured against and the sample count `n`.
use crate::store::ReportRow;
use anyhow::Result;
pub fn render_markdown(rows: &[ReportRow]) -> String {
let mut out = String::new();
out.push_str(
"| engine | model | prompt tok | TTFT (s) | decode tok/s | total (s) | build | n |\n",
);
out.push_str("|---|---|---:|---:|---:|---:|---|---:|\n");
for r in rows {
let ptok = r
.prompt_tokens
.map(|t| t.to_string())
.unwrap_or_else(|| format!("~{}", r.prompt_size_approx));
out.push_str(&format!(
"| {} | {} | {} | {} | {} | {} | `{}` | {} |\n",
r.target_name,
r.model_id,
ptok,
fmt_opt(r.ttft_s_median, 3),
fmt_opt(r.decode_tps_median, 1),
fmt_opt(r.total_s_median, 3),
r.git_sha,
r.samples,
));
}
out
}
pub fn render_json(rows: &[ReportRow]) -> Result<String> {
let arr: Vec<serde_json::Value> = rows
.iter()
.map(|r| {
serde_json::json!({
"engine": r.target_name,
"model": r.model_id,
"scenario": r.scenario_id,
"prompt_size_approx": r.prompt_size_approx,
"prompt_tokens": r.prompt_tokens,
"ttft_s_median": r.ttft_s_median,
"decode_tps_median": r.decode_tps_median,
"total_s_median": r.total_s_median,
"git_sha": r.git_sha,
"samples": r.samples,
"gpu": r.gpu,
})
})
.collect();
Ok(serde_json::to_string_pretty(&arr)?)
}
fn fmt_opt(v: Option<f64>, places: usize) -> String {
match v {
Some(x) => format!("{x:.places$}"),
// A plain hyphen marks a metric with no usable samples.
None => "-".to_string(),
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn markdown_has_header_and_row() {
let rows = vec![ReportRow {
target_name: "beast".into(),
model_id: "Qwen/Qwen3.6-27B".into(),
scenario_id: "chat:128".into(),
prompt_size_approx: 128,
git_sha: "30d50d6".into(),
prompt_tokens: Some(130),
ttft_s_median: Some(0.123),
decode_tps_median: Some(45.6),
total_s_median: Some(1.234),
samples: 5,
gpu: Some("2× RTX 5090".into()),
}];
let md = render_markdown(&rows);
assert!(md.contains("| engine |"));
assert!(md.contains("beast"));
assert!(md.contains("`30d50d6`"));
assert!(md.contains("0.123"));
}
#[test]
fn missing_decode_renders_dash() {
let rows = vec![ReportRow {
target_name: "benjy".into(),
model_id: "m".into(),
scenario_id: "chat:128".into(),
prompt_size_approx: 128,
git_sha: "abc".into(),
prompt_tokens: None,
ttft_s_median: Some(0.1),
decode_tps_median: None,
total_s_median: Some(0.5),
samples: 1,
gpu: None,
}];
let md = render_markdown(&rows);
assert!(md.contains("~128"));
// The missing decode rate must show up as its own placeholder cell.
assert!(md.contains("| - |"));
}
}
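The report cells are formatted with a runtime precision (`{x:.places$}`), which is easy to misread. A minimal standalone sketch of that formatting follows. `fmt_cell` and `render_row` are hypothetical names for illustration, and the placeholder for a missing value is passed in explicitly rather than assumed.

```rust
/// Hypothetical stand-in for the report's cell formatter: fixed decimal
/// places for a present value, a caller-chosen placeholder otherwise.
fn fmt_cell(v: Option<f64>, places: usize, placeholder: &str) -> String {
    match v {
        // `.places$` reads the precision from the `places` variable.
        Some(x) => format!("{x:.places$}"),
        None => placeholder.to_string(),
    }
}

/// Renders a three-column markdown row: engine, TTFT (3 dp), decode tok/s (1 dp).
fn render_row(engine: &str, ttft_s: Option<f64>, decode_tps: Option<f64>) -> String {
    format!(
        "| {} | {} | {} |",
        engine,
        fmt_cell(ttft_s, 3, "-"),
        fmt_cell(decode_tps, 1, "-"),
    )
}
```

Seconds get three decimal places and rates get one, matching the precisions `render_markdown` passes above.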

@@ -0,0 +1,238 @@
//! The extensible test suite.
//!
//! A [`Scenario`] puts one warm model through one shaped request and
//! reports operator-felt metrics (TTFT, decode tok/s, total). Phase 1
//! ships the chat-latency family ported faithfully from `script/bench.py`;
//! the trait is the seam for future families (vision, concurrency,
//! long-generation, cold-start) selected per model via [`Scenario::applies_to`].
use crate::config::ScenarioConfig;
use anyhow::{Context, Result, anyhow};
use async_trait::async_trait;
use cortex_core::harness::ModelInfo;
use cortex_core::openai::ChatCompletionChunk;
use eventsource_stream::Eventsource;
use futures::StreamExt;
use serde_json::json;
use std::time::{Duration, Instant};
/// A paragraph of filler re-used to synthesise prompts of a target
/// approximate token count (~4 chars/token heuristic — close enough for
/// bucketing; real token counts are read back from the usage object).
/// Mirrors `script/bench.py::FILLER`.
const FILLER: &str = "The quick brown fox jumps over the lazy dog while the band plays \
a slow waltz in the background and somebody counts the beats. ";
/// `/no_think`: Qwen3-family soft switch keeping thinking models from
/// burning the token budget invisibly. Harmless for non-thinking models.
const QUESTION: &str = "\n\nRetell the scene above as a vivid story of about 300 words. /no_think";
/// Build a synthetic prompt of approximately `approx_tokens` tokens.
/// Ported from `bench.py::build_prompt`.
pub fn build_prompt(approx_tokens: u32) -> String {
let target_chars = (approx_tokens.max(16) as usize) * 4;
let reps = target_chars / FILLER.len() + 1;
let mut body = FILLER.repeat(reps);
body.truncate(target_chars);
body.push_str(QUESTION);
body
}
/// Per-request inputs shared by every scenario.
pub struct RunCtx<'a> {
pub client: &'a reqwest::Client,
/// Fully-qualified chat-completions URL for the target.
pub chat_url: String,
pub model_id: String,
pub max_tokens: u64,
pub timeout: Duration,
}
/// Operator-felt metrics for a single measured request.
#[derive(Debug, Clone)]
pub struct ScenarioMetrics {
/// Time to first content chunk (seconds).
pub ttft_s: f64,
/// Completion tokens / decode window. `None` when the window is too
/// short to be honest (≤ 200 ms), matching bench.py.
pub decode_tps: Option<f64>,
/// Wall-clock for the whole request (seconds).
pub total_s: f64,
/// Prompt tokens from the final `usage` object, if the server sent one.
pub prompt_tokens: Option<u64>,
/// Completion tokens: from `usage` when present, else content-chunk count.
pub completion_tokens: u64,
}
#[async_trait]
pub trait Scenario: Send + Sync {
/// Stable id, e.g. `chat:128`. Forms one dimension of the version-aware
/// skip key and is recorded against every run.
fn id(&self) -> &str;
/// Approximate prompt size in tokens (the cell dimension), recorded
/// for reporting.
fn prompt_size(&self) -> u32;
/// Whether this scenario should run against the given model. Default
/// runs against everything; vision/audio scenarios will gate on
/// [`ModelInfo::capabilities`].
fn applies_to(&self, _model: &ModelInfo) -> bool {
true
}
/// Issue one shaped request and measure it.
async fn run(&self, ctx: &RunCtx) -> Result<ScenarioMetrics>;
}
/// Build the active scenario set from config. One chat-latency scenario
/// per configured prompt size.
pub fn build_scenarios(cfg: &ScenarioConfig) -> Vec<Box<dyn Scenario>> {
cfg.prompt_sizes
.iter()
.map(|&size| {
Box::new(ChatLatencyScenario {
id: format!("chat:{size}"),
approx_prompt_tokens: size,
}) as Box<dyn Scenario>
})
.collect()
}
/// Streamed single-request chat-completions latency probe — the batch-1
/// regime bench.py measures.
pub struct ChatLatencyScenario {
id: String,
approx_prompt_tokens: u32,
}
#[async_trait]
impl Scenario for ChatLatencyScenario {
fn id(&self) -> &str {
&self.id
}
fn prompt_size(&self) -> u32 {
self.approx_prompt_tokens
}
async fn run(&self, ctx: &RunCtx) -> Result<ScenarioMetrics> {
let prompt = build_prompt(self.approx_prompt_tokens);
let payload = json!({
"model": ctx.model_id,
"messages": [{"role": "user", "content": prompt}],
"max_tokens": ctx.max_tokens,
"temperature": 0,
"stream": true,
"stream_options": {"include_usage": true},
});
let fut = stream_and_measure(ctx, &payload);
tokio::time::timeout(ctx.timeout, fut)
.await
.map_err(|_| anyhow!("request timed out after {:?}", ctx.timeout))?
}
}
/// The SSE-timing core, ported from `bench.py::one_run`. Kept free of the
/// `Scenario` trait so it can be unit-tested directly against a mock SSE endpoint.
async fn stream_and_measure(
ctx: &RunCtx<'_>,
payload: &serde_json::Value,
) -> Result<ScenarioMetrics> {
let start = Instant::now();
let resp = ctx
.client
.post(&ctx.chat_url)
.json(payload)
.send()
.await
.context("sending chat request")?;
if !resp.status().is_success() {
let status = resp.status();
let body = resp.text().await.unwrap_or_default();
return Err(anyhow!("upstream returned {status}: {}", body.trim()));
}
let mut stream = resp.bytes_stream().eventsource();
let mut first: Option<Instant> = None;
let mut last: Option<Instant> = None;
let mut chunk_count: u64 = 0;
let mut prompt_tokens: Option<u64> = None;
let mut completion_tokens: Option<u64> = None;
while let Some(event) = stream.next().await {
let event = event.context("reading SSE stream")?;
let now = Instant::now();
let data = event.data.trim();
if data.is_empty() || data == "[DONE]" {
continue;
}
let chunk: ChatCompletionChunk = match serde_json::from_str(data) {
Ok(c) => c,
Err(_) => continue, // tolerate non-JSON keepalive frames
};
if let Some(choice) = chunk.choices.first()
&& choice
.delta
.get("content")
.and_then(|c| c.as_str())
.is_some_and(|s| !s.is_empty())
{
if first.is_none() {
first = Some(now);
}
last = Some(now);
chunk_count += 1;
}
if let Some(usage) = chunk.usage {
prompt_tokens = Some(usage.prompt_tokens);
completion_tokens = Some(usage.completion_tokens);
}
}
let end = Instant::now();
let first = first.ok_or_else(|| anyhow!("no content chunks received"))?;
// neuron emits one SSE chunk per visible token, so chunk_count is an
// engine-truth count when no usage frame is sent.
let tokens = completion_tokens.filter(|&t| t > 0).unwrap_or(chunk_count);
// decode rate is only meaningful over a real inter-chunk window.
let window = last
.filter(|&l| l > first)
.map(|l| (l - first).as_secs_f64())
.unwrap_or(0.0);
Ok(ScenarioMetrics {
ttft_s: (first - start).as_secs_f64(),
decode_tps: if window > 0.2 {
Some(tokens as f64 / window)
} else {
None
},
total_s: (end - start).as_secs_f64(),
prompt_tokens,
completion_tokens: tokens,
})
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn prompt_grows_with_token_target() {
let small = build_prompt(128);
let big = build_prompt(4096);
assert!(big.len() > small.len());
// ~4 chars/token + the trailing question.
assert!(small.len() >= 128 * 4);
assert!(small.ends_with("/no_think"));
}
#[test]
fn prompt_floor_for_tiny_targets() {
// max(approx,16) floor means even 0 yields a non-trivial prompt.
let p = build_prompt(0);
assert!(p.len() >= 16 * 4);
}
}
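The timing arithmetic at the end of `stream_and_measure` can be checked in isolation. The sketch below restates it with plain `f64` offsets (seconds since the request was sent) instead of `Instant`s. `Metrics`, `derive_metrics`, and `MIN_DECODE_WINDOW_S` are hypothetical names for this illustration; the 0.2 s threshold is the one used above.

```rust
#[derive(Debug, PartialEq)]
struct Metrics {
    ttft_s: f64,
    decode_tps: Option<f64>,
    total_s: f64,
}

/// Decode rate is only reported when the inter-chunk window exceeds this.
const MIN_DECODE_WINDOW_S: f64 = 0.2;

/// All arguments are offsets in seconds from the moment the request was sent:
/// first content chunk, last content chunk, and end of stream.
fn derive_metrics(first_s: f64, last_s: f64, end_s: f64, tokens: u64) -> Metrics {
    // A single content chunk gives last == first, hence a zero window.
    let window = if last_s > first_s { last_s - first_s } else { 0.0 };
    Metrics {
        ttft_s: first_s,
        decode_tps: (window > MIN_DECODE_WINDOW_S).then(|| tokens as f64 / window),
        total_s: end_s,
    }
}
```

With a first chunk at 0.5 s, a last chunk at 2.5 s, and 100 tokens, the window is 2.0 s and the rate is 50 tok/s. A 0.125 s window yields no rate at all rather than an inflated one.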

@@ -0,0 +1,768 @@
//! SQLite system-of-record. One row per measured iteration, keyed so a
//! benchmark can be attributed to the exact neuron build that produced
//! it. Replaces hand edits to `doc/benchmarks.md`.
//!
//! Calls are synchronous (SQLite is local and the sweep is batch-1
//! sequential), so the connection is used inline between `await` points,
//! never held across one.
use anyhow::{Context, Result};
use rusqlite::{Connection, OptionalExtension, params};
use std::path::Path;
/// A single measured (or failed) iteration, with full provenance.
#[derive(Debug, Clone)]
pub struct RunRecord {
pub ts: String, // RFC3339
// target
pub target_name: String,
pub target_kind: String,
pub endpoint: String,
// host (from /discovery)
pub hostname: Option<String>,
pub driver_version: Option<String>,
pub cuda_version: Option<String>,
pub gpus_json: Option<String>,
// neuron build (from /version)
pub git_sha: String,
pub git_sha_long: Option<String>,
pub package_version: String,
pub git_dirty: bool,
pub build_timestamp: Option<String>,
pub rustc_version: Option<String>,
pub profile: Option<String>,
pub features_json: String,
pub candle_version: Option<String>,
// bench's own build
pub bench_version: String,
pub bench_sha: String,
// model
pub model_id: String,
pub harness: String,
pub capabilities_json: String,
pub devices_json: String,
// scenario
pub scenario_id: String,
pub prompt_size_approx: u32,
pub prompt_tokens_actual: Option<u64>,
pub max_tokens: u64,
// metrics
pub ttft_s: Option<f64>,
pub decode_tps: Option<f64>,
pub total_s: Option<f64>,
pub completion_tokens: Option<u64>,
// outcome
pub ok: bool,
pub error: Option<String>,
}
pub struct Store {
conn: Connection,
}
impl Store {
/// Open (creating parent dirs + schema as needed).
pub fn open(path: impl AsRef<Path>) -> Result<Self> {
let path = path.as_ref();
if let Some(parent) = path.parent()
&& !parent.as_os_str().is_empty()
{
std::fs::create_dir_all(parent)
.with_context(|| format!("creating db dir {}", parent.display()))?;
}
let conn = Connection::open(path)
.with_context(|| format!("opening sqlite db {}", path.display()))?;
Self::init(&conn)?;
Ok(Store { conn })
}
/// In-memory store for tests.
#[cfg(test)]
pub fn open_in_memory() -> Result<Self> {
let conn = Connection::open_in_memory()?;
Self::init(&conn)?;
Ok(Store { conn })
}
fn init(conn: &Connection) -> Result<()> {
conn.execute_batch(
r#"
-- WAL so the read-only API connection never blocks the
-- sweep writer (and vice versa).
PRAGMA journal_mode=WAL;
CREATE TABLE IF NOT EXISTS runs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ts TEXT NOT NULL,
target_name TEXT NOT NULL,
target_kind TEXT NOT NULL,
endpoint TEXT NOT NULL,
hostname TEXT,
driver_version TEXT,
cuda_version TEXT,
gpus_json TEXT,
git_sha TEXT NOT NULL,
git_sha_long TEXT,
package_version TEXT NOT NULL,
git_dirty INTEGER NOT NULL,
build_timestamp TEXT,
rustc_version TEXT,
profile TEXT,
features_json TEXT NOT NULL,
candle_version TEXT,
bench_version TEXT NOT NULL,
bench_sha TEXT NOT NULL,
model_id TEXT NOT NULL,
harness TEXT NOT NULL,
capabilities_json TEXT NOT NULL,
devices_json TEXT NOT NULL,
scenario_id TEXT NOT NULL,
prompt_size_approx INTEGER NOT NULL,
prompt_tokens_actual INTEGER,
max_tokens INTEGER NOT NULL,
ttft_s REAL,
decode_tps REAL,
total_s REAL,
completion_tokens INTEGER,
ok INTEGER NOT NULL,
error TEXT
);
-- The version-aware skip query keys on this tuple. scenario_id
-- encodes the prompt size (chat:<n>), so it subsumes the cell.
CREATE INDEX IF NOT EXISTS idx_runs_cell
ON runs (target_name, git_sha, model_id, scenario_id, ok);
"#,
)
.context("initialising sqlite schema")?;
Ok(())
}
/// Count successful samples already recorded for a cell. Only `ok`
/// rows count toward the per-version target so transient failures
/// don't permanently starve a cell.
pub fn count_samples(
&self,
target_name: &str,
git_sha: &str,
model_id: &str,
scenario_id: &str,
) -> Result<u32> {
let n: i64 = self.conn.query_row(
"SELECT COUNT(*) FROM runs WHERE target_name=?1 AND git_sha=?2 \
AND model_id=?3 AND scenario_id=?4 AND ok=1",
params![target_name, git_sha, model_id, scenario_id],
|row| row.get(0),
)?;
Ok(n as u32)
}
pub fn insert_run(&self, r: &RunRecord) -> Result<()> {
self.conn.execute(
"INSERT INTO runs (
ts, target_name, target_kind, endpoint,
hostname, driver_version, cuda_version, gpus_json,
git_sha, git_sha_long, package_version, git_dirty,
build_timestamp, rustc_version, profile, features_json, candle_version,
bench_version, bench_sha,
model_id, harness, capabilities_json, devices_json,
scenario_id, prompt_size_approx, prompt_tokens_actual, max_tokens,
ttft_s, decode_tps, total_s, completion_tokens,
ok, error
) VALUES (
?1, ?2, ?3, ?4,
?5, ?6, ?7, ?8,
?9, ?10, ?11, ?12,
?13, ?14, ?15, ?16, ?17,
?18, ?19,
?20, ?21, ?22, ?23,
?24, ?25, ?26, ?27,
?28, ?29, ?30, ?31,
?32, ?33
)",
params![
r.ts,
r.target_name,
r.target_kind,
r.endpoint,
r.hostname,
r.driver_version,
r.cuda_version,
r.gpus_json,
r.git_sha,
r.git_sha_long,
r.package_version,
r.git_dirty as i64,
r.build_timestamp,
r.rustc_version,
r.profile,
r.features_json,
r.candle_version,
r.bench_version,
r.bench_sha,
r.model_id,
r.harness,
r.capabilities_json,
r.devices_json,
r.scenario_id,
r.prompt_size_approx,
r.prompt_tokens_actual,
r.max_tokens,
r.ttft_s,
r.decode_tps,
r.total_s,
r.completion_tokens,
r.ok as i64,
r.error,
],
)?;
Ok(())
}
/// One reportable cell: the median metrics over the most-recently-seen
/// build SHA for each (target, model, scenario).
pub fn report_rows(&self) -> Result<Vec<ReportRow>> {
// For each (target, model, scenario), find the SHA of the latest
// successful run, then median that SHA's samples.
let mut stmt = self.conn.prepare(
"SELECT target_name, model_id, scenario_id, prompt_size_approx, git_sha,
ttft_s, decode_tps, total_s, prompt_tokens_actual, gpus_json
FROM runs
WHERE ok=1
ORDER BY target_name, model_id, scenario_id, id",
)?;
let rows = stmt.query_map([], |row| {
Ok(RawRow {
target_name: row.get(0)?,
model_id: row.get(1)?,
scenario_id: row.get(2)?,
prompt_size_approx: row.get(3)?,
git_sha: row.get(4)?,
ttft_s: row.get(5)?,
decode_tps: row.get(6)?,
total_s: row.get(7)?,
prompt_tokens_actual: row.get(8)?,
gpus_json: row.get(9)?,
})
})?;
let raws: Vec<RawRow> = rows.collect::<rusqlite::Result<_>>()?;
Ok(aggregate(raws))
}
// ── Read API surface (consumed by api.rs) ─────────────────────────
/// Total recorded runs (for `/api/health`).
pub fn run_count(&self) -> Result<u64> {
let n: i64 = self
.conn
.query_row("SELECT COUNT(*) FROM runs", [], |row| row.get(0))?;
Ok(n as u64)
}
/// Distinct hosts / models / scenarios / builds, for populating UI
/// filters. Builds are ordered chronologically by build timestamp
/// (falling back to first-seen wall-clock).
pub fn dimensions(&self) -> Result<Dimensions> {
let col = |sql: &str| -> Result<Vec<String>> {
let mut stmt = self.conn.prepare(sql)?;
let rows = stmt.query_map([], |r| r.get::<_, String>(0))?;
Ok(rows.collect::<rusqlite::Result<_>>()?)
};
let hosts = col("SELECT DISTINCT target_name FROM runs ORDER BY target_name")?;
let models = col("SELECT DISTINCT model_id FROM runs ORDER BY model_id")?;
let scenarios = col("SELECT DISTINCT scenario_id FROM runs ORDER BY scenario_id")?;
let mut stmt = self.conn.prepare(
"SELECT git_sha, MAX(build_timestamp), MAX(package_version), MIN(COALESCE(build_timestamp, ts)) AS ord
FROM runs GROUP BY git_sha ORDER BY ord",
)?;
let builds = stmt
.query_map([], |r| {
Ok(BuildRef {
git_sha: r.get(0)?,
build_timestamp: r.get(1)?,
package_version: r.get(2)?,
})
})?
.collect::<rusqlite::Result<_>>()?;
// host/model → GPU label, taken from each one's most recent run.
let gpu_map = |group_col: &str| -> Result<std::collections::HashMap<String, String>> {
let sql = format!(
"SELECT {group_col}, gpus_json FROM runs \
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY {group_col})"
);
let mut stmt = self.conn.prepare(&sql)?;
let rows = stmt.query_map([], |r| {
Ok((r.get::<_, String>(0)?, r.get::<_, Option<String>>(1)?))
})?;
let mut out = std::collections::HashMap::new();
for row in rows {
let (key, gpus) = row?;
if let Some(label) = gpus.as_deref().and_then(gpu_label) {
out.insert(key, label);
}
}
Ok(out)
};
let host_gpus = gpu_map("target_name")?;
let model_gpus = gpu_map("model_id")?;
Ok(Dimensions {
hosts,
models,
scenarios,
builds,
host_gpus,
model_gpus,
})
}
/// Latest-SHA-per-cell medians (the report table as JSON).
pub fn summary(&self) -> Result<Vec<ReportRow>> {
self.report_rows()
}
/// Per-build median metrics for one (model, scenario) cell, ordered
/// chronologically by build — the "over time" series. `host` is
/// optional: when omitted it resolves to the host with the most recent
/// run for this (model, scenario). Each model is served by a single
/// host today, so this yields a coherent single-host series and lets
/// callers (the public UI) select by model alone.
pub fn series(
&self,
host: Option<&str>,
model: &str,
scenario: &str,
) -> Result<Vec<SeriesPoint>> {
let host = match host {
Some(h) => h.to_string(),
None => {
let resolved: Option<String> = self
.conn
.query_row(
"SELECT target_name FROM runs WHERE ok=1 AND model_id=?1 \
AND scenario_id=?2 ORDER BY id DESC LIMIT 1",
params![model, scenario],
|r| r.get(0),
)
.optional()?;
match resolved {
Some(h) => h,
None => return Ok(Vec::new()),
}
}
};
let mut stmt = self.conn.prepare(
"SELECT git_sha, build_timestamp, package_version, ttft_s, decode_tps, total_s, ts
FROM runs
WHERE ok=1 AND target_name=?1 AND model_id=?2 AND scenario_id=?3
ORDER BY id",
)?;
let raws: Vec<SeriesRaw> = stmt
.query_map(params![host, model, scenario], |r| {
Ok(SeriesRaw {
git_sha: r.get(0)?,
build_timestamp: r.get(1)?,
package_version: r.get(2)?,
ttft_s: r.get(3)?,
decode_tps: r.get(4)?,
total_s: r.get(5)?,
ts: r.get(6)?,
})
})?
.collect::<rusqlite::Result<_>>()?;
Ok(aggregate_series(raws))
}
/// Raw rows, optionally filtered. For drill-down + programmatic access.
pub fn runs(&self, f: &RunFilter) -> Result<Vec<RunRow>> {
let mut sql = String::from(
"SELECT id, ts, target_name, hostname, git_sha, build_timestamp, package_version,
model_id, harness, scenario_id, prompt_size_approx, prompt_tokens_actual,
max_tokens, ttft_s, decode_tps, total_s, completion_tokens, ok, error,
gpus_json
FROM runs",
);
let mut conds: Vec<String> = Vec::new();
let mut args: Vec<Box<dyn rusqlite::ToSql>> = Vec::new();
let bind = |col: &str,
val: Option<&str>,
conds: &mut Vec<String>,
args: &mut Vec<Box<dyn rusqlite::ToSql>>| {
if let Some(v) = val {
args.push(Box::new(v.to_string()));
conds.push(format!("{col}=?{}", args.len()));
}
};
bind("target_name", f.host.as_deref(), &mut conds, &mut args);
bind("model_id", f.model.as_deref(), &mut conds, &mut args);
bind("scenario_id", f.scenario.as_deref(), &mut conds, &mut args);
bind("git_sha", f.sha.as_deref(), &mut conds, &mut args);
if let Some(ok) = f.ok {
args.push(Box::new(ok as i64));
conds.push(format!("ok=?{}", args.len()));
}
if !conds.is_empty() {
sql.push_str(" WHERE ");
sql.push_str(&conds.join(" AND "));
}
sql.push_str(" ORDER BY id DESC");
let limit = f.limit.unwrap_or(500).min(5000);
args.push(Box::new(limit as i64));
sql.push_str(&format!(" LIMIT ?{}", args.len()));
let mut stmt = self.conn.prepare(&sql)?;
let rows = stmt
.query_map(rusqlite::params_from_iter(args.iter()), |r| {
let gpus_json: Option<String> = r.get(19)?;
Ok(RunRow {
id: r.get(0)?,
ts: r.get(1)?,
host: r.get(2)?,
gpu: gpus_json.as_deref().and_then(gpu_label),
hostname: r.get(3)?,
git_sha: r.get(4)?,
build_timestamp: r.get(5)?,
package_version: r.get(6)?,
model_id: r.get(7)?,
harness: r.get(8)?,
scenario_id: r.get(9)?,
prompt_size_approx: r.get(10)?,
prompt_tokens_actual: r.get(11)?,
max_tokens: r.get(12)?,
ttft_s: r.get(13)?,
decode_tps: r.get(14)?,
total_s: r.get(15)?,
completion_tokens: r.get(16)?,
ok: r.get::<_, i64>(17)? != 0,
error: r.get(18)?,
})
})?
.collect::<rusqlite::Result<_>>()?;
Ok(rows)
}
}
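`Store::runs` builds its `WHERE` clause from whichever filters are set, numbering each placeholder by its position in the argument list. A standalone sketch of that pattern, without rusqlite, is shown below. `build_runs_query` is a hypothetical function, `SELECT *` stands in for the real column list, and arguments are kept as strings purely for illustration.

```rust
/// Hypothetical sketch: returns the SQL text and the positional arguments
/// for a set of optional equality filters plus a clamped row limit.
fn build_runs_query(
    filters: &[(&str, Option<&str>)],
    limit: Option<u32>,
) -> (String, Vec<String>) {
    let mut sql = String::from("SELECT * FROM runs");
    let mut conds: Vec<String> = Vec::new();
    let mut args: Vec<String> = Vec::new();
    for (col, val) in filters {
        if let Some(v) = val {
            // Push first, so args.len() is this value's 1-based index.
            args.push(v.to_string());
            conds.push(format!("{col}=?{}", args.len()));
        }
    }
    if !conds.is_empty() {
        sql.push_str(" WHERE ");
        sql.push_str(&conds.join(" AND "));
    }
    sql.push_str(" ORDER BY id DESC");
    // Default of 500 rows, hard cap of 5000, as in `Store::runs`.
    let limit = limit.unwrap_or(500).min(5000);
    args.push(limit.to_string());
    sql.push_str(&format!(" LIMIT ?{}", args.len()));
    (sql, args)
}
```

Column names are interpolated into the SQL text, so this is only safe while they come from fixed strings in the code, as they do above. User-supplied values always travel as bound parameters.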
// ── Read-API serde types ──────────────────────────────────────────────
#[derive(Debug, Clone, serde::Serialize)]
pub struct Dimensions {
pub hosts: Vec<String>,
pub models: Vec<String>,
pub scenarios: Vec<String>,
pub builds: Vec<BuildRef>,
/// host → GPU label (latest run), so the UI can show the GPU as the
/// resource name instead of the internal hostname.
pub host_gpus: std::collections::HashMap<String, String>,
/// model → GPU label (latest run); model maps to one host today.
pub model_gpus: std::collections::HashMap<String, String>,
}
#[derive(Debug, Clone, serde::Serialize)]
pub struct BuildRef {
pub git_sha: String,
pub build_timestamp: Option<String>,
pub package_version: Option<String>,
}
#[derive(Debug, Clone, serde::Serialize)]
pub struct SeriesPoint {
pub git_sha: String,
pub build_timestamp: Option<String>,
pub package_version: Option<String>,
pub ttft_s_median: Option<f64>,
pub decode_tps_median: Option<f64>,
pub total_s_median: Option<f64>,
pub samples: usize,
}
struct SeriesRaw {
git_sha: String,
build_timestamp: Option<String>,
package_version: Option<String>,
ttft_s: Option<f64>,
decode_tps: Option<f64>,
total_s: Option<f64>,
ts: String,
}
/// Group id-ordered rows by build SHA, median each metric, and order the
/// resulting points chronologically by build (timestamp, else first ts).
fn aggregate_series(raws: Vec<SeriesRaw>) -> Vec<SeriesPoint> {
use std::collections::BTreeMap;
// Preserve first-seen order per sha for the chronological sort key.
let mut order: Vec<String> = Vec::new();
let mut groups: BTreeMap<String, Vec<SeriesRaw>> = BTreeMap::new();
for r in raws {
if !groups.contains_key(&r.git_sha) {
order.push(r.git_sha.clone());
}
groups.entry(r.git_sha.clone()).or_default().push(r);
}
let mut points: Vec<(String, SeriesPoint)> = order
.into_iter()
.map(|sha| {
let rows = &groups[&sha];
let sort_key = rows
.iter()
.map(|r| r.build_timestamp.clone().unwrap_or_else(|| r.ts.clone()))
.min()
.unwrap_or_default();
let point = SeriesPoint {
git_sha: sha,
build_timestamp: rows.iter().find_map(|r| r.build_timestamp.clone()),
package_version: rows.iter().find_map(|r| r.package_version.clone()),
ttft_s_median: median(rows.iter().filter_map(|r| r.ttft_s)),
decode_tps_median: median(rows.iter().filter_map(|r| r.decode_tps)),
total_s_median: median(rows.iter().filter_map(|r| r.total_s)),
samples: rows.len(),
};
(sort_key, point)
})
.collect();
points.sort_by(|a, b| a.0.cmp(&b.0));
points.into_iter().map(|(_, p)| p).collect()
}
#[derive(Debug, Clone, Default)]
pub struct RunFilter {
pub host: Option<String>,
pub model: Option<String>,
pub scenario: Option<String>,
pub sha: Option<String>,
pub ok: Option<bool>,
pub limit: Option<u32>,
}
#[derive(Debug, Clone, serde::Serialize)]
pub struct RunRow {
pub id: i64,
pub ts: String,
pub host: String,
/// Public-facing resource name (the host's GPU(s)), e.g. "RTX 4090".
pub gpu: Option<String>,
pub hostname: Option<String>,
pub git_sha: String,
pub build_timestamp: Option<String>,
pub package_version: String,
pub model_id: String,
pub harness: String,
pub scenario_id: String,
pub prompt_size_approx: u32,
pub prompt_tokens_actual: Option<u64>,
pub max_tokens: u64,
pub ttft_s: Option<f64>,
pub decode_tps: Option<f64>,
pub total_s: Option<f64>,
pub completion_tokens: Option<u64>,
pub ok: bool,
pub error: Option<String>,
}
struct RawRow {
target_name: String,
model_id: String,
scenario_id: String,
prompt_size_approx: u32,
git_sha: String,
ttft_s: Option<f64>,
decode_tps: Option<f64>,
total_s: Option<f64>,
prompt_tokens_actual: Option<u64>,
gpus_json: Option<String>,
}
/// An aggregated cell ready for the report table.
#[derive(Debug, Clone, PartialEq, serde::Serialize)]
pub struct ReportRow {
pub target_name: String,
pub model_id: String,
pub scenario_id: String,
pub prompt_size_approx: u32,
pub git_sha: String,
pub prompt_tokens: Option<u64>,
pub ttft_s_median: Option<f64>,
pub decode_tps_median: Option<f64>,
pub total_s_median: Option<f64>,
pub samples: usize,
/// Public-facing resource name (the host's GPU(s)), e.g. "2× RTX 5090".
pub gpu: Option<String>,
}
/// Group by (target, model, scenario), keep only the latest SHA's rows
/// (latest = the SHA of the last-inserted row, since input is id-ordered),
/// and median each metric.
fn aggregate(raws: Vec<RawRow>) -> Vec<ReportRow> {
use std::collections::BTreeMap;
// key -> every row for that cell, in id order; narrowed to the latest SHA below.
let mut groups: BTreeMap<(String, String, String), Vec<RawRow>> = BTreeMap::new();
for r in raws {
groups
.entry((
r.target_name.clone(),
r.model_id.clone(),
r.scenario_id.clone(),
))
.or_default()
.push(r);
}
let mut out = Vec::new();
for ((target_name, model_id, scenario_id), rows) in groups {
// id-ordered, so the last row carries the latest SHA.
let latest_sha = rows.last().map(|r| r.git_sha.clone()).unwrap_or_default();
let cell: Vec<&RawRow> = rows.iter().filter(|r| r.git_sha == latest_sha).collect();
let prompt_size_approx = cell.first().map(|r| r.prompt_size_approx).unwrap_or(0);
out.push(ReportRow {
target_name,
model_id,
scenario_id,
prompt_size_approx,
git_sha: latest_sha,
prompt_tokens: cell.iter().find_map(|r| r.prompt_tokens_actual),
ttft_s_median: median(cell.iter().filter_map(|r| r.ttft_s)),
decode_tps_median: median(cell.iter().filter_map(|r| r.decode_tps)),
total_s_median: median(cell.iter().filter_map(|r| r.total_s)),
samples: cell.len(),
gpu: cell
.iter()
.find_map(|r| r.gpus_json.as_deref().and_then(gpu_label)),
});
}
out
}
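The "latest SHA per cell" rule in `aggregate` is worth seeing on concrete data. The sketch below applies the same rule to simplified rows: the cell key is collapsed into one string and only TTFT is kept. `Row` and `latest_sha_rows` are hypothetical names for this illustration.

```rust
use std::collections::BTreeMap;

/// (cell key, build sha, ttft seconds), supplied in insertion (id) order.
type Row = (&'static str, &'static str, f64);

/// For each cell, keeps only the samples recorded against the SHA of the
/// cell's last-inserted row.
fn latest_sha_rows(rows: &[Row]) -> BTreeMap<&'static str, (&'static str, Vec<f64>)> {
    let mut groups: BTreeMap<&'static str, Vec<&Row>> = BTreeMap::new();
    for r in rows {
        groups.entry(r.0).or_default().push(r);
    }
    groups
        .into_iter()
        .map(|(cell, rs)| {
            // Input is id-ordered, so the last row carries the latest SHA.
            let latest = rs.last().map(|r| r.1).unwrap_or_default();
            let vals = rs.iter().filter(|r| r.1 == latest).map(|r| r.2).collect();
            (cell, (latest, vals))
        })
        .collect()
}
```

One consequence of keying on the last-inserted row: if an older build is redeployed and measured again, it becomes "latest" for that cell, and its earlier samples are pooled with the new ones.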
/// Compact GPU label from a run's stored `gpus_json` (the discovery device
/// list) — e.g. "2× RTX 5090", "RTX 4090". `None` when empty/absent. Used
/// as the public-facing resource name in place of internal hostnames.
fn gpu_label(gpus_json: &str) -> Option<String> {
let devices: Vec<serde_json::Value> = serde_json::from_str(gpus_json).ok()?;
if devices.is_empty() {
return None;
}
let mut order: Vec<String> = Vec::new();
let mut counts: std::collections::HashMap<String, usize> = std::collections::HashMap::new();
for d in &devices {
let name = d.get("name").and_then(|v| v.as_str()).unwrap_or("GPU");
let short = name
.trim_start_matches("NVIDIA GeForce ")
.trim_start_matches("NVIDIA ")
.to_string();
if !counts.contains_key(&short) {
order.push(short.clone());
}
*counts.entry(short).or_insert(0) += 1;
}
Some(
order
.iter()
.map(|n| {
let c = counts[n];
if c > 1 {
format!("{c}× {n}")
} else {
n.clone()
}
})
.collect::<Vec<_>>()
.join(" + "),
)
}
fn median(values: impl Iterator<Item = f64>) -> Option<f64> {
let mut v: Vec<f64> = values.collect();
if v.is_empty() {
return None;
}
v.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
// lo == hi for odd lengths (the middle element); they straddle the
// centre for even lengths. Avoids a `% 2` branch.
let lo = (v.len() - 1) / 2;
let hi = v.len() / 2;
Some((v[lo] + v[hi]) / 2.0)
}
#[cfg(test)]
mod tests {
use super::*;
fn rec(target: &str, sha: &str, model: &str, scenario: &str, ok: bool) -> RunRecord {
RunRecord {
ts: "2026-06-13T00:00:00Z".into(),
target_name: target.into(),
target_kind: "neuron".into(),
endpoint: "http://x:13131".into(),
hostname: Some("x".into()),
driver_version: None,
cuda_version: None,
gpus_json: None,
git_sha: sha.into(),
git_sha_long: None,
package_version: "0.1.16".into(),
git_dirty: false,
build_timestamp: None,
rustc_version: None,
profile: None,
features_json: "[]".into(),
candle_version: None,
bench_version: "0.1.16".into(),
bench_sha: "deadbee".into(),
model_id: model.into(),
harness: "candle".into(),
capabilities_json: "[]".into(),
devices_json: "[]".into(),
scenario_id: scenario.into(),
prompt_size_approx: 128,
prompt_tokens_actual: Some(130),
max_tokens: 256,
ttft_s: Some(0.1),
decode_tps: Some(50.0),
total_s: Some(1.0),
completion_tokens: Some(50),
ok,
error: if ok { None } else { Some("boom".into()) },
}
}
#[test]
fn counts_only_successful_samples() {
let s = Store::open_in_memory().unwrap();
s.insert_run(&rec("beast", "abc", "m", "chat:128", true))
.unwrap();
s.insert_run(&rec("beast", "abc", "m", "chat:128", true))
.unwrap();
s.insert_run(&rec("beast", "abc", "m", "chat:128", false))
.unwrap();
assert_eq!(s.count_samples("beast", "abc", "m", "chat:128").unwrap(), 2);
// Different SHA is a different cell.
assert_eq!(s.count_samples("beast", "xyz", "m", "chat:128").unwrap(), 0);
}
#[test]
fn report_uses_latest_sha_per_cell() {
let s = Store::open_in_memory().unwrap();
// old build
s.insert_run(&rec("beast", "old", "m", "chat:128", true))
.unwrap();
// new build, two samples
let mut r = rec("beast", "new", "m", "chat:128", true);
r.ttft_s = Some(0.2);
s.insert_run(&r).unwrap();
r.ttft_s = Some(0.4);
s.insert_run(&r).unwrap();
let rows = s.report_rows().unwrap();
assert_eq!(rows.len(), 1);
assert_eq!(rows[0].git_sha, "new");
assert_eq!(rows[0].samples, 2);
assert!((rows[0].ttft_s_median.unwrap() - 0.3).abs() < 1e-9);
}
#[test]
fn gpu_label_formats() {
let two = r#"[{"name":"NVIDIA GeForce RTX 5090"},{"name":"NVIDIA GeForce RTX 5090"}]"#;
assert_eq!(gpu_label(two).as_deref(), Some("2× RTX 5090"));
let one = r#"[{"name":"NVIDIA GeForce RTX 4090"}]"#;
assert_eq!(gpu_label(one).as_deref(), Some("RTX 4090"));
let dc = r#"[{"name":"NVIDIA H100"}]"#;
assert_eq!(gpu_label(dc).as_deref(), Some("H100"));
assert_eq!(gpu_label("[]"), None);
}
}
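The aggregation above (group rows per cell, keep only the rows from the latest SHA, then take a median) can be sketched in isolation using only the standard library. This is a minimal illustration, not the store's actual code: `Sample` and `latest_median` are hypothetical names, the cell key is collapsed to a single string, and only one metric is carried.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down row: the real `RawRow` carries many more fields.
#[derive(Debug, Clone)]
pub struct Sample {
    pub cell: String,
    pub git_sha: String,
    pub ttft_s: f64,
}

/// Median without a parity branch: `lo == hi` for odd lengths, and the two
/// indices straddle the centre for even lengths.
pub fn median(mut v: Vec<f64>) -> Option<f64> {
    if v.is_empty() {
        return None;
    }
    v.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
    let lo = (v.len() - 1) / 2;
    let hi = v.len() / 2;
    Some((v[lo] + v[hi]) / 2.0)
}

/// For each cell, keep only the rows whose SHA equals the SHA of the cell's
/// last row (input is assumed to be in insertion order), then report
/// (latest SHA, sample count, median TTFT).
pub fn latest_median(samples: &[Sample]) -> BTreeMap<String, (String, usize, Option<f64>)> {
    let mut groups: BTreeMap<String, Vec<&Sample>> = BTreeMap::new();
    for s in samples {
        groups.entry(s.cell.clone()).or_default().push(s);
    }
    groups
        .into_iter()
        .map(|(cell, rows)| {
            let latest = rows.last().map(|r| r.git_sha.clone()).unwrap_or_default();
            let kept: Vec<f64> = rows
                .iter()
                .filter(|r| r.git_sha == latest)
                .map(|r| r.ttft_s)
                .collect();
            let n = kept.len();
            (cell, (latest, n, median(kept)))
        })
        .collect()
}
```

Feeding it one "old" row followed by two "new" rows for the same cell yields a single entry for "new" with two samples, mirroring what `report_uses_latest_sha_per_cell` asserts against the real store.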


@@ -0,0 +1,250 @@
//! The version-aware sweep loop.
//!
//! Each sweep visits every configured target, polls its build identity
//! and warm models, and tops up benchmark samples per
//! (target, build SHA, model, scenario) to `samples_per_version`. Cells
//! already at target are skipped — so once every neuron's current build
//! is fully sampled, sweeps cost only the cheap metadata polls until a
//! new SHA ships. Runs are recorded to SQLite with full provenance.
use crate::client::TargetClient;
use crate::config::{BenchConfig, TargetConfig, TargetKind};
use crate::scenario::{RunCtx, build_scenarios};
use crate::store::{RunRecord, Store};
use anyhow::Result;
use cortex_core::build_info::BuildInfo;
use cortex_core::discovery::DiscoveryResponse;
use cortex_core::harness::ModelInfo;
/// helexa-bench's own build version.
fn bench_version() -> String {
env!("CARGO_PKG_VERSION").to_string()
}
/// helexa-bench's own build SHA, injected by CI via `HELEXA_BUILD_SHA`
/// at compile time; `"unknown"` for ad-hoc local builds.
fn bench_sha() -> String {
option_env!("HELEXA_BUILD_SHA")
.filter(|s| !s.is_empty())
.unwrap_or("unknown")
.to_string()
}
#[derive(Debug, Default, Clone)]
pub struct SweepSummary {
pub measured: usize,
pub skipped: usize,
pub failed: usize,
pub targets_unreachable: usize,
}
pub struct Sweeper {
cfg: BenchConfig,
client: TargetClient,
store: Store,
}
impl Sweeper {
pub fn new(cfg: BenchConfig) -> Result<Self> {
let client = TargetClient::new(cfg.bench.request_timeout())?;
let store = Store::open(&cfg.bench.db_path)?;
Ok(Sweeper { cfg, client, store })
}
/// Run sweeps forever, pausing `sweep_interval` between them.
pub async fn run_forever(&self) -> ! {
loop {
match self.run_once().await {
Ok(s) => tracing::info!(
measured = s.measured,
skipped = s.skipped,
failed = s.failed,
unreachable = s.targets_unreachable,
"sweep complete"
),
Err(e) => tracing::error!(error = %format!("{e:#}"), "sweep errored"),
}
tracing::debug!(
secs = self.cfg.bench.sweep_interval_secs,
"sleeping until next sweep"
);
tokio::time::sleep(self.cfg.bench.sweep_interval()).await;
}
}
/// One full pass over all targets.
pub async fn run_once(&self) -> Result<SweepSummary> {
let mut summary = SweepSummary::default();
for target in &self.cfg.targets {
if let Err(e) = self.sweep_target(target, &mut summary).await {
summary.targets_unreachable += 1;
tracing::warn!(target = %target.name, error = %format!("{e:#}"), "target skipped");
}
}
Ok(summary)
}
async fn sweep_target(&self, target: &TargetConfig, summary: &mut SweepSummary) -> Result<()> {
let build = self.client.fetch_version(target).await?;
let discovery = self.client.fetch_discovery(target).await.unwrap_or(None);
let models = self.client.warm_models(target).await?;
tracing::info!(
target = %target.name,
sha = %build.git_sha,
warm_models = models.len(),
"sweeping target"
);
let scenarios = build_scenarios(&self.cfg.scenarios);
for model in &models {
for scenario in scenarios.iter().filter(|s| s.applies_to(model)) {
let have = self.store.count_samples(
&target.name,
&build.git_sha,
&model.id,
scenario.id(),
)?;
let need = self.cfg.bench.samples_per_version.saturating_sub(have);
if need == 0 {
summary.skipped += 1;
tracing::debug!(
target = %target.name, model = %model.id, scenario = scenario.id(),
sha = %build.git_sha, "cell already satisfied, skipping"
);
continue;
}
let ctx = RunCtx {
client: self.client.http(),
chat_url: self.client.chat_url(target),
model_id: model.id.clone(),
max_tokens: self.cfg.scenarios.max_tokens,
timeout: self.cfg.bench.request_timeout(),
};
// One unmeasured warmup when the cell is empty (matches
// bench.py — first run after a load hits cold caches).
if have == 0 {
tracing::debug!(model = %model.id, scenario = scenario.id(), "warmup run");
let _ = scenario.run(&ctx).await;
}
for i in 0..need {
match scenario.run(&ctx).await {
Ok(m) => {
let rec = self.build_record(
target,
&build,
discovery.as_ref(),
model,
scenario.id(),
scenario.prompt_size(),
Ok(&m),
);
self.store.insert_run(&rec)?;
summary.measured += 1;
tracing::info!(
target = %target.name, model = %model.id, scenario = scenario.id(),
ttft_s = m.ttft_s, decode_tps = ?m.decode_tps, total_s = m.total_s,
"{}/{} recorded", have + i + 1, self.cfg.bench.samples_per_version
);
}
Err(e) => {
let msg = format!("{e:#}");
let rec = self.build_record(
target,
&build,
discovery.as_ref(),
model,
scenario.id(),
scenario.prompt_size(),
Err(&msg),
);
self.store.insert_run(&rec)?;
summary.failed += 1;
tracing::warn!(
target = %target.name, model = %model.id, scenario = scenario.id(),
error = %msg, "iteration failed"
);
}
}
tokio::time::sleep(self.cfg.bench.iteration_pause()).await;
}
}
}
Ok(())
}
#[allow(clippy::too_many_arguments)]
fn build_record(
&self,
target: &TargetConfig,
build: &BuildInfo,
discovery: Option<&DiscoveryResponse>,
model: &ModelInfo,
scenario_id: &str,
prompt_size: u32,
result: Result<&crate::scenario::ScenarioMetrics, &str>,
) -> RunRecord {
let (ok, error, ttft, decode, total, prompt_tokens, completion) = match result {
Ok(m) => (
true,
None,
Some(m.ttft_s),
m.decode_tps,
Some(m.total_s),
m.prompt_tokens,
Some(m.completion_tokens),
),
Err(e) => (false, Some(e.to_string()), None, None, None, None, None),
};
RunRecord {
ts: chrono::Utc::now().to_rfc3339(),
target_name: target.name.clone(),
target_kind: kind_str(target.kind).to_string(),
endpoint: target.endpoint.clone(),
hostname: discovery.map(|d| d.hostname.clone()),
driver_version: discovery.and_then(|d| d.driver_version.clone()),
cuda_version: discovery.and_then(|d| d.cuda_version.clone()),
gpus_json: discovery
.map(|d| serde_json::to_string(&d.devices).unwrap_or_else(|_| "[]".to_string())),
git_sha: build.git_sha.clone(),
git_sha_long: build.git_sha_long.clone(),
package_version: build.package_version.clone(),
git_dirty: build.git_dirty,
build_timestamp: build.build_timestamp.clone(),
rustc_version: build.rustc_version.clone(),
profile: build.profile.clone(),
features_json: serde_json::to_string(&build.features)
.unwrap_or_else(|_| "[]".to_string()),
candle_version: build.candle_version.clone(),
bench_version: bench_version(),
bench_sha: bench_sha(),
model_id: model.id.clone(),
harness: model.harness.clone(),
capabilities_json: serde_json::to_string(&model.capabilities)
.unwrap_or_else(|_| "[]".to_string()),
devices_json: serde_json::to_string(&model.devices)
.unwrap_or_else(|_| "[]".to_string()),
scenario_id: scenario_id.to_string(),
prompt_size_approx: prompt_size,
prompt_tokens_actual: prompt_tokens,
max_tokens: self.cfg.scenarios.max_tokens,
ttft_s: ttft,
decode_tps: decode,
total_s: total,
completion_tokens: completion,
ok,
error,
}
}
}
fn kind_str(kind: TargetKind) -> &'static str {
match kind {
TargetKind::Neuron => "neuron",
TargetKind::Openai => "openai",
}
}
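The per-cell decision inside `sweep_target` (skip when satisfied, warm up once when empty, otherwise top up) reduces to a small pure function. The sketch below is an illustration under assumptions: `CellPlan` and `plan_cell` are hypothetical names that do not exist in the crate, and `have` is taken to be the count of successful samples already stored for the cell.

```rust
/// What a sweep should do for one (target, SHA, model, scenario) cell.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub struct CellPlan {
    /// Run one unmeasured warmup first; only when the cell has no samples yet
    /// and at least one measured run is still needed.
    pub warmup: bool,
    /// Number of measured runs still needed to reach the target.
    pub need: usize,
}

/// `saturating_sub` keeps `need` at zero when the cell is over-sampled,
/// so a satisfied cell is skipped rather than underflowing.
pub fn plan_cell(have: usize, samples_per_version: usize) -> CellPlan {
    let need = samples_per_version.saturating_sub(have);
    CellPlan {
        warmup: have == 0 && need > 0,
        need,
    }
}
```

With `samples_per_version = 2`, an empty cell plans a warmup plus two measured runs, a half-filled cell plans one run with no warmup, and a full cell plans nothing. Because a new build SHA starts a new cell with `have == 0`, sampling resumes automatically after a deploy.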


@@ -0,0 +1,219 @@
//! Read-API tests: seed a temp store, serve the router, assert JSON.
use helexa_bench::api;
use helexa_bench::store::{RunRecord, Store};
use serde_json::Value;
#[allow(clippy::too_many_arguments)]
fn rec(
host: &str,
sha: &str,
build_ts: Option<&str>,
model: &str,
scenario: &str,
ttft: f64,
ok: bool,
) -> RunRecord {
RunRecord {
ts: "2026-06-13T00:00:00Z".into(),
target_name: host.into(),
target_kind: "neuron".into(),
endpoint: format!("http://{host}:13131"),
hostname: Some(host.into()),
driver_version: Some("580.159".into()),
cuda_version: Some("13.0".into()),
gpus_json: Some("[]".into()),
git_sha: sha.into(),
git_sha_long: None,
package_version: "0.1.16".into(),
git_dirty: false,
build_timestamp: build_ts.map(|s| s.to_string()),
rustc_version: None,
profile: Some("release".into()),
features_json: "[\"cuda\"]".into(),
candle_version: Some("0.10.2".into()),
bench_version: "0.1.16".into(),
bench_sha: "deadbee".into(),
model_id: model.into(),
harness: "candle".into(),
capabilities_json: "[\"text\"]".into(),
devices_json: "[0]".into(),
scenario_id: scenario.into(),
prompt_size_approx: 128,
prompt_tokens_actual: Some(130),
max_tokens: 64,
ttft_s: if ok { Some(ttft) } else { None },
decode_tps: if ok { Some(30.0) } else { None },
total_s: if ok { Some(2.0) } else { None },
completion_tokens: if ok { Some(60) } else { None },
ok,
error: if ok { None } else { Some("boom".into()) },
}
}
/// Seed a temp db, return its path.
fn seed(tag: &str) -> String {
let path = std::env::temp_dir().join(format!("hb-api-{}-{tag}.sqlite", std::process::id()));
let _ = std::fs::remove_file(&path);
let p = path.to_string_lossy().to_string();
let store = Store::open(&p).unwrap();
// beast / m / chat:128 across two builds (old then new).
store
.insert_run(&rec(
"beast",
"old",
Some("2026-06-01T00:00:00Z"),
"m",
"chat:128",
0.20,
true,
))
.unwrap();
store
.insert_run(&rec(
"beast",
"new",
Some("2026-06-10T00:00:00Z"),
"m",
"chat:128",
0.10,
true,
))
.unwrap();
store
.insert_run(&rec(
"beast",
"new",
Some("2026-06-10T00:00:00Z"),
"m",
"chat:128",
0.12,
true,
))
.unwrap();
// a failed row (must not count in series/summary medians)
store
.insert_run(&rec(
"beast",
"new",
Some("2026-06-10T00:00:00Z"),
"m",
"chat:128",
0.0,
false,
))
.unwrap();
// a different host for the runs filter
store
.insert_run(&rec(
"benjy",
"new",
Some("2026-06-10T00:00:00Z"),
"n",
"chat:128",
0.15,
true,
))
.unwrap();
p
}
async fn spawn(db: &str) -> String {
let state = api::open_state(db).unwrap();
let app = api::api_routes(state);
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
format!("http://{addr}")
}
async fn get(base: &str, path: &str) -> Value {
reqwest::get(format!("{base}{path}"))
.await
.unwrap()
.json()
.await
.unwrap()
}
#[tokio::test]
async fn health_reports_run_count() {
let base = spawn(&seed("health")).await;
let v = get(&base, "/api/health").await;
assert_eq!(v["status"], "ok");
assert_eq!(v["run_count"], 5);
}
#[tokio::test]
async fn dimensions_lists_distinct_values_and_builds_chronologically() {
let base = spawn(&seed("dims")).await;
let v = get(&base, "/api/dimensions").await;
let hosts: Vec<&str> = v["hosts"]
.as_array()
.unwrap()
.iter()
.map(|x| x.as_str().unwrap())
.collect();
assert_eq!(hosts, vec!["beast", "benjy"]);
assert_eq!(v["models"].as_array().unwrap().len(), 2);
// builds ordered by earliest build_timestamp: old before new
let builds = v["builds"].as_array().unwrap();
assert_eq!(builds[0]["git_sha"], "old");
assert_eq!(builds[1]["git_sha"], "new");
}
#[tokio::test]
async fn summary_uses_latest_sha_and_ignores_failures() {
let base = spawn(&seed("summary")).await;
let v = get(&base, "/api/summary").await;
let rows = v.as_array().unwrap();
let beast = rows
.iter()
.find(|r| r["target_name"] == "beast" && r["scenario_id"] == "chat:128")
.unwrap();
assert_eq!(beast["git_sha"], "new");
assert_eq!(beast["samples"], 2); // two ok rows on "new"; failure excluded
// median of 0.10 and 0.12
assert!((beast["ttft_s_median"].as_f64().unwrap() - 0.11).abs() < 1e-9);
}
#[tokio::test]
async fn series_is_chronological_per_build() {
let base = spawn(&seed("series")).await;
let v = get(&base, "/api/series?host=beast&model=m&scenario=chat:128").await;
let pts = v.as_array().unwrap();
assert_eq!(pts.len(), 2);
assert_eq!(pts[0]["git_sha"], "old");
assert_eq!(pts[1]["git_sha"], "new");
assert_eq!(pts[0]["samples"], 1);
assert_eq!(pts[1]["samples"], 2);
}
#[tokio::test]
async fn series_resolves_host_when_omitted() {
// The public UI selects by model alone; the store resolves the host.
let base = spawn(&seed("series-nohost")).await;
let v = get(&base, "/api/series?model=m&scenario=chat:128").await;
let pts = v.as_array().unwrap();
assert_eq!(pts.len(), 2);
assert_eq!(pts[0]["git_sha"], "old");
assert_eq!(pts[1]["git_sha"], "new");
}
#[tokio::test]
async fn runs_filters_by_host() {
let base = spawn(&seed("runs")).await;
let all = get(&base, "/api/runs").await;
assert_eq!(all.as_array().unwrap().len(), 5);
let beast = get(&base, "/api/runs?host=beast").await;
let rows = beast.as_array().unwrap();
assert_eq!(rows.len(), 4);
assert!(rows.iter().all(|r| r["host"] == "beast"));
// failed row carries its error + ok=false
assert!(
rows.iter()
.any(|r| r["ok"] == false && r["error"] == "boom")
);
}


@@ -0,0 +1,133 @@
//! End-to-end sweep against a mock neuron: a sweep records samples, a
//! second sweep skips the satisfied cell, and bumping the reported build
//! SHA resumes fresh sampling.
use axum::Router;
use axum::extract::State;
use axum::http::header;
use axum::response::{IntoResponse, Json};
use axum::routing::{get, post};
use helexa_bench::config::{BenchConfig, BenchSettings, ScenarioConfig, TargetConfig, TargetKind};
use helexa_bench::sweep::Sweeper;
use serde_json::json;
use std::sync::{Arc, Mutex};
#[derive(Clone)]
struct MockState {
sha: Arc<Mutex<String>>,
}
async fn version(State(s): State<MockState>) -> Json<serde_json::Value> {
let sha = s.sha.lock().unwrap().clone();
Json(json!({
"package_version": "0.1.16",
"git_sha": sha,
"git_dirty": false,
"features": ["cuda", "cudnn"],
"candle_version": "0.10.2",
}))
}
async fn discovery() -> Json<serde_json::Value> {
Json(json!({
"hostname": "mock-beast",
"os": "Linux",
"kernel": "6.19.0",
"cuda_version": "13.0",
"driver_version": "580.159",
"devices": [{"index": 0, "name": "RTX 5090", "vram_total_mb": 32614, "compute_capability": "12.0"}],
"harnesses": ["candle"],
}))
}
async fn models() -> Json<serde_json::Value> {
Json(json!([
{"id": "Qwen/Qwen3.6-27B", "harness": "candle", "status": "loaded", "devices": [0], "capabilities": ["text"]},
// A non-warm model the bench must ignore.
{"id": "Qwen/cold", "harness": "candle", "status": "recovering", "devices": [0]},
]))
}
async fn chat() -> impl IntoResponse {
let body = concat!(
"data: {\"choices\":[{\"index\":0,\"delta\":{\"content\":\"Hello\"},\"finish_reason\":null}]}\n\n",
"data: {\"choices\":[{\"index\":0,\"delta\":{\"content\":\" world\"},\"finish_reason\":null}]}\n\n",
"data: {\"choices\":[{\"index\":0,\"delta\":{},\"finish_reason\":\"stop\"}],\"usage\":{\"prompt_tokens\":130,\"completion_tokens\":2,\"total_tokens\":132}}\n\n",
"data: [DONE]\n\n",
);
([(header::CONTENT_TYPE, "text/event-stream")], body)
}
async fn spawn_mock(sha: &str) -> (String, Arc<Mutex<String>>) {
let shared = Arc::new(Mutex::new(sha.to_string()));
let state = MockState {
sha: shared.clone(),
};
let app = Router::new()
.route("/version", get(version))
.route("/discovery", get(discovery))
.route("/models", get(models))
.route("/v1/chat/completions", post(chat))
.with_state(state);
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(format!("http://{addr}"), shared)
}
fn config_for(endpoint: String, db_path: String) -> BenchConfig {
BenchConfig {
bench: BenchSettings {
sweep_interval_secs: 1,
samples_per_version: 2,
iteration_pause_secs: 0,
request_timeout_secs: 30,
db_path,
},
scenarios: ScenarioConfig {
prompt_sizes: vec![128], // single scenario keeps assertions simple
max_tokens: 16,
},
api: Default::default(),
targets: vec![TargetConfig {
name: "mock".into(),
kind: TargetKind::Neuron,
endpoint,
label: None,
}],
}
}
#[tokio::test]
async fn sweep_records_skips_and_resumes_on_new_sha() {
let (endpoint, sha_handle) = spawn_mock("aaaaaaa").await;
// Unique db path per run (bound port is unique).
let port = endpoint.rsplit(':').next().unwrap();
let db_path = std::env::temp_dir().join(format!("helexa-bench-it-{port}.sqlite"));
let _ = std::fs::remove_file(&db_path);
let db_str = db_path.to_string_lossy().to_string();
let sweeper = Sweeper::new(config_for(endpoint, db_str)).unwrap();
// First sweep: one warm model × one scenario × 2 samples.
let s1 = sweeper.run_once().await.unwrap();
assert_eq!(s1.measured, 2, "should record samples_per_version samples");
assert_eq!(s1.skipped, 0);
assert_eq!(s1.failed, 0);
// Second sweep at same SHA: cell satisfied, nothing measured.
let s2 = sweeper.run_once().await.unwrap();
assert_eq!(s2.measured, 0, "satisfied cell must be skipped");
assert_eq!(s2.skipped, 1);
// Bump the reported build SHA: a new cell → fresh sampling resumes.
*sha_handle.lock().unwrap() = "bbbbbbb".to_string();
let s3 = sweeper.run_once().await.unwrap();
assert_eq!(s3.measured, 2, "new SHA must resume sampling");
assert_eq!(s3.skipped, 0);
let _ = std::fs::remove_file(&db_path);
}


@@ -60,6 +60,11 @@ tokio-stream.workspace = true
figment.workspace = true
toml.workspace = true
# Parallel in-situ quantization (#1): fans candle's per-block k-quant
# math across the CPU pool at model-load time. Already in the tree
# transitively via candle-core.
rayon = "1"
# candle for in-process inference. CUDA support is gated behind the
# crate's `cuda` feature (default off) so the workspace builds on
# non-CUDA hosts and CI runners.
@@ -76,20 +81,31 @@ cudarc = { version = "0.19", optional = true, default-features = false, features
half = { version = "2.5", optional = true }
tokenizers = { version = "0.22", default-features = false, features = ["onig"] }
hf-hub = { version = "0.4", features = ["tokio"] }
# Jinja-compatible template renderer for the model's
# `tokenizer_config.json::chat_template`. Hugging Face's chat
# templates use a strict subset of Jinja2 that minijinja supports
# out of the box. ~80KB compiled; pure Rust, no async surface.
# Features: `builtins` for the `is defined` / `default` filters HF
# templates use; `json` for `tojson` (some Qwen3 templates emit
# tool definitions via tojson); `serde` so we can hand it a
# serde_json::Value as the context.
# Jinja-compatible template renderer for the model's chat template
# (standalone `chat_template.jinja` or `tokenizer_config.json::chat_template`).
# Hugging Face's chat templates lean on Python string semantics; we
# bridge them with `minijinja-contrib`'s `pycompat` callback (str
# methods like `startswith`/`split`/`strip`) plus a `raise_exception`
# global. Features: `builtins` for `is defined` / `default`; `json`
# for `tojson`; `serde` so we can hand it a serde_json::Value context.
minijinja = { version = "2", features = ["builtins", "json", "serde"] }
# Python-compatibility shim: the Qwen3-VL / Qwen3.6 template uses
# `content.startswith(...)`, `.endswith(...)`, `.split(...)`,
# `.rstrip(...)`, `.lstrip(...)` — Python str methods minijinja doesn't
# implement natively. `pycompat::unknown_method_callback` supplies them.
minijinja-contrib = { version = "2", features = ["pycompat"] }
# Direct dep on `safetensors` (re-exported by candle but its `TensorView`
# / `slice::IndexOp` types are public-but-not-re-exported). Used by the
# tp `fused_load` module to read per-rank slices of fused QKV tensors
# without materialising the full tensor on device.
safetensors = "0.7"
# Vision capability for Qwen3.6 (Stage A of the vision plan in
# doc/vision-qwen3_6-spec.md). `image` decodes PNG/JPEG/etc from
# the bytes embedded in `data:image/...;base64,...` content parts;
# `base64` does the URI decode. Default-features off on `image` to
# avoid pulling in audio/video formats we don't need.
image = { version = "0.25", default-features = false, features = ["png", "jpeg", "webp", "bmp", "gif"] }
base64 = "0.22"
[dev-dependencies]
tokio = { workspace = true, features = ["test-util"] }


@@ -1,10 +1,16 @@
//! Build script: compile the CUDA kernels in `src/cuda/*.cu` into a
//! static library and link it under the `cuda` feature.
//! Build script: capture build/version metadata for `GET /version`,
//! and (under the `cuda` feature) compile the CUDA kernels in
//! `src/cuda/*.cu` into a static library and link it.
//!
//! Patterned on `EricLBuehler/mistral.rs::mistralrs-core/build.rs` —
//! same `cudaforge::KernelBuilder` invocation, same NVCC flag set.
//! The CUDA portion is patterned on
//! `EricLBuehler/mistral.rs::mistralrs-core/build.rs` — same
//! `cudaforge::KernelBuilder` invocation, same NVCC flag set.
use std::process::Command;
fn main() {
emit_build_metadata();
#[cfg(feature = "cuda")]
{
use std::path::PathBuf;
@@ -64,3 +70,127 @@ fn main() {
}
}
}
/// Emit `cargo:rustc-env=` vars consumed by `env!()` in `src/version.rs`
/// so the daemon can report its own build identity from `GET /version`.
///
/// We re-run only when HEAD moves or the SHA override changes — not on
/// every compile — so the captured timestamp is stable for a given
/// build input rather than churning on each `cargo build`.
fn emit_build_metadata() {
println!("cargo:rerun-if-env-changed=HELEXA_BUILD_SHA");
println!("cargo:rerun-if-changed=.git/HEAD");
// On a normal (attached) HEAD, .git/HEAD names a ref, and it is that ref's
// file that actually changes on commit; watch the packed-refs fallback too.
println!("cargo:rerun-if-changed=.git/packed-refs");
// SHA: prefer the CI/RPM-injected override (tarball builds have no
// .git), then fall back to git, then to "unknown".
let (sha_short, sha_long, dirty) = match std::env::var("HELEXA_BUILD_SHA") {
Ok(s) if !s.trim().is_empty() => {
let s = s.trim().to_string();
let short = s.chars().take(7).collect::<String>();
(short, Some(s), false)
}
_ => {
let long = git(&["rev-parse", "HEAD"]);
let short = git(&["rev-parse", "--short", "HEAD"]);
let dirty = git(&["status", "--porcelain"])
.map(|s| !s.trim().is_empty())
.unwrap_or(false);
match short {
Some(short) => (short, long, dirty),
None => ("unknown".to_string(), None, false),
}
}
};
println!("cargo:rustc-env=HELEXA_GIT_SHA={sha_short}");
println!(
"cargo:rustc-env=HELEXA_GIT_SHA_LONG={}",
sha_long.unwrap_or_default()
);
println!("cargo:rustc-env=HELEXA_GIT_DIRTY={dirty}");
// RFC3339 build timestamp. `date` is universally present on the
// Linux hosts that neuron targets; the value is empty if it ever isn't.
let ts = Command::new("date")
.args(["-u", "+%Y-%m-%dT%H:%M:%SZ"])
.output()
.ok()
.filter(|o| o.status.success())
.map(|o| String::from_utf8_lossy(&o.stdout).trim().to_string())
.unwrap_or_default();
println!("cargo:rustc-env=HELEXA_BUILD_TIMESTAMP={ts}");
// Compiler version: cargo sets $RUSTC to the rustc it invokes.
let rustc = std::env::var("RUSTC").unwrap_or_else(|_| "rustc".to_string());
let rustc_version = Command::new(rustc)
.arg("--version")
.output()
.ok()
.filter(|o| o.status.success())
.map(|o| String::from_utf8_lossy(&o.stdout).trim().to_string())
.unwrap_or_default();
println!("cargo:rustc-env=HELEXA_RUSTC_VERSION={rustc_version}");
println!(
"cargo:rustc-env=HELEXA_BUILD_PROFILE={}",
std::env::var("PROFILE").unwrap_or_default()
);
println!(
"cargo:rustc-env=HELEXA_TARGET={}",
std::env::var("TARGET").unwrap_or_default()
);
// Enabled features: cargo exports CARGO_FEATURE_<NAME> for each.
// Reverse the mangling (uppercase, '-'→'_') best-effort for display.
let mut features: Vec<String> = std::env::vars()
.filter_map(|(k, _)| k.strip_prefix("CARGO_FEATURE_").map(|f| f.to_string()))
.map(|f| f.to_lowercase().replace('_', "-"))
// `default` is the meta-feature, not a perf-relevant flag.
.filter(|f| f != "default")
.collect();
features.sort();
println!("cargo:rustc-env=HELEXA_FEATURES={}", features.join(","));
println!(
"cargo:rustc-env=HELEXA_CANDLE_VERSION={}",
candle_version().unwrap_or_default()
);
}
fn git(args: &[&str]) -> Option<String> {
let out = Command::new("git").args(args).output().ok()?;
if !out.status.success() {
return None;
}
let s = String::from_utf8_lossy(&out.stdout).trim().to_string();
if s.is_empty() { None } else { Some(s) }
}
/// Best-effort: read the locked `candle-core` version from the workspace
/// `Cargo.lock` (two levels up from this crate). Returns `None` if the
/// lockfile is absent (e.g. some packaging flows) or the entry isn't
/// found.
fn candle_version() -> Option<String> {
let manifest = std::env::var("CARGO_MANIFEST_DIR").ok()?;
let lock = std::path::Path::new(&manifest)
.join("..")
.join("..")
.join("Cargo.lock");
println!("cargo:rerun-if-changed={}", lock.display());
let text = std::fs::read_to_string(lock).ok()?;
// Cargo.lock entries are `[[package]]\nname = "x"\nversion = "y"`.
let mut in_candle = false;
for line in text.lines() {
let line = line.trim();
if line == "[[package]]" {
in_candle = false;
} else if line == "name = \"candle-core\"" {
in_candle = true;
} else if in_candle && let Some(rest) = line.strip_prefix("version = \"") {
return Some(rest.trim_end_matches('"').to_string());
}
}
None
}
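The lockfile scan in `candle_version` is easier to exercise when the file read is separated from the parsing. The sketch below is a hypothetical refactor, not code from this change: `locked_version` is an assumed name, it takes the lockfile text and a package name as arguments, and it uses a nested `if let` instead of a let-chain so it also compiles on editions before 2024.

```rust
/// Return the locked version of `package` from Cargo.lock text, or `None`
/// when the package is not listed. Relies on the lockfile layout where each
/// `[[package]]` table has `name = "..."` followed by `version = "..."`.
pub fn locked_version(lock_text: &str, package: &str) -> Option<String> {
    let wanted = format!("name = \"{package}\"");
    let mut in_pkg = false;
    for line in lock_text.lines() {
        let line = line.trim();
        if line == "[[package]]" {
            // A new table starts: forget any previous name match.
            in_pkg = false;
        } else if line == wanted {
            in_pkg = true;
        } else if in_pkg {
            if let Some(rest) = line.strip_prefix("version = \"") {
                return Some(rest.trim_end_matches('"').to_string());
            }
        }
    }
    None
}
```

Matching the whole `name = "..."` line, rather than a prefix, is what keeps a sibling package whose name merely starts with the same text (for example a hypothetical `candle-core-extras`) from being picked up by mistake.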


@@ -41,6 +41,7 @@ pub struct NeuronState {
/// Build the neuron API router.
pub fn neuron_routes() -> Router<Arc<NeuronState>> {
Router::new()
.route("/version", get(version_handler))
.route("/discovery", get(discovery_handler))
.route("/health", get(health_handler))
.route("/models", get(list_models))
@@ -51,6 +52,14 @@ pub fn neuron_routes() -> Router<Arc<NeuronState>> {
.route("/v1/responses", post(responses))
}
/// `GET /version` — the daemon's own build identity (git SHA, enabled
/// features, rustc/candle versions). Static for the process lifetime, so
/// no state is touched. This is the canonical "which build is live"
/// probe for fleet validation and benchmark attribution.
async fn version_handler() -> Json<cortex_core::build_info::BuildInfo> {
Json(crate::version::build_info())
}
async fn discovery_handler(State(state): State<Arc<NeuronState>>) -> Json<DiscoveryResponse> {
Json(state.discovery.clone())
}
@@ -81,6 +90,21 @@ async fn load_model(
State(state): State<Arc<NeuronState>>,
Json(spec): Json<ModelSpec>,
) -> impl IntoResponse {
// Driver/library mismatch preflight (#19): every CUDA load is
// guaranteed to fail until the host reboots. Reject up front with
// the operator-actionable reason instead of letting the load die
// minutes later inside cuInit/NCCL with a cryptic error.
if let Some(reason) = &state.discovery.cuda_unavailable_reason {
tracing::warn!(model = %spec.model_id, reason = %reason, "load_model rejected: CUDA unavailable");
return (
StatusCode::SERVICE_UNAVAILABLE,
Json(json!({
"error": reason,
"code": "cuda_unavailable",
})),
)
.into_response();
}
let registry = state.registry.read().await;
match registry.load_model(&spec).await {
Ok(()) => Json(json!({"status": "loaded"})).into_response(),
@@ -174,13 +198,43 @@ async fn model_endpoint(
}
}
/// Default `chat_template_kwargs.enable_thinking` to `include_thinking`
/// when the client didn't set it explicitly, leaving any explicit client
/// choice untouched. See the call site in [`chat_completions`] for the
/// rationale (reasoning eating the token budget for clients that drop it).
fn default_enable_thinking(req: &mut ChatCompletionRequest, include_thinking: bool) {
if req
.extra
.get("chat_template_kwargs")
.and_then(|k| k.get("enable_thinking"))
.is_some()
{
return; // client chose explicitly — respect it
}
if !req.extra.is_object() {
req.extra = json!({});
}
let Some(obj) = req.extra.as_object_mut() else {
return;
};
let kwargs = obj
.entry("chat_template_kwargs")
.or_insert_with(|| json!({}));
if !kwargs.is_object() {
*kwargs = json!({});
}
if let Some(kw) = kwargs.as_object_mut() {
kw.insert("enable_thinking".into(), json!(include_thinking));
}
}
/// OpenAI-compatible chat completions. Dispatches to streaming SSE when
/// `stream: true` is set on the request; otherwise returns a single
/// `ChatCompletionResponse`.
async fn chat_completions(
State(state): State<Arc<NeuronState>>,
headers: axum::http::HeaderMap,
Json(req): Json<ChatCompletionRequest>,
Json(mut req): Json<ChatCompletionRequest>,
) -> impl IntoResponse {
let Some(candle) = state.candle.as_ref().map(Arc::clone) else {
return (
@@ -205,6 +259,18 @@ async fn chat_completions(
reasoning_markers: None, // filled in from the loaded model inside candle
};
// Couple reasoning *generation* to reasoning *surfacing*. Reasoning
// models (Qwen3.6) think by default, and that `<think>` block can
// consume the entire `max_tokens` budget — which, when we then drop
// it (`include_thinking == false`, the default for OpenAI/Anthropic
// clients like Claude Code), leaves the visible answer empty or
// truncated. So when the caller isn't going to see the reasoning,
// don't generate it: default `enable_thinking` to `include_thinking`.
// A client that explicitly set `chat_template_kwargs.enable_thinking`
// wins; thinking-aware clients (helexa-acp, `x-include-thinking:
// true`) keep reasoning on.
default_enable_thinking(&mut req, include_thinking);
if req.stream.unwrap_or(false) {
match candle.chat_completion_stream_with(req, chat_config).await {
Ok(rx) => {
@@ -220,80 +286,12 @@ async fn chat_completions(
.keep_alive(KeepAlive::default())
.into_response()
}
Err(InferenceError::ModelNotLoaded(id)) => (
StatusCode::NOT_FOUND,
Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
)
.into_response(),
Err(InferenceError::PromptTooLong { prompt_len, max }) => (
StatusCode::BAD_REQUEST,
Json(json!({
"error": format!("prompt has {prompt_len} tokens but max is {max}"),
"code": "prompt_too_long",
"prompt_len": prompt_len,
"max": max,
})),
)
.into_response(),
Err(InferenceError::InsufficientVram {
free_mb,
required_mb,
}) => (
StatusCode::SERVICE_UNAVAILABLE,
Json(json!({
"error": format!(
"insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"
),
"code": "insufficient_vram",
"free_mb": free_mb,
"required_mb": required_mb,
})),
)
.into_response(),
Err(InferenceError::Other(e)) => (
StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({"error": format!("{e:#}")})),
)
.into_response(),
Err(e) => inference_error_response(e),
}
} else {
match candle.chat_completion(req).await {
Ok(resp) => Json(resp).into_response(),
Err(InferenceError::ModelNotLoaded(id)) => (
StatusCode::NOT_FOUND,
Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
)
.into_response(),
Err(InferenceError::PromptTooLong { prompt_len, max }) => (
StatusCode::BAD_REQUEST,
Json(json!({
"error": format!("prompt has {prompt_len} tokens but max is {max}"),
"code": "prompt_too_long",
"prompt_len": prompt_len,
"max": max,
})),
)
.into_response(),
Err(InferenceError::InsufficientVram {
free_mb,
required_mb,
}) => (
StatusCode::SERVICE_UNAVAILABLE,
Json(json!({
"error": format!(
"insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"
),
"code": "insufficient_vram",
"free_mb": free_mb,
"required_mb": required_mb,
})),
)
.into_response(),
Err(InferenceError::Other(e)) => (
StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({"error": format!("{e:#}")})),
)
.into_response(),
Err(e) => inference_error_response(e),
}
}
}
@@ -392,6 +390,9 @@ async fn responses(
input_tokens: u.prompt_tokens,
output_tokens: u.completion_tokens,
total_tokens: u.prompt_tokens + u.completion_tokens,
// Non-streaming reasoning accounting deferred (#64).
output_tokens_details: None,
input_tokens_details: None,
});
let meta = openai_responses::ResponseMeta {
response_id: mint_response_id(),
@@ -418,46 +419,94 @@ fn finish_reason_from_str(s: &str) -> crate::wire::FinishReason {
}
/// Centralised mapping from [`InferenceError`] to an HTTP response.
/// Lifted out so the chat-completions and responses handlers stay
/// readable and changes to error-code semantics happen in one spot.
///
/// Emits the OpenAI-standard *nested* error envelope:
///
/// ```json
/// { "error": { "message": "...", "type": "...", "code": "...", "param": null } }
/// ```
///
/// OpenAI-compatible clients (opencode, the openai SDK) reach into
/// `error.type` / `error.code` to drive behaviour — most importantly,
/// `code == "context_length_exceeded"` triggers auto-compaction and
/// retry rather than a hard failure. A flat `{"error": "..."}` string
/// is invisible to that logic, so every variant nests here. Diagnostic
/// extras (prompt_len, free_mb, …) ride *inside* the error object so
/// they don't break the envelope shape.
fn inference_error_response(err: InferenceError) -> axum::response::Response {
match err {
InferenceError::ModelNotLoaded(id) => (
StatusCode::NOT_FOUND,
Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
use cortex_core::error_envelope::OpenAiError;
let env = match err {
InferenceError::ModelNotLoaded(id) => OpenAiError::new(
404,
"invalid_request_error",
"model_not_found",
format!("model '{id}' not loaded on this neuron"),
)
.into_response(),
InferenceError::PromptTooLong { prompt_len, max } => (
StatusCode::BAD_REQUEST,
Json(json!({
"error": format!("prompt has {prompt_len} tokens but max is {max}"),
"code": "prompt_too_long",
"prompt_len": prompt_len,
"max": max,
})),
)
.into_response(),
.with_extra("model_id", json!(id)),
// OpenAI's canonical context-overflow error. opencode keys on
// `code == "context_length_exceeded"` and the message phrasing
// ("maximum context length is N tokens") to auto-compact+retry.
InferenceError::PromptTooLong { prompt_len, max } => {
OpenAiError::context_length_exceeded(format!(
"This model's maximum context length is {max} tokens. \
However, your messages resulted in {prompt_len} tokens. \
Please reduce the length of the messages."
))
.with_extra("prompt_len", json!(prompt_len))
.with_extra("max", json!(max))
}
// VRAM frees as the in-flight request(s) complete, so this is a
// transient 503 — advertise a short Retry-After (#63).
InferenceError::InsufficientVram {
free_mb,
required_mb,
} => (
StatusCode::SERVICE_UNAVAILABLE,
Json(json!({
"error": format!(
"insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"
),
"code": "insufficient_vram",
"free_mb": free_mb,
"required_mb": required_mb,
})),
} => OpenAiError::new(
503,
"api_error",
"insufficient_vram",
format!("insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"),
)
.into_response(),
InferenceError::Other(e) => (
StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({"error": format!("{e:#}")})),
.with_retry_after(5)
.with_extra("free_mb", json!(free_mb))
.with_extra("required_mb", json!(required_mb)),
InferenceError::VisionUnsupported { model_id } => OpenAiError::new(
400,
"invalid_request_error",
"vision_unsupported",
format!("model '{model_id}' does not support image input"),
)
.into_response(),
.with_extra("model_id", json!(model_id))
.with_extra(
"suggestion",
json!("load a vision-capable model or remove image_url content parts"),
),
InferenceError::TemplateRenderFailed { detail } => OpenAiError::new(
422,
"invalid_request_error",
"template_render_failed",
format!("chat template could not render this request: {detail}"),
),
InferenceError::Other(e) => OpenAiError::without_code(500, "api_error", format!("{e:#}")),
};
envelope_response(env)
}
/// Neuron adapter: turn the shared [`cortex_core::error_envelope::OpenAiError`]
/// into an axum response, setting `Retry-After` when the envelope carries one.
/// cortex-core owns the envelope shape (#60/#63); this is the only crossing
/// from that data into axum on the neuron side.
fn envelope_response(err: cortex_core::error_envelope::OpenAiError) -> axum::response::Response {
let status = StatusCode::from_u16(err.status).unwrap_or(StatusCode::INTERNAL_SERVER_ERROR);
let retry_after = err.retry_after_secs;
let mut response = (status, Json(err.body())).into_response();
if let Some(secs) = retry_after
&& let Ok(value) = axum::http::HeaderValue::from_str(&secs.to_string())
{
response
.headers_mut()
.insert(axum::http::header::RETRY_AFTER, value);
}
response
}
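The `Retry-After` contract set by `envelope_response` is easiest to see from the caller's side. The following is a minimal client-side sketch, not code from this diff: `retry_delay` is a hypothetical helper, and treating 429 as retryable alongside 503 is an assumption taken from the commit message rather than from this mapper. It handles only the integer-seconds form of the header; `Retry-After` may also carry an HTTP-date, which this sketch ignores.

```rust
use std::time::Duration;

/// Hypothetical client helper: decide how long to back off before retrying.
/// Returns `None` when the response should not be retried automatically.
fn retry_delay(status: u16, retry_after: Option<&str>) -> Option<Duration> {
    // Assumption: only 429 and 503 are treated as retryable rejections.
    if !matches!(status, 429 | 503) {
        return None;
    }
    // Only the delta-seconds form ("5") is parsed; anything else yields None.
    let secs: u64 = retry_after?.trim().parse().ok()?;
    Some(Duration::from_secs(secs))
}
```

With the mapping in this diff, an `insufficient_vram` rejection (503, `Retry-After: 5`) would yield a five-second delay, while a `context_length_exceeded` rejection (400, no header) yields `None`, so the caller compacts its prompt instead of retrying unchanged.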
fn mint_response_id() -> String {
@@ -481,3 +530,173 @@ fn unix_subsec_nanos() -> u64 {
.map(|d| d.as_nanos() as u64)
.unwrap_or(0)
}
#[cfg(test)]
mod thinking_tests {
use super::*;
fn req(value: serde_json::Value) -> ChatCompletionRequest {
serde_json::from_value(value).expect("valid ChatCompletionRequest")
}
fn enable_thinking(r: &ChatCompletionRequest) -> Option<bool> {
r.extra
.get("chat_template_kwargs")
.and_then(|k| k.get("enable_thinking"))
.and_then(|v| v.as_bool())
}
#[test]
fn defaults_enable_thinking_to_include_thinking_false() {
let mut r = req(json!({"model": "m", "messages": []}));
default_enable_thinking(&mut r, false);
assert_eq!(enable_thinking(&r), Some(false));
}
#[test]
fn defaults_enable_thinking_true_when_surfacing() {
let mut r = req(json!({"model": "m", "messages": []}));
default_enable_thinking(&mut r, true);
assert_eq!(enable_thinking(&r), Some(true));
}
#[test]
fn explicit_client_choice_is_respected() {
let mut r = req(json!({
"model": "m", "messages": [],
"chat_template_kwargs": {"enable_thinking": true}
}));
// include_thinking=false would normally force false; explicit wins.
default_enable_thinking(&mut r, false);
assert_eq!(enable_thinking(&r), Some(true));
}
#[test]
fn preserves_other_chat_template_kwargs() {
let mut r = req(json!({
"model": "m", "messages": [],
"chat_template_kwargs": {"some_other": 42}
}));
default_enable_thinking(&mut r, false);
assert_eq!(enable_thinking(&r), Some(false));
assert_eq!(
r.extra["chat_template_kwargs"]["some_other"],
json!(42),
"existing kwargs must survive"
);
}
}
#[cfg(test)]
mod error_envelope_tests {
use super::*;
use axum::http::StatusCode;
/// Drive an `InferenceError` through the mapper and decode the
/// `(status, json)` pair it produces.
async fn map(err: InferenceError) -> (StatusCode, Value) {
let resp = inference_error_response(err);
let status = resp.status();
let bytes = axum::body::to_bytes(resp.into_body(), usize::MAX)
.await
.expect("buffer error body");
let body: Value = serde_json::from_slice(&bytes).expect("error body is JSON");
(status, body)
}
#[tokio::test]
async fn prompt_too_long_is_context_length_exceeded() {
let (status, body) = map(InferenceError::PromptTooLong {
prompt_len: 60_000,
max: 49_152,
})
.await;
assert_eq!(status, StatusCode::BAD_REQUEST);
// The envelope must be nested under `error`, not a flat string.
let error = body
.get("error")
.and_then(Value::as_object)
.expect("error object");
assert_eq!(error["type"], "invalid_request_error");
assert_eq!(
error["code"], "context_length_exceeded",
"opencode keys on this code to auto-compact and retry"
);
assert_eq!(error["param"], Value::Null);
// Phrasing opencode/openai clients pattern-match on.
let msg = error["message"].as_str().unwrap();
assert!(
msg.contains("maximum context length is 49152 tokens"),
"message was: {msg}"
);
// Diagnostics ride inside the error object.
assert_eq!(error["prompt_len"], 60_000);
assert_eq!(error["max"], 49_152);
}
#[tokio::test]
async fn model_not_loaded_is_404_model_not_found() {
let (status, body) = map(InferenceError::ModelNotLoaded("Qwen/X".into())).await;
assert_eq!(status, StatusCode::NOT_FOUND);
let error = &body["error"];
assert_eq!(error["type"], "invalid_request_error");
assert_eq!(error["code"], "model_not_found");
assert_eq!(error["model_id"], "Qwen/X");
}
#[tokio::test]
async fn insufficient_vram_is_503_api_error() {
let (status, body) = map(InferenceError::InsufficientVram {
free_mb: 1_024,
required_mb: 8_192,
})
.await;
assert_eq!(status, StatusCode::SERVICE_UNAVAILABLE);
let error = &body["error"];
assert_eq!(error["type"], "api_error");
assert_eq!(error["code"], "insufficient_vram");
assert_eq!(error["free_mb"], 1_024);
assert_eq!(error["required_mb"], 8_192);
}
#[tokio::test]
async fn insufficient_vram_carries_retry_after() {
// Transient 503 — VRAM frees as in-flight requests finish, so the
// client should back off and retry (#63).
let resp = inference_error_response(InferenceError::InsufficientVram {
free_mb: 1_024,
required_mb: 8_192,
});
let retry = resp
.headers()
.get(axum::http::header::RETRY_AFTER)
.expect("transient 503 must advertise Retry-After");
assert_eq!(retry.to_str().unwrap(), "5");
}
#[tokio::test]
async fn permanent_rejections_have_no_retry_after() {
// context_length_exceeded is permanent for this request — no hint.
let resp = inference_error_response(InferenceError::PromptTooLong {
prompt_len: 60_000,
max: 49_152,
});
assert!(
resp.headers()
.get(axum::http::header::RETRY_AFTER)
.is_none(),
"permanent rejection must not advertise Retry-After"
);
}
#[tokio::test]
async fn other_is_500_with_null_code() {
let (status, body) = map(InferenceError::Other(anyhow::anyhow!("kaboom"))).await;
assert_eq!(status, StatusCode::INTERNAL_SERVER_ERROR);
let error = &body["error"];
assert_eq!(error["type"], "api_error");
assert_eq!(error["code"], Value::Null);
assert!(error["message"].as_str().unwrap().contains("kaboom"));
}
}

@@ -6,8 +6,18 @@ use figment::{
providers::{Env, Format, Toml},
};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::path::{Path, PathBuf};
/// Default scheme name applied to bare `org/name` model ids when no
/// `[harness.candle.default_source]` is set. Keeps existing operator
/// configs (which know nothing about schemes) working unchanged.
pub const DEFAULT_SOURCE_SCHEME: &str = "huggingface";
/// Endpoint URL for the default huggingface source, used when no
/// `[harness.candle.sources.huggingface]` is configured.
pub const DEFAULT_HF_ENDPOINT: &str = "https://huggingface.co";
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct NeuronConfig {
#[serde(default = "default_port")]
@@ -37,8 +47,229 @@ pub struct HarnessSettings {
pub struct CandleHarnessConfig {
/// HuggingFace cache directory for model weights.
/// When unset, defers to hf-hub's default (~/.cache/huggingface).
///
/// Retained for back-compat — operators with existing
/// `hf_cache = "..."` configs continue to work. Treated as the
/// `huggingface` source's cache_dir when a sources table isn't
/// provided.
#[serde(default)]
pub hf_cache: Option<PathBuf>,
/// Default source scheme applied to bare `org/name` model ids
/// (those without an explicit `scheme:` prefix). When unset, falls
/// back to `DEFAULT_SOURCE_SCHEME` ("huggingface").
#[serde(default)]
pub default_source: Option<String>,
/// Per-scheme source endpoints. Each entry maps a scheme name
/// (`huggingface`, `helexa`, an operator's mirror tag, …) to its
/// endpoint URL, optional auth env var, and optional cache
/// directory.
///
/// When absent or missing the `huggingface` key, the loader
/// synthesises a `huggingface` entry pointing at
/// `https://huggingface.co` with `hf_cache` (above) as its
/// cache_dir. This keeps single-source configs ergonomic.
#[serde(default)]
pub sources: HashMap<String, SourceConfig>,
/// Prefix KV cache across requests (#11). Applies per loaded
/// model, on architectures that support cache snapshots (qwen3_5).
#[serde(default)]
pub prefix_cache: PrefixCacheConfig,
/// Self-derived context/token limits (#67). The neuron computes the
/// most-efficient `limit{context,input,output}` that still allows
/// coherent agentic performance from model architecture + live free
/// VRAM + a self-measured throughput ceiling, advertises it on
/// `/models`, and enforces it. These knobs tune that derivation.
#[serde(default)]
pub context_limit: ContextLimitConfig,
}
/// `[harness.candle.prefix_cache]` settings.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PrefixCacheConfig {
/// Master switch. On by default — set `false` to restore the
/// clear-every-request behaviour.
#[serde(default = "default_prefix_cache_enabled")]
pub enabled: bool,
/// Snapshot byte budget per loaded model, in MiB. Snapshots live
/// on the model's device, so this comes out of the same VRAM that
/// serves inference — size it against the device's headroom after
/// the model weights.
#[serde(default = "default_prefix_cache_budget_mb")]
pub budget_mb: u64,
/// Maximum live snapshots per loaded model, regardless of budget.
#[serde(default = "default_prefix_cache_max_entries")]
pub max_entries: usize,
}
impl Default for PrefixCacheConfig {
fn default() -> Self {
Self {
enabled: default_prefix_cache_enabled(),
budget_mb: default_prefix_cache_budget_mb(),
max_entries: default_prefix_cache_max_entries(),
}
}
}
fn default_prefix_cache_enabled() -> bool {
true
}
fn default_prefix_cache_budget_mb() -> u64 {
1024
}
fn default_prefix_cache_max_entries() -> usize {
8
}
/// `[harness.candle.context_limit]` settings (#67).
///
/// The derived limit is `context = min(max_position_embeddings,
/// vram_ceiling, throughput_ceiling)`, then `input = context -
/// output_reserve`. `vram_ceiling` and `throughput_ceiling` read live
/// state, so the advertised/enforced limit tracks the resident model and
/// rises automatically as efficiency work (e.g. prefix caching, #11)
/// frees headroom or speeds prefill — no operator action.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ContextLimitConfig {
/// Master switch. On by default — set `false` to fall back to the
/// static `NEURON_MAX_PROMPT_TOKENS` cap with no advertised limit.
#[serde(default = "default_context_limit_enabled")]
pub enabled: bool,
/// Coherence target: the longest prefill-per-turn latency (seconds)
/// considered acceptable agentic performance. The throughput ceiling
/// is `target_prefill_latency_secs × measured_prefill_tok_per_sec`.
/// Raise it once cross-request prefix caching (#11) makes long
/// contexts cheap to re-prefill.
#[serde(default = "default_target_prefill_latency_secs")]
pub target_prefill_latency_secs: f64,
/// Cold-start prefill speed (tokens/sec) used for the throughput
/// ceiling until the model has served enough requests to measure its
/// own rate. A conservative estimate; the live EMA supersedes it.
#[serde(default = "default_bootstrap_prefill_tok_per_sec")]
pub bootstrap_prefill_tok_per_sec: f64,
/// VRAM (MiB) reserved per card for prefill activations on top of the
/// resident weights and the KV cache, before computing the VRAM
/// context ceiling.
#[serde(default = "default_activation_headroom_mb")]
pub activation_headroom_mb: u64,
/// Free-VRAM floor (MiB) kept available per card — the VRAM ceiling
/// leaves at least this much unused. Mirrors `NEURON_MIN_FREE_VRAM_MB`.
#[serde(default = "default_context_min_free_floor_mb")]
pub min_free_floor_mb: u64,
/// Generation reserve (tokens) left below the context wall:
/// `input = context - output_reserve_tokens`. Defaults to neuron's
/// default `max_tokens`.
#[serde(default = "default_output_reserve_tokens")]
pub output_reserve_tokens: usize,
}
impl Default for ContextLimitConfig {
fn default() -> Self {
Self {
enabled: default_context_limit_enabled(),
target_prefill_latency_secs: default_target_prefill_latency_secs(),
bootstrap_prefill_tok_per_sec: default_bootstrap_prefill_tok_per_sec(),
activation_headroom_mb: default_activation_headroom_mb(),
min_free_floor_mb: default_context_min_free_floor_mb(),
output_reserve_tokens: default_output_reserve_tokens(),
}
}
}
fn default_context_limit_enabled() -> bool {
true
}
fn default_target_prefill_latency_secs() -> f64 {
// ~2 min/turn is the coherence wall observed pre-#11 on beast
// (the issue's worked example). Raisable once prefix caching lands.
120.0
}
fn default_bootstrap_prefill_tok_per_sec() -> f64 {
// beast Qwen3.6-27B TP=2 measured ~850 tok/s prefill; a conservative
// floor so the cold-start ceiling isn't wildly optimistic.
800.0
}
fn default_activation_headroom_mb() -> u64 {
2048
}
fn default_context_min_free_floor_mb() -> u64 {
1500
}
fn default_output_reserve_tokens() -> usize {
8192
}
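The derivation described in the `ContextLimitConfig` doc comments can be worked through with the defaults above. This is a minimal sketch under stated assumptions: `derive_limits` and `Limits` are hypothetical names that do not appear in this diff, and the model and VRAM figures in the usage note are made up. The real neuron reads the VRAM ceiling and the prefill rate from live state.

```rust
/// Hypothetical stand-in for the derived limits (illustrative names).
#[derive(Debug, PartialEq)]
struct Limits {
    context: usize,
    input: usize,
}

/// context = min(max_position_embeddings, vram_ceiling, throughput_ceiling)
/// input   = context - output_reserve
fn derive_limits(
    max_position_embeddings: usize,
    vram_ceiling_tokens: usize,
    target_prefill_latency_secs: f64,
    prefill_tok_per_sec: f64,
    output_reserve_tokens: usize,
) -> Limits {
    // Throughput ceiling: how many tokens can be prefilled within the latency target.
    let throughput_ceiling = (target_prefill_latency_secs * prefill_tok_per_sec).floor() as usize;
    let context = max_position_embeddings
        .min(vram_ceiling_tokens)
        .min(throughput_ceiling);
    // Saturating subtraction so a very small context cannot underflow.
    let input = context.saturating_sub(output_reserve_tokens);
    Limits { context, input }
}
```

Calling `derive_limits(262_144, 150_000, 120.0, 800.0, 8192)` uses the default latency target, bootstrap rate, and output reserve from this file together with two invented ceilings. The throughput ceiling is 120 × 800 = 96,000 tokens, which is the smallest of the three, so `context` is 96,000 and `input` is 96,000 - 8,192 = 87,808.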
/// Per-scheme source configuration. Mirrors the shape `hf_hub::ApiBuilder`
/// needs: endpoint URL, optional auth token (read from an env var so
/// secrets stay out of the config file), and optional cache directory
/// disambiguated per source to prevent mirror-vs-canonical collisions.
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct SourceConfig {
/// Base URL of the registry. Must speak the HF-compatible wire
/// format (siblings listing at
/// `/api/models/{org}/{name}[/revision/{rev}]`, blob fetch at
/// `/{org}/{name}/resolve/{rev}/{path}`).
pub endpoint: String,
/// Environment variable name to read for the bearer token used
/// against this source. `None` = anonymous. Reading from env
/// (vs. literal token in the config) keeps secrets out of TOML.
#[serde(default)]
pub auth_env: Option<String>,
/// Cache directory for this source. The hf-hub
/// `models--{org}--{name}/snapshots/...` tree lives directly
/// under this path, so distinct sources serving the same
/// `org/name` cannot collide on disk.
///
/// `None` means "share the harness `hf_cache` directory" — only
/// safe when the operator has exactly one source configured.
#[serde(default)]
pub cache_dir: Option<PathBuf>,
}
impl CandleHarnessConfig {
/// Resolve the effective sources map for this config, synthesising
/// a `huggingface` entry from legacy fields (`hf_cache`) when the
/// operator hasn't supplied a sources table. Idempotent.
///
/// Returns a fresh map rather than mutating self so the original
/// (operator-typed) config can still be serialized back to TOML
/// for diagnostics.
pub fn effective_sources(&self) -> HashMap<String, SourceConfig> {
let mut out = self.sources.clone();
out.entry(DEFAULT_SOURCE_SCHEME.to_string())
.or_insert_with(|| SourceConfig {
endpoint: DEFAULT_HF_ENDPOINT.to_string(),
auth_env: Some("HF_TOKEN".to_string()),
cache_dir: self.hf_cache.clone(),
});
out
}
/// Effective default scheme. Falls back to `DEFAULT_SOURCE_SCHEME`
/// when the operator hasn't pinned one.
pub fn effective_default_source(&self) -> &str {
self.default_source
.as_deref()
.unwrap_or(DEFAULT_SOURCE_SCHEME)
}
}
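The synthesis rule in `effective_sources` comes down to one `HashMap` entry call: insert a default only when the key is absent. A minimal standalone sketch follows, assuming a trimmed-down `Source` type (hypothetical, with no `auth_env` field or serde derives) in place of the real `SourceConfig`.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Trimmed-down, hypothetical stand-in for `SourceConfig`.
#[derive(Debug, Clone, PartialEq)]
struct Source {
    endpoint: String,
    cache_dir: Option<PathBuf>,
}

/// Return a copy of the configured sources with a default `huggingface`
/// entry added only when the operator has not supplied one.
fn effective_sources(
    configured: &HashMap<String, Source>,
    legacy_cache: Option<PathBuf>,
) -> HashMap<String, Source> {
    let mut out = configured.clone();
    // `or_insert_with` runs the closure only when the key is missing,
    // so an explicit operator entry is never overwritten.
    out.entry("huggingface".to_string()).or_insert_with(|| Source {
        endpoint: "https://huggingface.co".to_string(),
        cache_dir: legacy_cache,
    });
    out
}
```

Because the function clones rather than mutates, the operator-typed map stays untouched, which matches the doc comment's point about serialising the original config back out for diagnostics.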
fn default_port() -> u16 {
@@ -65,3 +296,109 @@ impl Default for NeuronConfig {
}
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn effective_sources_synthesises_huggingface_when_absent() {
let cfg = CandleHarnessConfig::default();
let sources = cfg.effective_sources();
assert!(sources.contains_key("huggingface"));
let hf = &sources["huggingface"];
assert_eq!(hf.endpoint, DEFAULT_HF_ENDPOINT);
assert_eq!(hf.auth_env.as_deref(), Some("HF_TOKEN"));
assert!(hf.cache_dir.is_none());
}
#[test]
fn effective_sources_carries_legacy_hf_cache_into_synth_entry() {
// Existing operator configs only set `hf_cache = "/archive3/..."`
// — the synth must pick that up so the loader keeps using the
// operator's storage.
let cfg = CandleHarnessConfig {
hf_cache: Some(PathBuf::from("/archive3/llm-cache")),
..Default::default()
};
let sources = cfg.effective_sources();
assert_eq!(
sources["huggingface"].cache_dir.as_deref(),
Some(Path::new("/archive3/llm-cache"))
);
}
#[test]
fn effective_sources_preserves_explicit_huggingface_entry() {
// When an operator types out `[harness.candle.sources.huggingface]`
// explicitly, we must not clobber it with the synth defaults.
let mut sources = HashMap::new();
sources.insert(
"huggingface".to_string(),
SourceConfig {
endpoint: "https://huggingface.example.org".into(),
auth_env: Some("MY_TOKEN".into()),
cache_dir: Some(PathBuf::from("/operator-cache")),
},
);
let cfg = CandleHarnessConfig {
hf_cache: Some(PathBuf::from("/legacy-cache")),
sources,
..Default::default()
};
let effective = cfg.effective_sources();
assert_eq!(
effective["huggingface"].endpoint,
"https://huggingface.example.org"
);
assert_eq!(
effective["huggingface"].auth_env.as_deref(),
Some("MY_TOKEN")
);
assert_eq!(
effective["huggingface"].cache_dir.as_deref(),
Some(Path::new("/operator-cache"))
);
}
#[test]
fn effective_sources_includes_helexa_alongside_synth_huggingface() {
let mut sources = HashMap::new();
sources.insert(
"helexa".to_string(),
SourceConfig {
endpoint: "https://registry.helexa.ai".into(),
auth_env: Some("HELEXA_TOKEN".into()),
cache_dir: Some(PathBuf::from("/archive3/llm-cache/helexa")),
},
);
let cfg = CandleHarnessConfig {
hf_cache: Some(PathBuf::from("/archive3/llm-cache/huggingface")),
sources,
..Default::default()
};
let effective = cfg.effective_sources();
assert_eq!(effective.len(), 2);
assert_eq!(effective["helexa"].endpoint, "https://registry.helexa.ai");
// huggingface still gets synth-derived from legacy hf_cache.
assert_eq!(
effective["huggingface"].cache_dir.as_deref(),
Some(Path::new("/archive3/llm-cache/huggingface"))
);
}
#[test]
fn effective_default_source_falls_back() {
let cfg = CandleHarnessConfig::default();
assert_eq!(cfg.effective_default_source(), DEFAULT_SOURCE_SCHEME);
}
#[test]
fn effective_default_source_honours_explicit() {
let cfg = CandleHarnessConfig {
default_source: Some("helexa".into()),
..Default::default()
};
assert_eq!(cfg.effective_default_source(), "helexa");
}
}

@@ -100,6 +100,87 @@ pub fn parse_health_info(csv_output: &str) -> Result<Vec<DeviceHealth>> {
Ok(devices)
}
// ── Driver/library mismatch preflight (#19) ─────────────────────────
/// Classify a failed nvidia-smi invocation: is it the classic
/// "Driver/library version mismatch" (userspace libs updated, kernel
/// module not reloaded — every CUDA call on the host is dead until a
/// reboot)? Returns the userspace NVML library version when the
/// message carries one ("NVML library version: 580.159"), or
/// `Some("unknown")` for a mismatch without a parsable version.
/// `None` for any other failure — other errors (no devices, perms)
/// are NOT the mismatch and must not trigger the loud diagnosis.
pub fn classify_driver_mismatch(combined_output: &str) -> Option<String> {
if !combined_output.contains("Driver/library version mismatch") {
return None;
}
let userspace = combined_output
.lines()
.find_map(|l| l.trim().strip_prefix("NVML library version:"))
.map(|v| v.trim().to_string())
.filter(|v| !v.is_empty())
.unwrap_or_else(|| "unknown".to_string());
Some(userspace)
}
/// Extract the loaded kernel module's driver version from
/// `/proc/driver/nvidia/version` contents. Typical first line:
///
/// ```text
/// NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 580.159.03 Release Build (...)
/// ```
pub fn parse_kernel_module_version(proc_contents: &str) -> Option<String> {
let is_numeric = |p: &str| !p.is_empty() && p.chars().all(|c| c.is_ascii_digit());
let line = proc_contents
.lines()
.find(|l| l.starts_with("NVRM version:"))?;
line.split_whitespace()
.find(|tok| {
let mut parts = tok.split('.');
parts.next().is_some_and(is_numeric) && parts.next().is_some_and(is_numeric)
})
.map(|s| s.to_string())
}
/// Render the operator-actionable mismatch description carried in
/// `DiscoveryResponse::cuda_unavailable_reason` and logged at startup.
pub fn mismatch_reason(userspace: &str, kernel_module: Option<&str>) -> String {
format!(
"host NVIDIA driver/library mismatch (userspace NVML {userspace} vs loaded kernel \
module {}) — reboot the host to reload the kernel module; all CUDA inference is \
unavailable until then",
kernel_module.unwrap_or("unknown")
)
}
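The helpers above compose into a single diagnosis step. The sketch below restates the two parsers from this diff in standalone form and adds a hypothetical `diagnose` wrapper (not in the diff) that takes the `/proc/driver/nvidia/version` contents as an argument, so it can be exercised without a GPU host. The message format in `diagnose` is shortened for illustration and is not the wording of `mismatch_reason`.

```rust
/// Same logic as the diff's classifier: `None` unless the output is the mismatch.
fn classify_driver_mismatch(combined_output: &str) -> Option<String> {
    if !combined_output.contains("Driver/library version mismatch") {
        return None;
    }
    let userspace = combined_output
        .lines()
        .find_map(|l| l.trim().strip_prefix("NVML library version:"))
        .map(|v| v.trim().to_string())
        .filter(|v| !v.is_empty())
        .unwrap_or_else(|| "unknown".to_string());
    Some(userspace)
}

fn is_numeric(p: &str) -> bool {
    !p.is_empty() && p.chars().all(|c| c.is_ascii_digit())
}

/// First whitespace-separated token on the NVRM line whose first two
/// dot-separated parts are both numeric, e.g. "580.159.03".
fn parse_kernel_module_version(proc_contents: &str) -> Option<String> {
    let line = proc_contents
        .lines()
        .find(|l| l.starts_with("NVRM version:"))?;
    line.split_whitespace()
        .find(|tok| {
            let mut parts = tok.split('.');
            parts.next().is_some_and(is_numeric) && parts.next().is_some_and(is_numeric)
        })
        .map(|s| s.to_string())
}

/// Hypothetical wrapper: returns a short reason only for the mismatch case.
fn diagnose(smi_output: &str, proc_version: Option<&str>) -> Option<String> {
    let userspace = classify_driver_mismatch(smi_output)?;
    let kmod = proc_version.and_then(parse_kernel_module_version);
    Some(format!(
        "userspace NVML {userspace} vs loaded kernel module {}",
        kmod.as_deref().unwrap_or("unknown")
    ))
}
```

Passing the proc contents in, rather than reading the file inside the function, keeps the classification pure. Any other nvidia-smi failure, such as "No devices were found", falls through to `None` and would take the generic warning path.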
/// Outcome of an nvidia-smi invocation, distinguishing "binary not
/// present" (CPU-only host, not an error) from "present but failing"
/// (possible driver mismatch — worth classifying).
enum SmiOutcome {
Ok(String),
Failed(String),
Absent,
}
async fn run_nvidia_smi(args: &[&str]) -> SmiOutcome {
match tokio::process::Command::new("nvidia-smi")
.args(args)
.output()
.await
{
Err(_) => SmiOutcome::Absent,
Ok(out) if out.status.success() => {
SmiOutcome::Ok(String::from_utf8_lossy(&out.stdout).to_string())
}
Ok(out) => {
let mut combined = String::from_utf8_lossy(&out.stdout).to_string();
combined.push('\n');
combined.push_str(&String::from_utf8_lossy(&out.stderr));
SmiOutcome::Failed(combined)
}
}
}
// ── Command execution wrappers ──────────────────────────────────────
async fn run_command(cmd: &str, args: &[&str]) -> Result<String> {
@@ -139,23 +220,42 @@ pub async fn discover_system() -> Result<DiscoveryResponse> {
.trim()
.to_string();
let (devices, driver_version) = match run_command_optional(
"nvidia-smi",
&[
&format!("--query-gpu={NVIDIA_SMI_DISCOVERY_QUERY}"),
"--format=csv,noheader,nounits",
],
)
let (devices, driver_version, cuda_unavailable_reason) = match run_nvidia_smi(&[
&format!("--query-gpu={NVIDIA_SMI_DISCOVERY_QUERY}"),
"--format=csv,noheader,nounits",
])
.await
{
Some(output) => {
SmiOutcome::Ok(output) => {
let devs = parse_gpu_info(&output).unwrap_or_default();
let driver = parse_driver_version(&output);
(devs, driver)
(devs, driver, None)
}
None => {
SmiOutcome::Absent => {
tracing::info!("nvidia-smi not found — no GPU devices discovered");
(vec![], None)
(vec![], None, None)
}
SmiOutcome::Failed(combined) => {
// nvidia-smi exists but can't talk to the driver. The case
// worth diagnosing precisely is the userspace↔kernel-module
// version skew after an un-rebooted driver update (#19) —
// every CUDA call on the host fails until a reboot, and
// without this classification it surfaces as a cryptic
// NCCL/cuInit error deep inside the first model load.
let reason = classify_driver_mismatch(&combined).map(|userspace| {
let kmod = std::fs::read_to_string("/proc/driver/nvidia/version")
.ok()
.as_deref()
.and_then(parse_kernel_module_version);
mismatch_reason(&userspace, kmod.as_deref())
});
if reason.is_none() {
tracing::warn!(
output = %combined.trim(),
"nvidia-smi present but failing — no GPU devices discovered"
);
}
(vec![], None, reason)
}
};
@@ -172,6 +272,8 @@ pub async fn discover_system() -> Result<DiscoveryResponse> {
driver_version,
devices,
harnesses: vec![], // populated by harness registry in Phase 8
cuda_unavailable_reason,
max_prompt_tokens: crate::harness::candle::max_prompt_tokens() as u64,
})
}
@@ -272,4 +374,63 @@ mod tests {
assert_eq!(health[1].vram_used_mb, 4096);
assert_eq!(health[1].temp_c, 58);
}
// ── #19 driver/library mismatch preflight ────────────────────────
#[test]
fn classify_driver_mismatch_detects_and_extracts_nvml_version() {
// Verbatim shape of nvidia-smi's failure output on a host
// whose userspace libs were updated without a reboot.
let out = "Failed to initialize NVML: Driver/library version mismatch\n\
NVML library version: 580.159\n";
assert_eq!(classify_driver_mismatch(out).as_deref(), Some("580.159"));
}
#[test]
fn classify_driver_mismatch_without_version_line() {
let out = "Failed to initialize NVML: Driver/library version mismatch\n";
assert_eq!(classify_driver_mismatch(out).as_deref(), Some("unknown"));
}
#[test]
fn classify_driver_mismatch_ignores_other_failures() {
// Other nvidia-smi failures must NOT be diagnosed as the
// mismatch (no false positives on healthy or odd hosts).
for out in [
"No devices were found\n",
"Failed to initialize NVML: Insufficient Permissions\n",
"NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.\n",
"",
] {
assert_eq!(
classify_driver_mismatch(out),
None,
"false positive on: {out:?}"
);
}
}
#[test]
fn parse_kernel_module_version_from_proc() {
let proc = "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 580.159.03 Release Build (dvs-builder@U22-I3-AE24-12-2) Tue May 12 21:03:35 UTC 2026\n\
GCC version: gcc version 15.2.1 20251022 (Red Hat 15.2.1-3) (GCC)\n";
assert_eq!(
parse_kernel_module_version(proc).as_deref(),
Some("580.159.03")
);
}
#[test]
fn parse_kernel_module_version_absent() {
assert_eq!(parse_kernel_module_version(""), None);
assert_eq!(parse_kernel_module_version("GCC version: gcc 15\n"), None);
}
#[test]
fn mismatch_reason_is_operator_actionable() {
let reason = mismatch_reason("580.159", Some("580.159.03"));
assert!(reason.contains("580.159"));
assert!(reason.contains("580.159.03"));
assert!(reason.contains("reboot"));
}
}

@@ -24,6 +24,7 @@ use super::linear_attn::GatedDeltaNet;
use super::mlp::Qwen3_5MLP;
use super::rmsnorm::Qwen3_5RmsNorm;
use super::rope::RotaryEmbedding;
use super::snapshot::LayerKvSnapshot;
/// One of the two attention flavours sitting in a decoder layer's
/// attention slot. Full-attention layers need the rotary table and
@@ -93,12 +94,13 @@ impl Qwen3_5DecoderLayer {
&mut self,
x: &Tensor,
attn_mask: Option<&Tensor>,
offset: usize,
cos: &Tensor,
sin: &Tensor,
) -> candle_core::Result<Tensor> {
let h = self.input_layernorm.forward(x)?;
let attn_out = match &mut self.attention {
AttentionKind::Full(attn) => attn.forward(&h, attn_mask, offset)?,
// Linear attention ignores attn_mask + offset; its causal
AttentionKind::Full(attn) => attn.forward(&h, attn_mask, cos, sin)?,
// Linear attention ignores attn_mask + rope; its causal
// structure is baked into the recurrent state lifecycle.
AttentionKind::Linear(net) => net.forward(&h)?,
};
@@ -114,4 +116,37 @@ impl Qwen3_5DecoderLayer {
AttentionKind::Linear(net) => net.clear_kv_cache(),
}
}
/// Capture this layer's cache state for a prefix snapshot.
pub fn snapshot_kv(&self) -> candle_core::Result<LayerKvSnapshot> {
Ok(match &self.attention {
AttentionKind::Full(attn) => LayerKvSnapshot::Full(attn.snapshot_kv()),
AttentionKind::Linear(net) => {
let (conv_state, recurrent_state) = net.snapshot_state()?;
LayerKvSnapshot::Linear {
conv_state,
recurrent_state,
}
}
})
}
/// Replace this layer's cache state from a snapshot. The snapshot
/// variant must match the layer's attention kind — a mismatch
/// means the snapshot came from a different model.
pub fn restore_kv(&mut self, snap: &LayerKvSnapshot) -> candle_core::Result<()> {
match (&mut self.attention, snap) {
(AttentionKind::Full(attn), LayerKvSnapshot::Full(kv)) => attn.restore_kv(kv.as_ref()),
(
AttentionKind::Linear(net),
LayerKvSnapshot::Linear {
conv_state,
recurrent_state,
},
) => net.restore_state(conv_state.as_ref(), recurrent_state.as_ref()),
_ => candle_core::bail!(
"restore_kv: snapshot layer kind does not match this layer's attention kind"
),
}
}
}

View File

@@ -96,7 +96,8 @@ impl Qwen3_5Attention {
&mut self,
x: &Tensor,
attn_mask: Option<&Tensor>,
-offset: usize,
+cos: &Tensor,
+sin: &Tensor,
) -> candle_core::Result<Tensor> {
let (b, l, _) = x.dims3()?;
@@ -131,8 +132,9 @@ impl Qwen3_5Attention {
.transpose(1, 2)?
.contiguous()?;
-// 3. RoPE on q, k.
-let (q, k) = self.rotary.apply(&q, &k, offset)?;
+// 3. RoPE on q, k (cos/sin built once per forward by the model:
+// interleaved M-RoPE for image tokens, plain for text).
+let (q, k) = self.rotary.apply_cos_sin(&q, &k, cos, sin)?;
// 4. KV cache.
let (k, v) = self.kv_cache.append(&k, &v)?;
@@ -163,6 +165,26 @@ impl Qwen3_5Attention {
pub fn clear_kv_cache(&mut self) {
self.kv_cache.reset();
}
/// Capture the KV cache contents for a prefix snapshot. Shallow
/// clones: `ConcatKvCache::append` cats into fresh allocations and
/// never mutates stored tensors in place, so the captured tensors
/// stay valid after the live cache moves on.
pub fn snapshot_kv(&self) -> Option<(Tensor, Tensor)> {
match (self.kv_cache.k(), self.kv_cache.v()) {
(Some(k), Some(v)) => Some((k.clone(), v.clone())),
_ => None,
}
}
/// Replace the live KV cache with a previously captured snapshot.
pub fn restore_kv(&mut self, snap: Option<&(Tensor, Tensor)>) -> candle_core::Result<()> {
self.kv_cache.reset();
if let Some((k, v)) = snap {
self.kv_cache.append(k, v)?;
}
Ok(())
}
}
fn load_linear_no_bias(

View File

@@ -49,11 +49,15 @@
//!
//! ## Performance note
//!
-//! This impl is the **recurrent** delta-rule for both prefill and
-//! decode, i.e. the algorithm in `torch_recurrent_gated_delta_rule`.
-//! Correctness-first. The chunked algorithm (chunk_size=64) in
-//! `torch_chunk_gated_delta_rule` is a perf optimisation for long
-//! prefill; can be added later without changing the surface.
+//! Prefill (seq_len ≥ 64) runs the **chunked** delta rule (#23): the
+//! algorithm in `torch_chunk_gated_delta_rule`, reorganised into
+//! per-chunk batched matmuls; see [`run_chunk_gated_delta_rule`].
+//! Decode steps and short prompts keep the **recurrent** per-token
+//! rule (`torch_recurrent_gated_delta_rule`): a CUDA kernel on
+//! device, a pure-Rust loop on CPU. Both paths agree to within f32
+//! tolerance (pinned by the `chunked_matches_recurrent_*` parity
+//! tests); `NEURON_GDN_CHUNKED=0` forces the recurrent paths for
+//! A/B measurement.
use anyhow::{Context, Result};
use candle_core::{Module, Tensor};
@@ -184,6 +188,42 @@ impl GatedDeltaNet {
self.state = GatedDeltaNetState::default();
}
/// Deep-copy the recurrent state for a prefix snapshot. Must be a
/// real copy (`Tensor::copy`), not a refcount clone: the CUDA
/// delta-rule kernels write the state buffer in place, so a
/// shared-storage snapshot would be corrupted by the next forward.
pub fn snapshot_state(&self) -> candle_core::Result<(Option<Tensor>, Option<Tensor>)> {
let conv = self
.state
.conv_state
.as_ref()
.map(Tensor::copy)
.transpose()?;
let rec = self
.state
.recurrent_state
.as_ref()
.map(Tensor::copy)
.transpose()?;
Ok((conv, rec))
}
/// Replace the live recurrent state with a deep copy of a
/// previously captured snapshot. Deep copy for the same in-place
/// kernel reason as [`Self::snapshot_state`] — the snapshot must
/// survive being restored more than once.
pub fn restore_state(
&mut self,
conv_state: Option<&Tensor>,
recurrent_state: Option<&Tensor>,
) -> candle_core::Result<()> {
self.state = GatedDeltaNetState {
conv_state: conv_state.map(Tensor::copy).transpose()?,
recurrent_state: recurrent_state.map(Tensor::copy).transpose()?,
};
Ok(())
}
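
The deep-copy requirement described in the two doc comments above can be illustrated without candle. This is a minimal sketch, assuming a hypothetical `Buf` type (`Rc<RefCell<Vec<f32>>>`) as a stand-in for reference-counted tensor storage that a kernel writes in place; none of these names come from the commit.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a tensor with reference-counted storage.
type Buf = Rc<RefCell<Vec<f32>>>;

/// Refcount clone: the snapshot shares storage with the live state.
fn shallow_snapshot(live: &Buf) -> Buf {
    Rc::clone(live)
}

/// Deep copy: the snapshot owns freshly allocated storage.
fn deep_snapshot(live: &Buf) -> Buf {
    Rc::new(RefCell::new(live.borrow().clone()))
}

/// Stand-in for a kernel that updates the state buffer in place.
fn in_place_step(live: &Buf) {
    for x in live.borrow_mut().iter_mut() {
        *x += 1.0;
    }
}
```

After `in_place_step`, a `shallow_snapshot` reflects the mutated values while a `deep_snapshot` still holds the values from capture time, which is the property a prefix snapshot needs if it is to be restored more than once.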
/// `x` shape: `(B, L, hidden_size)`. Returns the same shape.
pub fn forward(&mut self, x: &Tensor) -> candle_core::Result<Tensor> {
let (batch_size, seq_len, _) = x.dims3()?;
@@ -357,6 +397,16 @@ pub(crate) fn run_delta_rule(
head_k_dim: usize,
head_v_dim: usize,
) -> candle_core::Result<(Tensor, Tensor)> {
// Prefill takes the chunk-parallel algorithm (#23): identical
// delta-rule math reorganised into per-chunk matmuls (cuBLAS /
// tensor cores on CUDA, gemm on CPU) instead of an O(L)-sequential
// per-token recurrence. Decode steps (seq_len 1) and short
// prompts stay on the recurrent paths below. The env kill switch
// exists for A/B measurement on the fleet.
const CHUNK_ALGO_THRESHOLD: usize = 64;
if seq_len >= CHUNK_ALGO_THRESHOLD && chunked_prefill_enabled() {
return run_chunk_gated_delta_rule(q, k, v, g, beta, state);
}
#[cfg(feature = "cuda")]
{
// Only dispatch to the kernel if the inputs are on a CUDA
@@ -371,6 +421,198 @@ pub(crate) fn run_delta_rule(
run_delta_rule_rust(q, k, v, g, beta, state, seq_len)
}
/// `NEURON_GDN_CHUNKED=0` falls back to the per-token recurrent
/// paths for prefill — kept for A/B measurement on live hosts.
fn chunked_prefill_enabled() -> bool {
static ENABLED: std::sync::OnceLock<bool> = std::sync::OnceLock::new();
*ENABLED.get_or_init(|| {
std::env::var("NEURON_GDN_CHUNKED")
.map(|v| v != "0" && !v.eq_ignore_ascii_case("false"))
.unwrap_or(true)
})
}
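
The parsing rule inside the kill switch can be exercised in isolation. A small sketch, assuming a hypothetical helper `flag_enabled` that takes the raw variable value instead of reading the environment (the commit's function reads `NEURON_GDN_CHUNKED` directly and caches the answer in a `OnceLock`):

```rust
/// Hypothetical helper mirroring the kill-switch rule: the feature stays on
/// unless the value is exactly "0" or "false" in any ASCII case.
/// `None` models an unset variable.
fn flag_enabled(raw: Option<&str>) -> bool {
    raw.map(|v| v != "0" && !v.eq_ignore_ascii_case("false"))
        .unwrap_or(true)
}
```

Under this rule only `0` and `false` disable the chunked path; values such as `off`, `no`, or an empty string leave it enabled. Because the real function caches its result, changing the variable after the first call would have no effect within that process.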
/// Chunk-parallel gated delta rule — a faithful port of the HF
/// reference `torch_chunk_gated_delta_rule` (chunk_size = 64) in
/// `transformers/models/qwen3_5/modeling_qwen3_5.py`, minus the steps
/// our caller has already done (q/k L2-norm, q pre-scaled by
/// `1/sqrt(D_k)`, inputs already `(B, H, L, D)` f32).
///
/// Same inputs/outputs as [`run_delta_rule`]'s recurrent paths:
/// `q`/`k`: `(B, H, L, D_k)`, `v`: `(B, H, L, D_v)`, `g`/`beta`:
/// `(B, H, L)`, `state`: `(B, H, D_k, D_v)` (zeros or a restored
/// prefix snapshot's recurrent state). Returns
/// `(out: (B, H, L, D_v), state: (B, H, D_k, D_v))`, all f32.
///
/// The reference's in-place UT-transform row loop is kept as-is
/// (with rows accumulating into a fresh tensor — candle tensors are
/// immutable); see the numerical-caution note at the loop for why the
/// tempting nilpotent-squaring shortcut is wrong. The parity tests
/// pin this against the recurrent path.
pub(crate) fn run_chunk_gated_delta_rule(
q: &Tensor,
k: &Tensor,
v: &Tensor,
g: &Tensor,
beta: &Tensor,
state: Tensor,
) -> candle_core::Result<(Tensor, Tensor)> {
const C: usize = 64;
let (b, h, l, dk) = q.dims4()?;
let dv = v.dim(3)?;
let device = q.device().clone();
// Pad L up to a multiple of the chunk size. Padded positions
// carry beta = 0 (no state update) and g = 0 (no decay), so they
// are inert in the recurrence; their outputs are sliced off at
// the end.
let pad = (C - l % C) % C;
let (q, k, v, g, beta) = if pad > 0 {
(
q.pad_with_zeros(2, 0, pad)?,
k.pad_with_zeros(2, 0, pad)?,
v.pad_with_zeros(2, 0, pad)?,
g.pad_with_zeros(2, 0, pad)?,
beta.pad_with_zeros(2, 0, pad)?,
)
} else {
(q.clone(), k.clone(), v.clone(), g.clone(), beta.clone())
};
let lt = l + pad;
let n = lt / C;
let beta_e = beta.unsqueeze(3)?; // (B, H, Lt, 1)
let v_beta = v.broadcast_mul(&beta_e)?;
let k_beta = k.broadcast_mul(&beta_e)?;
// Chunk reshape, flattening (B, H, N) into one batch dim — candle's
// matmul supports at most two batch dims, so the chunk-local math
// runs rank-3 over B·H·N and reshapes back to rank-5 for the
// inter-chunk loop's per-chunk narrows.
let bhn = b * h * n;
let q3 = q.reshape((bhn, C, dk))?;
let k3 = k.reshape((bhn, C, dk))?;
let k_beta3 = k_beta.reshape((bhn, C, dk))?;
let v_beta3 = v_beta.reshape((bhn, C, dv))?;
// Within-chunk cumulative log-decay.
let g3 = g.reshape((bhn, C))?.cumsum(1)?;
// Lower-triangular masks, broadcast over the batch dim.
let tril_incl = {
let mut m = vec![0f32; C * C];
for i in 0..C {
for j in 0..=i {
m[i * C + j] = 1.0;
}
}
Tensor::from_vec(m, (C, C), &device)?
};
let tril_strict = {
let mut m = vec![0f32; C * C];
for i in 0..C {
for j in 0..i {
m[i * C + j] = 1.0;
}
}
Tensor::from_vec(m, (C, C), &device)?
};
// decay_mask[i][j] = exp(g_i - g_j) on the lower triangle
// (diagonal = 1), zero above. Mask-multiply replaces the
// reference's tril/exp/tril dance: upper entries become
// exp(0) = 1 mid-way and are re-zeroed.
let g_col = g3.unsqueeze(2)?; // (BHN, C, 1)
let g_row = g3.unsqueeze(1)?; // (BHN, 1, C)
let decay_mask3 = g_col
.broadcast_sub(&g_row)?
.broadcast_mul(&tril_incl)?
.exp()?
.broadcast_mul(&tril_incl)?
.contiguous()?;
// T = strict lower of -((k_beta k^T) ⊙ decay), then
// M = (I - T)^{-1} by forward substitution over rows — the
// reference's in-place UT-transform loop, with processed rows
// accumulating in `done` instead of mutating in place.
//
// Numerical caution: T is nilpotent (T^64 = 0), so the inverse
// also equals Π (I + T^(2^j)) — six matmuls — but that form is
// numerically unsafe: raw powers of T grow combinatorially
// (path counts up to C(62,31) ≈ 4.6e17) before nilpotency
// collapses them, destroying f32 precision on real prompts with
// correlated keys. The forward substitution's intermediates are
// the convergent M entries themselves, matching the reference's
// behaviour exactly. Pinned by `chunked_ut_transform_survives_
// correlated_keys`.
let kkt = k_beta3.matmul(&k3.transpose(1, 2)?.contiguous()?)?;
let t = kkt
.broadcast_mul(&decay_mask3)?
.broadcast_mul(&tril_strict)?
.neg()?
.contiguous()?;
let eye = Tensor::eye(C, candle_core::DType::F32, &device)?;
// Row 0 of the strict-lower T is all zeros and passes through
// unchanged, seeding the processed-rows accumulator.
let mut done = t.narrow(1, 0, 1)?.contiguous()?;
for i in 1..C {
let row = t.narrow(1, i, 1)?; // (BHN, 1, C)
let coeffs = row.narrow(2, 0, i)?.contiguous()?; // (BHN, 1, i)
let updated = (&row + coeffs.matmul(&done)?)?; // (BHN, 1, C)
done = Tensor::cat(&[&done, &updated], 1)?;
}
let m = done.broadcast_add(&eye)?.contiguous()?;
// value' = M v_beta ; k_cumdecay = M (k_beta ⊙ exp(g)).
let value_c3 = m.matmul(&v_beta3.contiguous()?)?;
let g_exp3 = g3.exp()?.unsqueeze(2)?; // (BHN, C, 1)
let k_cumdecay3 = m.matmul(&k_beta3.broadcast_mul(&g_exp3)?.contiguous()?)?;
// Rank-5 views for the per-chunk narrows below.
let q = q3.reshape((b, h, n, C, dk))?;
let k = k3.reshape((b, h, n, C, dk))?;
let value_c = value_c3.reshape((b, h, n, C, dv))?;
let k_cumdecay = k_cumdecay3.reshape((b, h, n, C, dk))?;
let decay_mask = decay_mask3.reshape((b, h, n, C, C))?;
let g = g3.reshape((b, h, n, C))?;
// Inter-chunk recurrence: a handful of matmuls per 64 tokens.
let mut state = state.to_dtype(candle_core::DType::F32)?;
let mut outs: Vec<Tensor> = Vec::with_capacity(n);
for i in 0..n {
let q_i = q.narrow(2, i, 1)?.squeeze(2)?.contiguous()?; // (B, H, C, Dk)
let k_i = k.narrow(2, i, 1)?.squeeze(2)?.contiguous()?;
let v_i = value_c.narrow(2, i, 1)?.squeeze(2)?.contiguous()?; // (B, H, C, Dv)
let dm_i = decay_mask.narrow(2, i, 1)?.squeeze(2)?; // (B, H, C, C)
let g_i = g.narrow(2, i, 1)?.squeeze(2)?; // (B, H, C)
let kcd_i = k_cumdecay.narrow(2, i, 1)?.squeeze(2)?.contiguous()?;
let attn = q_i
.matmul(&k_i.transpose(2, 3)?.contiguous()?)?
.broadcast_mul(&dm_i)?
.contiguous()?;
let v_prime = kcd_i.matmul(&state)?;
let v_new = (v_i - v_prime)?.contiguous()?;
let g_i_exp = g_i.exp()?.unsqueeze(3)?; // (B, H, C, 1)
let attn_inter = q_i.broadcast_mul(&g_i_exp)?.contiguous()?.matmul(&state)?;
let out_i = (attn_inter + attn.matmul(&v_new)?)?;
outs.push(out_i.unsqueeze(2)?);
// state ← state · exp(g_last) + (k_i ⊙ exp(g_last - g_i))^T v_new
let g_last = g_i.narrow(2, C - 1, 1)?; // (B, H, 1)
let carry = g_last.exp()?.unsqueeze(3)?; // (B, H, 1, 1)
let w = k_i.broadcast_mul(&g_last.broadcast_sub(&g_i)?.exp()?.unsqueeze(3)?)?;
state =
(state.broadcast_mul(&carry)? + w.transpose(2, 3)?.contiguous()?.matmul(&v_new)?)?;
}
let out = Tensor::cat(&outs, 2)?
.reshape((b, h, lt, dv))?
.narrow(2, 0, l)?
.contiguous()?;
Ok((out, state))
}
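
The padding argument in the function above (padded positions carry `beta = 0` and `g = 0`, so they leave the state untouched) can be checked against a scalar version of the per-token rule. The sketch below is a single-head, plain-`Vec` rendering of the recurrent gated delta rule as described in the HF reference; `chunk_padding` and `delta_rule_step` are hypothetical names for illustration, not functions from this commit.

```rust
/// Tokens of padding needed to round `l` up to a multiple of `c`.
fn chunk_padding(l: usize, c: usize) -> usize {
    (c - l % c) % c
}

/// One per-token gated delta-rule step for a single head.
/// `state` is `D_k x D_v`; returns the `D_v` output row.
fn delta_rule_step(
    state: &mut [Vec<f32>],
    q: &[f32],
    k: &[f32],
    v: &[f32],
    g: f32,
    beta: f32,
) -> Vec<f32> {
    let dv = v.len();
    // Decay the whole state by exp(g).
    let decay = g.exp();
    for row in state.iter_mut() {
        for s in row.iter_mut() {
            *s *= decay;
        }
    }
    // delta = beta * (v - k^T S)
    let mut delta = vec![0f32; dv];
    for j in 0..dv {
        let kv_mem = k
            .iter()
            .zip(state.iter())
            .map(|(ki, row)| ki * row[j])
            .sum::<f32>();
        delta[j] = (v[j] - kv_mem) * beta;
    }
    // S += outer(k, delta)
    for (ki, row) in k.iter().zip(state.iter_mut()) {
        for j in 0..dv {
            row[j] += ki * delta[j];
        }
    }
    // out = q^T S
    (0..dv)
        .map(|j| {
            q.iter()
                .zip(state.iter())
                .map(|(qi, row)| qi * row[j])
                .sum::<f32>()
        })
        .collect()
}
```

With `g = 0` the decay factor is `exp(0) = 1`, and with `beta = 0` the delta is zero, so a padded token changes nothing in `state`. Its output row is computed but discarded, matching the final `narrow(2, 0, l)` in the chunked function.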
/// CUDA path. Flattens (B, H, ...) → (BH, ...) at the kernel boundary
/// (the kernel uses BH = batch*heads as its outer batch axis) and
/// reshapes the kernel's outputs back to (B, H, ...) for the caller.
@@ -687,6 +929,151 @@ mod tests {
use super::*;
use candle_core::{DType, Device};
/// Plausible delta-rule inputs matching `run_delta_rule`'s
/// contract: q/k L2-normed (q pre-scaled by 1/sqrt(D_k)), g a
/// negative log-decay, beta in (0, 1). All f32 on CPU.
fn delta_rule_inputs(
b: usize,
h: usize,
l: usize,
dk: usize,
dv: usize,
) -> (Tensor, Tensor, Tensor, Tensor, Tensor) {
let dev = Device::Cpu;
let scale = 1.0 / (dk as f64).sqrt();
let q = Tensor::randn(0f32, 1.0, (b, h, l, dk), &dev).unwrap();
let q = (l2norm(&q, 1e-6).unwrap() * scale).unwrap();
let k = Tensor::randn(0f32, 1.0, (b, h, l, dk), &dev).unwrap();
let k = l2norm(&k, 1e-6).unwrap();
let v = (Tensor::randn(0f32, 1.0, (b, h, l, dv), &dev).unwrap() * 0.5).unwrap();
// g in (-1, 0): a realistic per-token log-decay.
let g = (Tensor::rand(0f32, 1f32, (b, h, l), &dev).unwrap() * -1.0).unwrap();
let beta = Tensor::rand(0.05f32, 0.95f32, (b, h, l), &dev).unwrap();
(q, k, v, g, beta)
}
fn max_abs_diff(a: &Tensor, b: &Tensor) -> f32 {
(a - b)
.unwrap()
.abs()
.unwrap()
.flatten_all()
.unwrap()
.max(0)
.unwrap()
.to_scalar::<f32>()
.unwrap()
}
/// The #23 parity gate: the chunk-parallel algorithm must produce
/// the same outputs and final state as the per-token recurrence.
/// L = 130 exercises the pad-to-chunk-multiple path (130 = 2×64 + 2).
#[test]
fn chunked_matches_recurrent_with_padding() {
let (b, h, l, dk, dv) = (1, 2, 130, 16, 16);
let (q, k, v, g, beta) = delta_rule_inputs(b, h, l, dk, dv);
let zeros = || Tensor::zeros((b, h, dk, dv), DType::F32, &Device::Cpu).unwrap();
let (out_rec, state_rec) = run_delta_rule_rust(&q, &k, &v, &g, &beta, zeros(), l).unwrap();
let (out_chk, state_chk) =
run_chunk_gated_delta_rule(&q, &k, &v, &g, &beta, zeros()).unwrap();
assert_eq!(out_chk.dims(), out_rec.dims());
let d_out = max_abs_diff(&out_rec, &out_chk);
let d_state = max_abs_diff(&state_rec, &state_chk);
assert!(d_out < 2e-4, "output diverged: {d_out}");
assert!(d_state < 2e-4, "final state diverged: {d_state}");
}
/// Exact chunk multiple (no padding) continuing from a non-zero
/// initial state — the prefix-cache-restore (#11) interaction.
#[test]
fn chunked_matches_recurrent_with_initial_state() {
let (b, h, dk, dv) = (1, 2, 16, 16);
let dev = Device::Cpu;
// Build a non-trivial initial state by running the recurrent
// path over a 50-token "restored prefix".
let (pq, pk, pv, pg, pbeta) = delta_rule_inputs(b, h, 50, dk, dv);
let zeros = Tensor::zeros((b, h, dk, dv), DType::F32, &dev).unwrap();
let (_, state0) = run_delta_rule_rust(&pq, &pk, &pv, &pg, &pbeta, zeros, 50).unwrap();
let l = 128;
let (q, k, v, g, beta) = delta_rule_inputs(b, h, l, dk, dv);
let (out_rec, state_rec) =
run_delta_rule_rust(&q, &k, &v, &g, &beta, state0.clone(), l).unwrap();
let (out_chk, state_chk) =
run_chunk_gated_delta_rule(&q, &k, &v, &g, &beta, state0).unwrap();
let d_out = max_abs_diff(&out_rec, &out_chk);
let d_state = max_abs_diff(&state_rec, &state_chk);
assert!(d_out < 2e-4, "output diverged: {d_out}");
assert!(d_state < 2e-4, "final state diverged: {d_state}");
}
/// Adversarially correlated inputs: near-identical keys with
/// beta ≈ 1 and negligible decay make the UT-transform matrix T
/// maximally coherent — raw powers of T grow combinatorially
/// (≈ C(62,31) paths), which destroyed f32 precision in the
/// nilpotent-squaring formulation this test exists to forbid.
/// Real prompts hit this through repetitive text (observed live
/// on beast: NaN logits → "!!!" replies). Forward substitution
/// must stay finite and match the recurrent path.
#[test]
fn chunked_ut_transform_survives_correlated_keys() {
let (b, h, l, dk, dv) = (1, 1, 192, 16, 16);
let dev = Device::Cpu;
let scale = 1.0 / (dk as f64).sqrt();
// One base direction plus a whisper of noise: every key is
// nearly the same unit vector.
let base = Tensor::randn(0f32, 1.0, (1, 1, 1, dk), &dev).unwrap();
let noise = (Tensor::randn(0f32, 1.0, (b, h, l, dk), &dev).unwrap() * 0.01).unwrap();
let k = l2norm(&base.broadcast_add(&noise).unwrap(), 1e-6).unwrap();
let q = (l2norm(&base.broadcast_add(&noise).unwrap(), 1e-6).unwrap() * scale).unwrap();
let v = (Tensor::randn(0f32, 1.0, (b, h, l, dv), &dev).unwrap() * 0.5).unwrap();
// Almost no decay, near-unit update rate — worst case for T.
let g = (Tensor::rand(0f32, 1f32, (b, h, l), &dev).unwrap() * -1e-3).unwrap();
let beta = Tensor::rand(0.98f32, 0.999f32, (b, h, l), &dev).unwrap();
let zeros = || Tensor::zeros((b, h, dk, dv), DType::F32, &dev).unwrap();
let (out_rec, state_rec) = run_delta_rule_rust(&q, &k, &v, &g, &beta, zeros(), l).unwrap();
let (out_chk, state_chk) =
run_chunk_gated_delta_rule(&q, &k, &v, &g, &beta, zeros()).unwrap();
let finite: Vec<f32> = out_chk.flatten_all().unwrap().to_vec1().unwrap();
assert!(
finite.iter().all(|x| x.is_finite()),
"chunked output not finite on correlated inputs"
);
let d_out = max_abs_diff(&out_rec, &out_chk);
let d_state = max_abs_diff(&state_rec, &state_chk);
assert!(
d_out < 5e-3,
"output diverged on correlated inputs: {d_out}"
);
assert!(
d_state < 5e-3,
"final state diverged on correlated inputs: {d_state}"
);
}
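
The forward substitution that this test protects can be written out on plain nested vectors. A minimal sketch, assuming hypothetical helpers `invert_i_minus_t`, `identity_minus` and `matmul` (f64, single matrix, no batching); it shows only the row recurrence `A[i] = T[i] + sum_{k<i} T[i][k] * A[k]` with `M = A + I`, not the candle implementation.

```rust
/// Invert `I - T` for a strictly lower-triangular `T` by forward
/// substitution over rows. Row 0 of `T` is all zeros and seeds the
/// accumulator unchanged.
fn invert_i_minus_t(t: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = t.len();
    let mut done: Vec<Vec<f64>> = Vec::with_capacity(n);
    for i in 0..n {
        let mut row = t[i].clone();
        for k in 0..i {
            let coeff = t[i][k];
            for j in 0..n {
                row[j] += coeff * done[k][j];
            }
        }
        done.push(row);
    }
    // M = A + I
    for (i, row) in done.iter_mut().enumerate() {
        row[i] += 1.0;
    }
    done
}

/// Build `I - T`.
fn identity_minus(t: &[Vec<f64>]) -> Vec<Vec<f64>> {
    t.iter()
        .enumerate()
        .map(|(i, row)| {
            row.iter()
                .enumerate()
                .map(|(j, &x)| if i == j { 1.0 - x } else { -x })
                .collect()
        })
        .collect()
}

/// Dense matrix product, used here only to verify the inverse.
fn matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (n, m, p) = (a.len(), b.len(), b[0].len());
    let mut out = vec![vec![0.0; p]; n];
    for i in 0..n {
        for k in 0..m {
            for j in 0..p {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}
```

Multiplying `identity_minus(&t)` by `invert_i_minus_t(&t)` should give the identity. Every intermediate row here is already a row of the final inverse, which is the property the test's doc comment contrasts with forming raw powers of `T`.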
/// A single exact chunk — the smallest input the dispatch sends to
/// the chunked path.
#[test]
fn chunked_matches_recurrent_single_chunk() {
let (b, h, l, dk, dv) = (2, 3, 64, 8, 8);
let (q, k, v, g, beta) = delta_rule_inputs(b, h, l, dk, dv);
let zeros = || Tensor::zeros((b, h, dk, dv), DType::F32, &Device::Cpu).unwrap();
let (out_rec, state_rec) = run_delta_rule_rust(&q, &k, &v, &g, &beta, zeros(), l).unwrap();
let (out_chk, state_chk) =
run_chunk_gated_delta_rule(&q, &k, &v, &g, &beta, zeros()).unwrap();
let d_out = max_abs_diff(&out_rec, &out_chk);
let d_state = max_abs_diff(&state_rec, &state_chk);
assert!(d_out < 2e-4, "output diverged: {d_out}");
assert!(d_state < 2e-4, "final state diverged: {d_state}");
}
#[test]
fn softplus_small_x() {
// softplus(0) = ln(2) ≈ 0.6931
@@ -737,6 +1124,8 @@ mod tests {
rope_theta: 10000.0,
partial_rotary_factor: 1.0,
rope_type: None,
mrope_section: Vec::new(),
mrope_interleaved: false,
},
rms_norm_eps: 1e-6,
tie_word_embeddings: false,

View File

@@ -78,6 +78,8 @@ pub mod linear_attn;
pub mod mlp;
pub mod rmsnorm;
pub mod rope;
pub mod snapshot;
pub mod vision;
use decoder::Qwen3_5DecoderLayer;
use rmsnorm::Qwen3_5RmsNorm;
@@ -99,6 +101,20 @@ pub struct Config {
pub model_type: String,
/// The text-side hyperparameters. Everything we actually need.
pub text_config: TextConfig,
/// Vision tower hyperparameters. Present on multimodal
/// checkpoints (e.g. Qwen/Qwen3.6-27B); absent on text-only
/// variants. When present, `Qwen3_5ForCausalLM::new` loads the
/// vision tower alongside the language model so vision-bearing
/// requests can splice image embeddings at `<|image_pad|>` token
/// positions.
#[serde(default)]
pub vision_config: Option<vision::VisionConfig>,
/// Token id the chat template emits per image patch group.
/// Mirrors the LM tokenizer's `<|image_pad|>` id (248056 for
/// Qwen3.6). The runtime locates these in the prompt and splices
/// in `VisionTower::forward` output. `None` for text-only models.
#[serde(default)]
pub image_token_id: Option<u32>,
}
/// Inner config (the `text_config` block). Mirrors the Qwen3 layout
@@ -176,11 +192,12 @@ fn default_hidden_act() -> String {
}
/// Nested `rope_parameters` block from a Qwen3-Next `config.json`.
-/// `mrope_section` and `mrope_interleaved` are accepted via the
-/// `#[serde(default)]` flatten-tolerance below but ignored: we treat
-/// MRoPE as plain RoPE for text-only inference (the three position
-/// grids carry identical ids when there's no vision input, so the
-/// interleaving is a no-op).
+///
+/// For text-only inference the three MRoPE position grids carry
+/// identical ids, so the interleave is a no-op and plain RoPE applies.
+/// For vision inputs `mrope_section` + `mrope_interleaved` drive the
+/// per-axis (text/height/width) rotary used by image tokens; see
+/// `rope.rs`.
#[derive(Debug, Clone, Deserialize)]
pub struct RopeParameters {
/// Base for the inverse-frequency computation. Qwen3.6: 10_000_000.
@@ -196,6 +213,16 @@ pub struct RopeParameters {
/// implemented here.
#[serde(default)]
pub rope_type: Option<String>,
/// MRoPE per-axis section sizes `[text, height, width]` — e.g.
/// `[11, 11, 10]` for Qwen3.6, summing to the rotary half-dim.
/// Empty for models that don't declare MRoPE (→ plain RoPE).
#[serde(default)]
pub mrope_section: Vec<usize>,
/// Whether the three MRoPE axes are interleaved per-frequency
/// (Qwen3-VL / Qwen3.6 style, `true`) rather than block-concatenated
/// (Qwen2-VL style, `false`).
#[serde(default)]
pub mrope_interleaved: bool,
}
fn default_rope_theta() -> f64 {
@@ -206,6 +233,80 @@ fn default_partial_rotary_factor() -> f32 {
1.0
}
/// Splice rows from `img` into `h` at `positions`. Stage B helper.
///
/// `h`: `(1, L, hidden)` — the LM's input embedding tensor after
/// `embed_tokens.forward`.
/// `img`: `(N_img, hidden)` — image embeddings, one row per
/// `<|image_pad|>` token in the prompt. Must already be in `h.dtype()`.
/// `positions`: indices into the `L` axis where image rows go;
/// `positions.len() == N_img`.
///
/// Approach: group `positions` into contiguous runs (because the chat
/// template emits `<|vision_start|><|image_pad|>×N<|vision_end|>` —
/// the pad tokens for each image land in one contiguous span), then
/// `slice_assign` per run. For typical Qwen3.6 requests this is one
/// or two runs per image; `slice_assign` does one tensor copy per
/// run, which is cheap relative to the decoder forward pass.
pub(crate) fn splice_runs(
h: &Tensor,
img: &Tensor,
positions: &[u32],
) -> candle_core::Result<Tensor> {
debug_assert!(
!positions.is_empty(),
"splice_runs precondition: non-empty positions"
);
let hidden = h.dim(2)?;
let mut out = h.clone();
let mut img_offset = 0_usize;
let mut run_start = positions[0] as usize;
let mut run_end_exclusive = run_start + 1;
for &p in &positions[1..] {
let p = p as usize;
if p == run_end_exclusive {
run_end_exclusive = p + 1;
} else {
apply_run(
&mut out,
img,
&mut img_offset,
run_start,
run_end_exclusive,
hidden,
)?;
run_start = p;
run_end_exclusive = p + 1;
}
}
apply_run(
&mut out,
img,
&mut img_offset,
run_start,
run_end_exclusive,
hidden,
)?;
Ok(out)
}
fn apply_run(
out: &mut Tensor,
img: &Tensor,
img_offset: &mut usize,
run_start: usize,
run_end_exclusive: usize,
hidden: usize,
) -> candle_core::Result<()> {
let run_len = run_end_exclusive - run_start;
let slice = img
.narrow(0, *img_offset, run_len)?
.reshape((1, run_len, hidden))?;
*out = out.slice_assign(&[0..1, run_start..run_end_exclusive, 0..hidden], &slice)?;
*img_offset += run_len;
Ok(())
}
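
The run grouping that `splice_runs` performs before each `slice_assign` can be separated from the tensor work. A small sketch, assuming a hypothetical helper `contiguous_runs` that returns half-open `(start, end_exclusive)` ranges for ascending positions:

```rust
/// Group ascending positions into half-open contiguous runs.
/// Hypothetical helper; unlike `splice_runs`, it accepts an empty slice.
fn contiguous_runs(positions: &[u32]) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut iter = positions.iter().map(|&p| p as usize);
    let Some(first) = iter.next() else {
        return runs;
    };
    let (mut start, mut end) = (first, first + 1);
    for p in iter {
        if p == end {
            // Extends the current run.
            end = p + 1;
        } else {
            // Gap: close the current run and open a new one.
            runs.push((start, end));
            start = p;
            end = p + 1;
        }
    }
    runs.push((start, end));
    runs
}
```

For a prompt with a single image, the pad tokens form one run, so the splice reduces to one copy. Each run consumes `end - start` consecutive rows of the image-embedding tensor, which is what `img_offset` tracks in `apply_run`.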
/// Qwen3-Next base transformer (embedding + decoder stack + final
/// norm). Public so a TP variant in `harness/tp/tp_qwen3_5.rs` can
/// also build on it later — for now only `Qwen3_5ForCausalLM` is the
@@ -214,6 +315,16 @@ pub struct Qwen3_5Model {
embed_tokens: Embedding,
layers: Vec<Qwen3_5DecoderLayer>,
norm: Qwen3_5RmsNorm,
/// Shared with every full-attention layer; the model uses it to
/// build the per-forward cos/sin (interleaved M-RoPE for image
/// tokens, plain for text) once, which the layers then apply.
rotary: Arc<RotaryEmbedding>,
/// `offset + rope_delta` is the text-axis position during decode.
/// 0 for text-only; set from `get_rope_index` during a vision
/// prefill (image tokens compress the position space, so text after
/// the image resumes from a smaller counter than the sequence
/// index). Reset in `clear_kv_cache`.
rope_delta: i64,
device: Device,
dtype: DType,
}
@@ -265,6 +376,8 @@ impl Qwen3_5Model {
embed_tokens,
layers,
norm,
rotary,
rope_delta: 0,
device,
dtype,
})
@@ -278,6 +391,45 @@ impl Qwen3_5Model {
for l in &mut self.layers {
l.clear_kv_cache();
}
// New request → no image-compressed position offset until the
// next vision prefill sets one.
self.rope_delta = 0;
}
/// Capture every layer's cache state plus the rope position
/// counter as one consistent prefix snapshot (#11). Only valid at
/// a token boundary — i.e. between forward calls, which is the
/// only time the caller can reach this anyway.
pub fn snapshot_kv_cache(&self) -> candle_core::Result<snapshot::KvCacheSnapshot> {
let layers = self
.layers
.iter()
.map(|l| l.snapshot_kv())
.collect::<candle_core::Result<Vec<_>>>()?;
Ok(snapshot::KvCacheSnapshot {
layers,
rope_delta: self.rope_delta,
})
}
/// Replace the live cache state with a previously captured
/// snapshot. The snapshot stays valid for further restores.
pub fn restore_kv_cache(
&mut self,
snap: &snapshot::KvCacheSnapshot,
) -> candle_core::Result<()> {
if snap.layers.len() != self.layers.len() {
candle_core::bail!(
"restore_kv_cache: snapshot has {} layers, model has {}",
snap.layers.len(),
self.layers.len()
);
}
for (layer, layer_snap) in self.layers.iter_mut().zip(snap.layers.iter()) {
layer.restore_kv(layer_snap)?;
}
self.rope_delta = snap.rope_delta;
Ok(())
}
fn causal_mask(&self, b: usize, tgt: usize, offset: usize) -> candle_core::Result<Tensor> {
@@ -289,8 +441,141 @@ impl Qwen3_5Model {
}
pub fn forward(&mut self, input: &Tensor, offset: usize) -> candle_core::Result<Tensor> {
self.forward_inner(input, offset, None, None, &[], None)
}
/// Forward for a vision-prefill chunk: optional image-embedding
/// splice plus explicit interleaved-M-RoPE `position_ids` (the
/// chunk's slice of the full prompt's 3D positions). Mirrors the TP
/// `TpQwen3_5Model::forward_with_positions` — used by
/// `Qwen3_5ForCausalLM::prefill_with_images_chunked`, which computes
/// the positions once over the whole prompt and slices them per
/// chunk so the position counters stay consistent across chunk
/// boundaries (an image compresses the position space, so per-chunk
/// offset arithmetic would be wrong).
pub fn forward_with_positions(
&mut self,
input: &Tensor,
offset: usize,
position_ids: &Tensor,
image_embeds: Option<&Tensor>,
image_token_id: Option<u32>,
) -> candle_core::Result<Tensor> {
self.forward_inner(
input,
offset,
image_embeds,
image_token_id,
&[],
Some(position_ids),
)
}
/// Forward with image-embedding splice. Stage B of the vision plan.
///
/// `input_ids`: `(1, L)` token ids — same shape the text-only
/// `forward` accepts (single-batch; multi-batch vision is not in
/// scope today).
/// `image_embeds`: `(N_image_tokens, hidden_size)` — concatenation
/// of every image's post-merger embedding (`VisionTower::forward`
/// output), in the same order images appear in the input. The
/// caller has already done the per-image patch-count expansion of
/// `<|image_pad|>` tokens in `input_ids`, so `N_image_tokens`
/// equals the number of `image_token_id` positions in `input_ids`.
/// `image_token_id`: the sentinel token (e.g. 248056 for Qwen3.6).
///
/// The splice replaces the LM's text-side embedding at each
/// `image_token_id` position with the corresponding row from
/// `image_embeds`. After the splice the decoder runs the interleaved
/// M-RoPE path: `grids` carries each image's post-merge LM grid
/// `(lm_gh, lm_gw)` so `get_rope_index` assigns image tokens their 2D
/// coordinates (dynamic resolution, #14).
pub fn forward_with_vision(
&mut self,
input_ids: &Tensor,
offset: usize,
image_embeds: &Tensor,
image_token_id: u32,
grids: &[(usize, usize)],
) -> candle_core::Result<Tensor> {
self.forward_inner(
input_ids,
offset,
Some(image_embeds),
Some(image_token_id),
grids,
None,
)
}
/// Shared forward. Splices image embeddings at `image_token_id`
/// positions when present, then builds the rotary cos/sin, in
/// precedence order: explicit `position_ids` (interleaved M-RoPE,
/// the chunked-vision path that slices a once-computed position
/// tensor) > internal M-RoPE from `grids` (single-shot vision) >
/// plain positions at `offset + rope_delta` (text / decode).
fn forward_inner(
&mut self,
input: &Tensor,
offset: usize,
image_embeds: Option<&Tensor>,
image_token_id: Option<u32>,
grids: &[(usize, usize)],
position_ids: Option<&Tensor>,
) -> candle_core::Result<Tensor> {
let (b, l) = input.dims2()?;
let mut h = self.embed_tokens.forward(input)?;
// Splice image embeddings at `image_token_id` positions, when
// this forward carries any. Independent of how cos/sin is built.
if let (Some(img), Some(tok_id)) = (image_embeds, image_token_id) {
let ids: Vec<u32> = input.flatten_all()?.to_vec1()?;
let mut positions: Vec<u32> = Vec::with_capacity(img.dim(0)?);
for (idx, id) in ids.iter().enumerate() {
if *id == tok_id {
positions.push(idx as u32);
}
}
let n_img_tokens = img.dim(0)?;
if positions.len() != n_img_tokens {
candle_core::bail!(
"forward_with_vision: chunk has {} image-token positions but \
image_embeds carries {} tokens — per-image patch-count expansion \
/ chunk slicing mismatch",
positions.len(),
n_img_tokens,
);
}
if !positions.is_empty() {
// Cast image_embeds to the LM's dtype, then splice each
// contiguous `<|image_pad|>` run into the embedding tensor.
let img = img.to_dtype(self.dtype)?;
h = splice_runs(&h, &img, &positions)?;
}
}
// Build interleaved M-RoPE cos/sin so image tokens carry their
// 2D (lm_gh × lm_gw) grid coordinates. Text / decode take the
// plain-RoPE fast path — bit-for-bit the pre-M-RoPE behaviour
// when `rope_delta == 0`.
let (cos, sin) = if let Some(pos) = position_ids {
// Pre-computed positions sliced for this chunk. The caller
// computed them over the whole prompt, so the image-compressed
// offset is already reflected in `pos`.
self.rotary.mrope_cos_sin(pos)?
} else if let Some(tok_id) = image_token_id {
// Single-shot vision: compute the whole prompt's M-RoPE here
// and stash `rope_delta` for the decode that follows.
let ids: Vec<u32> = input.flatten_all()?.to_vec1()?;
let (text, height, width, delta) = rope::get_rope_index(&ids, tok_id, grids)
.map_err(|e| candle_core::Error::Msg(format!("get_rope_index: {e}")))?;
self.rope_delta = delta;
let pos = rope::mrope_position_tensor(&text, &height, &width, &self.device)?;
self.rotary.mrope_cos_sin(&pos)?
} else {
let base = (offset as i64 + self.rope_delta).max(0) as usize;
self.rotary.plain_cos_sin(base, l)?
};
// Causal mask only needed for L > 1 prefill; full-attention
// layers consume it via broadcast_add. Linear-attention layers
// ignore the mask.
@@ -300,7 +585,7 @@ impl Qwen3_5Model {
Some(self.causal_mask(b, l, offset)?)
};
for layer in &mut self.layers {
-h = layer.forward(&h, causal.as_ref(), offset)?;
+h = layer.forward(&h, causal.as_ref(), &cos, &sin)?;
}
self.norm.forward(&h)
}
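
The plain-RoPE fallback in `forward_inner` derives its starting position from the sequence offset plus `rope_delta`, clamped at zero. A minimal sketch of that arithmetic, assuming a hypothetical free function `plain_rope_base`; the numbers in the usage note are made up for illustration.

```rust
/// Starting text-axis position for a plain-RoPE forward.
/// `rope_delta` is 0 for text-only requests and typically negative after
/// a vision prefill, because image tokens compress the position space.
fn plain_rope_base(offset: usize, rope_delta: i64) -> usize {
    (offset as i64 + rope_delta).max(0) as usize
}
```

For example, with 100 tokens already in the cache and a hypothetical `rope_delta` of -40, the next decode token would be rotated at position 60 rather than 100. With `rope_delta == 0` the result equals `offset`, so text-only requests follow the same positions as before.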
@@ -309,6 +594,15 @@ impl Qwen3_5Model {
pub struct Qwen3_5ForCausalLM {
base: Qwen3_5Model,
lm_head: Linear,
/// Vision tower (Stage A4). `None` for text-only checkpoints or
/// when the operator has opted out. When present, the harness's
/// `Job::EncodeImage` dispatch path runs `vision.forward(image)`
/// and the LM forward (Stage B) splices the result at
/// `image_token_id` positions in the input embedding stream.
vision: Option<vision::VisionTower>,
/// Mirrors `Config::image_token_id`. Cached here so the runtime
/// doesn't have to round-trip through the parsed config struct.
image_token_id: Option<u32>,
}
impl Qwen3_5ForCausalLM {
@@ -324,7 +618,52 @@ impl Qwen3_5ForCausalLM {
.with_context(|| format!("load '{}/lm_head/weight'", vb.prefix()))?;
Linear::new(weight, None)
};
-Ok(Self { base, lm_head })
// Stage A4: load the vision tower when the config carries a
// `vision_config` block and the safetensors actually carry
// `model.visual.*` weights. The `Option<VisionConfig>` on the
// config makes this a single-source-of-truth decision —
// text-only checkpoints just leave `vision_config` unset and
// get `None` here without any extra plumbing.
let vision = if let Some(vcfg) = config.vision_config.clone() {
tracing::info!(
depth = vcfg.depth,
hidden_size = vcfg.hidden_size,
"loading qwen3_5 vision tower"
);
Some(
vision::VisionTower::load(vcfg, vb.pp("model.visual"))
.context("load qwen3_5 vision tower (model.visual.*)")?,
)
} else {
None
};
Ok(Self {
base,
lm_head,
vision,
image_token_id: config.image_token_id,
})
}
/// True when this checkpoint loaded a vision tower. Used by the
/// HTTP layer to advertise vision capability in `/v1/models` and
/// to reject image-bearing requests against text-only loads with
/// a clean 400.
pub fn has_vision(&self) -> bool {
self.vision.is_some()
}
/// Vision tower handle, if loaded. The device-worker
/// `EncodeImage` job dispatches to `vision.forward(image)`.
pub fn vision(&self) -> Option<&vision::VisionTower> {
self.vision.as_ref()
}
/// `<|image_pad|>` token id from `config.json`, when known.
/// The Stage B prompt-builder uses this to count expansion targets
/// and the LM forward uses it to locate splice positions.
pub fn image_token_id(&self) -> Option<u32> {
self.image_token_id
}
/// `input`: token-id tensor of shape `(B, L)`. Returns logits at
@@ -337,9 +676,192 @@ impl Qwen3_5ForCausalLM {
hidden.i((.., l - 1.., ..))?.apply(&self.lm_head)
}
/// Stage B: forward with image-embedding splice. Mirrors `forward`
/// but routes through `Qwen3_5Model::forward_with_vision` so the
/// LM's input embeddings get the image patches spliced in at
/// `image_token_id` positions before the decoder stack runs.
pub fn forward_with_vision(
&mut self,
input: &Tensor,
offset: usize,
image_embeds: &Tensor,
image_token_id: u32,
grids: &[(usize, usize)],
) -> candle_core::Result<Tensor> {
let (_, l) = input.dims2()?;
let hidden =
self.base
.forward_with_vision(input, offset, image_embeds, image_token_id, grids)?;
hidden.i((.., l - 1.., ..))?.apply(&self.lm_head)
}
/// Forward for a vision-prefill chunk: explicit M-RoPE positions +
/// optional image splice. Mirrors `forward_with_vision` but routes
/// through `Qwen3_5Model::forward_with_positions`. Used by
/// [`Self::prefill_with_images_chunked`].
pub fn forward_with_positions(
&mut self,
input: &Tensor,
offset: usize,
position_ids: &Tensor,
image_embeds: Option<&Tensor>,
image_token_id: Option<u32>,
) -> candle_core::Result<Tensor> {
let (_, l) = input.dims2()?;
let hidden = self.base.forward_with_positions(
input,
offset,
position_ids,
image_embeds,
image_token_id,
)?;
hidden.i((.., l - 1.., ..))?.apply(&self.lm_head)
}
/// Encode every preprocessed `(C, H, W)` image once through the
/// vision tower and concatenate along the patch axis →
/// `(sum_patches, hidden)`. Done once per prefill, not per chunk.
fn encode_images_concat(&self, image_pixels: &[Tensor]) -> candle_core::Result<Tensor> {
let tower = self.vision.as_ref().ok_or_else(|| {
candle_core::Error::Msg(
"encode_images_concat: loaded without a vision tower \
(config.json::vision_config absent or weights missing)"
.into(),
)
})?;
let mut per_image = Vec::with_capacity(image_pixels.len());
for (idx, img) in image_pixels.iter().enumerate() {
let embed = tower
.forward(img)
.map_err(|e| candle_core::Error::Msg(format!("encode image[{idx}]: {e:#}")))?;
per_image.push(embed);
}
Tensor::cat(&per_image.iter().collect::<Vec<_>>(), 0)
}
/// Chunked image prefill for the single-GPU path (#18) — parity with
/// `TpQwen3_5ForCausalLM::prefill_with_images_chunked`. Encodes the
/// image(s) once, then walks the (pre-expanded) prompt in
/// `chunk_size`-token windows — exactly like the text
/// `chunked_prefill_*` paths — splicing the patch embeddings into
/// whichever chunk(s) carry `<|image_pad|>` positions. Activation
/// memory is bounded by the chunk, not the full prompt, so a long
/// vision context no longer single-shot-OOMs.
///
/// The KV cache (and GDN recurrent state) accumulate across chunks
/// via the growing offset — the same per-chunk associativity the
/// text chunked prefill and prefix cache (#11/#23) rely on. Only the
/// final chunk's last-position logits are returned; intermediate
/// chunks just populate the cache. The caller is responsible for
/// clearing the cache first.
///
/// `base_offset` is the KV position the prefill starts at (0 for a
/// fresh request). `image_pixels` are device-resident `(C, H, W)`
/// tensors; grids and the interleaved-M-RoPE position ids are
/// recomputed here so an image's position compression is consistent
/// across chunk boundaries.
pub fn prefill_with_images_chunked(
&mut self,
tokens: &[u32],
base_offset: usize,
image_pixels: &[Tensor],
image_token_id: u32,
chunk_size: usize,
) -> candle_core::Result<Tensor> {
if image_pixels.is_empty() {
candle_core::bail!("prefill_with_images_chunked: called with zero images");
}
if tokens.is_empty() {
candle_core::bail!("prefill_with_images_chunked: empty prompt");
}
let chunk_size = chunk_size.max(1);
let device = self.base.device.clone();
let image_embeds = self.encode_images_concat(image_pixels)?;
// Each image's LM grid (lm_gh, lm_gw) = (h/factor, w/factor),
// factor = patch×merge — recomputed from the pixel tensors (#14
// dynamic resolution).
let factor = self
.vision
.as_ref()
.map(|v| {
let c = v.config();
c.patch_size * c.spatial_merge_size
})
.ok_or_else(|| {
candle_core::Error::Msg(
"prefill_with_images_chunked: loaded without a vision tower".into(),
)
})?;
let grids: Vec<(usize, usize)> = image_pixels
.iter()
.map(|t| {
let (_, h, w) = t.dims3()?;
Ok::<(usize, usize), candle_core::Error>((h / factor, w / factor))
})
.collect::<candle_core::Result<Vec<_>>>()?;
// Interleaved-M-RoPE 3D positions for the whole prompt, computed
// once and sliced per chunk so image tokens get their grid
// coordinates and text after an image resumes from the
// compressed counter. `rope_delta` is stashed on the base model
// for the decode that follows this prefill.
let (text, height, width, delta) = rope::get_rope_index(tokens, image_token_id, &grids)
.map_err(|e| candle_core::Error::Msg(format!("get_rope_index: {e}")))?;
self.base.rope_delta = delta;
let full_pos = rope::mrope_position_tensor(&text, &height, &width, &device)?;
let mut last_logits: Option<Tensor> = None;
// Rows of `image_embeds` already spliced by earlier chunks. The
// `<|image_pad|>` run is contiguous, so chunks consume embedding
// rows in order.
let mut img_off = 0usize;
let mut start = 0usize;
while start < tokens.len() {
let end = (start + chunk_size).min(tokens.len());
let chunk = &tokens[start..end];
let input = Tensor::new(chunk, &device)?.unsqueeze(0)?;
let pos_slice = full_pos.narrow(1, start, end - start)?;
let n_here = chunk.iter().filter(|&&t| t == image_token_id).count();
let logits = if n_here == 0 {
self.forward_with_positions(&input, base_offset + start, &pos_slice, None, None)?
} else {
// Splice the next `n_here` patch rows at this chunk's
// local image-pad positions.
let rows = image_embeds.narrow(0, img_off, n_here)?;
img_off += n_here;
self.forward_with_positions(
&input,
base_offset + start,
&pos_slice,
Some(&rows),
Some(image_token_id),
)?
};
last_logits = Some(logits);
start = end;
}
last_logits
.ok_or_else(|| candle_core::Error::Msg("prefill_with_images_chunked: no chunks".into()))
}
pub fn clear_kv_cache(&mut self) {
self.base.clear_kv_cache();
}
/// See [`Qwen3_5Model::snapshot_kv_cache`].
pub fn snapshot_kv_cache(&self) -> candle_core::Result<snapshot::KvCacheSnapshot> {
self.base.snapshot_kv_cache()
}
/// See [`Qwen3_5Model::restore_kv_cache`].
pub fn restore_kv_cache(
&mut self,
snap: &snapshot::KvCacheSnapshot,
) -> candle_core::Result<()> {
self.base.restore_kv_cache(snap)
}
}
#[cfg(test)]
@@ -394,4 +916,50 @@ mod tests {
assert_eq!(cfg.text_config.rope_parameters.rope_theta, 10_000_000.0);
assert!((cfg.text_config.rope_parameters.partial_rotary_factor - 0.25).abs() < 1e-6);
}
/// `splice_runs` replaces (1, L, H) embedding rows at the given
/// positions with rows from a (N_img, H) image-embedding tensor,
/// in the order positions are supplied.
#[test]
fn splice_runs_replaces_at_contiguous_positions() {
use candle_core::Device;
let dev = Device::Cpu;
// (1, L=5, H=2) text embeddings, encoded as floats so the
// assertion can spot the change without dtype conversion.
let h_vals: Vec<f32> = vec![
10., 11., // pos 0
20., 21., // pos 1
30., 31., // pos 2
40., 41., // pos 3
50., 51., // pos 4
];
let h = Tensor::from_vec(h_vals, (1, 5, 2), &dev).unwrap();
// Two image embeddings to splice at positions 1 and 2 (a
// contiguous run: a single image emitting two patch tokens).
let img_vals: Vec<f32> = vec![-1., -2., -3., -4.];
let img = Tensor::from_vec(img_vals, (2, 2), &dev).unwrap();
let out = splice_runs(&h, &img, &[1, 2]).unwrap();
let flat: Vec<f32> = out.flatten_all().unwrap().to_vec1().unwrap();
assert_eq!(flat, vec![10., 11., -1., -2., -3., -4., 40., 41., 50., 51.]);
}
/// Non-contiguous positions: two images at positions [1] and [3]
/// each contributing one patch. `splice_runs` should iterate
/// runs and place the corresponding image rows.
#[test]
fn splice_runs_handles_non_contiguous_runs() {
use candle_core::Device;
let dev = Device::Cpu;
let h_vals: Vec<f32> = vec![1., 1., 2., 2., 3., 3., 4., 4., 5., 5.];
let h = Tensor::from_vec(h_vals, (1, 5, 2), &dev).unwrap();
let img_vals: Vec<f32> = vec![-1., -2., -3., -4.];
let img = Tensor::from_vec(img_vals, (2, 2), &dev).unwrap();
let out = splice_runs(&h, &img, &[1, 3]).unwrap();
let flat: Vec<f32> = out.flatten_all().unwrap().to_vec1().unwrap();
assert_eq!(flat, vec![1., 1., -1., -2., 3., 3., -3., -4., 5., 5.]);
}
}
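
The chunk walk in `prefill_with_images_chunked` can be traced without tensors. The sketch below is a std-only illustration, not code from this change: `ChunkPlan` and `plan_chunks` are hypothetical names, and the sketch assumes, as the diff's comments state, that image-embedding rows are consumed in prompt order. It only shows which embedding rows each window would take.

```rust
/// One planned prefill window (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub struct ChunkPlan {
    pub start: usize,   // first token index in the window
    pub end: usize,     // one past the last token index
    pub img_off: usize, // first image-embedding row this window consumes
    pub n_rows: usize,  // number of image-embedding rows it consumes
}

/// Walk `tokens` in `chunk_size` windows and record, per window, how many
/// image-placeholder tokens it holds and where its rows start.
pub fn plan_chunks(tokens: &[u32], image_token_id: u32, chunk_size: usize) -> Vec<ChunkPlan> {
    let chunk_size = chunk_size.max(1); // same clamp as the diff: never a zero-width window
    let mut plans = Vec::new();
    let mut img_off = 0;
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + chunk_size).min(tokens.len());
        let n_rows = tokens[start..end]
            .iter()
            .filter(|&&t| t == image_token_id)
            .count();
        plans.push(ChunkPlan { start, end, img_off, n_rows });
        img_off += n_rows; // later windows continue where this one stopped
        start = end;
    }
    plans
}
```

An image run that straddles a window boundary is split across two windows; the second picks up at the row offset where the first stopped, which is what the running `img_off` counter in the diff tracks.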


@@ -1,19 +1,27 @@
//! Rotary position embedding for Qwen3-Next's full-attention layers.
//!
//! Qwen3.6 ships with MRoPE (multimodal RoPE) machinery in the
//! reference Python — three position grids interleaved per
//! `mrope_section`. For text-only inference all three grids carry the
//! same position ids and the interleave is a no-op, so this module
//! implements the plain (non-mrope) flavour: the standard inv_freq
//! cosine/sine tables driven by `rope_theta` and `head_dim`.
//! Qwen3.6 declares **interleaved M-RoPE** (multimodal RoPE): the
//! rotary half-dimension is split across three position axes —
//! `[text, height, width]` per `mrope_section` (`[11,11,10]` for
//! Qwen3.6) — interleaved per-frequency. For **text** every token's
//! three axes carry the same position id, so the interleave is a no-op
//! and this reduces exactly to plain RoPE. For **image** tokens the
//! height/width axes carry the patch's 2D grid coordinates, which is
//! how the model reads the 14×14 patch layout (without it, all patches
//! share a height position and the image reads as vertical repetition).
//!
//! Rotation flavour: **GLM-style** rotate-half (the second half of the
//! head dim is negated and swapped into the first). The reference
//! Python uses `apply_rotary_pos_emb` with `rotate_half`; candle's
//! `rope_slow` is the matching helper.
//! Two cos/sin builders feed a shared [`RotaryEmbedding::apply_cos_sin`]:
//! - [`RotaryEmbedding::plain_cos_sin`] narrows the precomputed tables
//! at a scalar position — the text / decode fast path.
//! - [`RotaryEmbedding::mrope_cos_sin`] builds per-token cos/sin from a
//! `(3, seq)` position-id tensor, blending the three axes' frequencies
//! at the interleave index sets — the vision-prefill path.
//!
//! Rotation flavour: **GLM-style** rotate-half (candle's `rope_slow`),
//! matching the reference Python's `apply_rotary_pos_emb` + `rotate_half`.
use anyhow::Result;
use candle_core::{DType, Device, Tensor};
use candle_core::{DType, Device, IndexOp, Tensor};
use super::TextConfig;
@@ -21,6 +29,18 @@ use super::TextConfig;
pub struct RotaryEmbedding {
sin: Tensor,
cos: Tensor,
/// Inverse frequencies, shape `(1, rotary_dim/2)`. Retained (beyond
/// the precomputed `sin`/`cos` tables) so [`Self::mrope_cos_sin`] can
/// build cos/sin from arbitrary per-axis position ids.
inv_freq: Tensor,
/// Per-axis column masks over the rotary half-dim, shape `(1, half)`,
/// f32 0/1. `mask_t + mask_h + mask_w` partitions the columns; a
/// column belongs to exactly one axis. For a non-MRoPE config
/// `mask_t` is all-ones and the others all-zero (→ plain RoPE).
mask_t: Tensor,
mask_h: Tensor,
mask_w: Tensor,
dtype: DType,
/// Number of dims at the head's leading edge that the rotation
/// covers. The remaining `head_dim - rotary_dim` dims pass through
/// unchanged. Qwen3-Next uses `partial_rotary_factor = 0.25`, so
@@ -29,6 +49,52 @@ pub struct RotaryEmbedding {
head_dim: usize,
}
/// Build the per-axis 0/1 column masks over the rotary half-dim from
/// `mrope_section`. Returns `(temporal, height, width)` each length
/// `half`. Temporal is the complement of height and width, so the three
/// masks always partition `0..half` and reduce to all-temporal (plain
/// RoPE) when no usable section is given.
fn mrope_masks(
half: usize,
section: &[usize],
interleaved: bool,
) -> (Vec<f32>, Vec<f32>, Vec<f32>) {
let mut mh = vec![0f32; half];
let mut mw = vec![0f32; half];
if section.len() == 3 {
if interleaved {
// Qwen3-VL: height at columns 1,4,7,… ; width at 2,5,8,… ;
// temporal keeps 0,3,6,… . Height and width counts are capped
// via `take` at `mrope_section[1]` and `mrope_section[2]`.
for i in (1..half).step_by(3).take(section[1]) {
mh[i] = 1.0;
}
for i in (2..half).step_by(3).take(section[2]) {
mw[i] = 1.0;
}
} else {
// Qwen2-VL: contiguous blocks [text | height | width].
let h_start = section[0].min(half);
let h_end = (section[0] + section[1]).min(half);
for m in mh.iter_mut().take(h_end).skip(h_start) {
*m = 1.0;
}
for m in mw.iter_mut().take(half).skip(h_end) {
*m = 1.0;
}
}
}
let mt: Vec<f32> = (0..half)
.map(|i| {
if mh[i] == 0.0 && mw[i] == 0.0 {
1.0
} else {
0.0
}
})
.collect();
(mt, mh, mw)
}
impl RotaryEmbedding {
pub fn new(dtype: DType, cfg: &TextConfig, dev: &Device) -> Result<Self> {
let head_dim = cfg.head_dim;
@@ -52,44 +118,88 @@ impl RotaryEmbedding {
.step_by(2)
.map(|i| 1f32 / rope.rope_theta.powf(i as f64 / rotary_dim as f64) as f32)
.collect();
let n = inv_freq.len();
let inv_freq = Tensor::from_vec(inv_freq, (1, n), dev)?.to_dtype(DType::F32)?;
let half = inv_freq.len();
let inv_freq = Tensor::from_vec(inv_freq, (1, half), dev)?.to_dtype(DType::F32)?;
let t = Tensor::arange(0u32, max_seq_len as u32, dev)?
.to_dtype(DType::F32)?
.reshape((max_seq_len, 1))?;
let freqs = t.matmul(&inv_freq)?;
// MRoPE axis masks. `sum(mrope_section)` should equal `half`, but a
// mismatch is tolerated: columns not claimed by height or width
// stay on the temporal axis.
let (mt, mh, mw) = mrope_masks(half, &rope.mrope_section, rope.mrope_interleaved);
let mask_t = Tensor::from_vec(mt, (1, half), dev)?;
let mask_h = Tensor::from_vec(mh, (1, half), dev)?;
let mask_w = Tensor::from_vec(mw, (1, half), dev)?;
Ok(Self {
sin: freqs.sin()?.to_dtype(dtype)?,
cos: freqs.cos()?.to_dtype(dtype)?,
inv_freq,
mask_t,
mask_h,
mask_w,
dtype,
rotary_dim,
head_dim,
})
}
/// Apply RoPE to q, k.
///
/// `q`, `k` shape: `(B, H, L, head_dim)`. `offset` is the index
/// into the cached cos/sin table — the position of the first token
/// in the current step.
///
/// When `rotary_dim < head_dim` the rotation is applied only to the
/// first `rotary_dim` dims of each head; the tail passes through
/// unchanged (matches the reference Python's
/// `apply_rotary_pos_emb` with non-trivial `partial_rotary_factor`).
pub fn apply(
/// cos/sin for a contiguous run of `seq_len` positions starting at
/// `pos`, by narrowing the precomputed tables. The text / decode
/// path (all three MRoPE axes equal → plain RoPE). Shape
/// `(seq_len, rotary_dim/2)`.
pub fn plain_cos_sin(
&self,
pos: usize,
seq_len: usize,
) -> candle_core::Result<(Tensor, Tensor)> {
let cos = self.cos.narrow(0, pos, seq_len)?;
let sin = self.sin.narrow(0, pos, seq_len)?;
Ok((cos, sin))
}
/// cos/sin from explicit per-token 3D position ids, shape
/// `(3, seq_len)` (axes: text, height, width). Builds each axis's
/// frequencies and blends them at the interleave index sets, so
/// every rotary frequency slot is driven by exactly one axis.
/// Reduces exactly to [`Self::plain_cos_sin`] when the three axes are
/// equal. Returns cos/sin of shape `(seq_len, rotary_dim/2)`.
pub fn mrope_cos_sin(&self, position_ids: &Tensor) -> candle_core::Result<(Tensor, Tensor)> {
let pos = position_ids.to_dtype(DType::F32)?;
let (axes, seq_len) = pos.dims2()?;
debug_assert_eq!(axes, 3, "mrope position_ids must have 3 axes");
// Per-axis freqs: pos[a] (seq,1) @ inv_freq (1,half) → (seq,half).
let ft = pos.i(0)?.reshape((seq_len, 1))?.matmul(&self.inv_freq)?;
let fh = pos.i(1)?.reshape((seq_len, 1))?.matmul(&self.inv_freq)?;
let fw = pos.i(2)?.reshape((seq_len, 1))?.matmul(&self.inv_freq)?;
// Blend: each column belongs to exactly one axis (masks partition
// the half-dim), so this picks the right axis per frequency slot.
let blended = ft
.broadcast_mul(&self.mask_t)?
.add(&fh.broadcast_mul(&self.mask_h)?)?
.add(&fw.broadcast_mul(&self.mask_w)?)?;
let cos = blended.cos()?.to_dtype(self.dtype)?;
let sin = blended.sin()?.to_dtype(self.dtype)?;
Ok((cos, sin))
}
/// Apply rotary to `q`, `k` (shape `(B, H, L, head_dim)`) using
/// precomputed `cos`/`sin` of shape `(L, rotary_dim/2)`. Partial
/// rotary: only the first `rotary_dim` dims rotate; the tail passes
/// through unchanged.
pub fn apply_cos_sin(
&self,
q: &Tensor,
k: &Tensor,
offset: usize,
cos: &Tensor,
sin: &Tensor,
) -> candle_core::Result<(Tensor, Tensor)> {
let (_, _, seq_len, head_dim_in) = q.dims4()?;
let (_, _, _seq_len, head_dim_in) = q.dims4()?;
debug_assert_eq!(head_dim_in, self.head_dim, "q head_dim mismatch");
let cos = self.cos.narrow(0, offset, seq_len)?;
let sin = self.sin.narrow(0, offset, seq_len)?;
if self.rotary_dim == self.head_dim {
// Full rotation.
let q_embed = candle_nn::rotary_emb::rope_slow(&q.contiguous()?, &cos, &sin)?;
let k_embed = candle_nn::rotary_emb::rope_slow(&k.contiguous()?, &cos, &sin)?;
let q_embed = candle_nn::rotary_emb::rope_slow(&q.contiguous()?, cos, sin)?;
let k_embed = candle_nn::rotary_emb::rope_slow(&k.contiguous()?, cos, sin)?;
Ok((q_embed, k_embed))
} else {
// Partial rotation: narrow → rotate → cat the untouched tail.
@@ -102,8 +212,8 @@ impl RotaryEmbedding {
.narrow(candle_core::D::Minus1, 0, self.rotary_dim)?
.contiguous()?;
let k_pass = k.narrow(candle_core::D::Minus1, self.rotary_dim, tail)?;
let q_rotated = candle_nn::rotary_emb::rope_slow(&q_rot, &cos, &sin)?;
let k_rotated = candle_nn::rotary_emb::rope_slow(&k_rot, &cos, &sin)?;
let q_rotated = candle_nn::rotary_emb::rope_slow(&q_rot, cos, sin)?;
let k_rotated = candle_nn::rotary_emb::rope_slow(&k_rot, cos, sin)?;
let q_embed =
Tensor::cat(&[&q_rotated, &q_pass.contiguous()?], candle_core::D::Minus1)?;
let k_embed =
@@ -112,3 +222,358 @@ impl RotaryEmbedding {
}
}
}
/// Whether interleaved M-RoPE for image tokens is enabled. Default
/// **on**: Qwen3.6 was trained with interleaved M-RoPE, and this
/// implementation matches the HF `apply_interleaved_mrope` /
/// `get_rope_index` reference exactly (verified column-for-column). The
/// env var is a **kill switch**: `NEURON_MROPE=0` falls back to plain
/// sequential positions for image tokens (the pre-M-RoPE behaviour).
pub(crate) fn mrope_enabled() -> bool {
std::env::var("NEURON_MROPE")
.map(|v| {
!matches!(
v.trim().to_ascii_lowercase().as_str(),
"0" | "false" | "no" | "off"
)
})
.unwrap_or(true)
}
/// Compute interleaved-M-RoPE 3D position ids for a full prompt that may
/// contain image-placeholder runs, plus the decode `rope_delta`.
///
/// Mirrors the reference `get_rope_index`:
/// - text tokens advance a single running counter `c`, all three axes
/// equal (`[c, c, c]`);
/// - each contiguous run of `image_token_id` is one image; its tokens get
/// `[base + t, base + h, base + w]` in row-major (t outer, h, w inner),
/// where `base` is the counter at the run's start; after the run the
/// counter resumes from `base + max(grid_t, grid_h, grid_w)`.
///
/// Returns `(text_pos, height_pos, width_pos, rope_delta)`; each position
/// `Vec` has length `input_ids.len()`. `rope_delta = final_counter - seq_len`:
/// add it to a plain decode offset so text resumes from the counter after
/// the (position-compressed) image blocks.
///
/// Gated by [`mrope_enabled`]: when off, returns plain sequential
/// identity positions on all three axes (`mrope_cos_sin` then reduces
/// exactly to plain RoPE), restoring the pre-M-RoPE behaviour without
/// touching the rest of the forward.
pub(crate) fn get_rope_index(
input_ids: &[u32],
image_token_id: u32,
grids: &[(usize, usize)],
) -> Result<MRopeIndex> {
if !mrope_enabled() {
let seq: Vec<i64> = (0..input_ids.len() as i64).collect();
return Ok((seq.clone(), seq.clone(), seq, 0));
}
compute_mrope_index(input_ids, image_token_id, grids)
}
/// The real interleaved-M-RoPE position-id computation (always active in
/// unit tests; gated behind [`get_rope_index`] at runtime).
///
/// `grids` carries the post-merge LM grid `(lm_gh, lm_gw)` for each image
/// run, in prompt order — a run length alone cannot recover its
/// factorisation, so the grids must be passed (#14 dynamic resolution).
/// Each image is a still frame (`grid_t = 1`); its tokens get
/// `[base, base + hh, base + ww]` row-major and the shared counter
/// resumes at `base + max(lm_gh, lm_gw)`. Multi-image is correct because
/// the counter threads across images and interleaved text.
pub(crate) fn compute_mrope_index(
input_ids: &[u32],
image_token_id: u32,
grids: &[(usize, usize)],
) -> Result<MRopeIndex> {
let n = input_ids.len();
let mut text = Vec::with_capacity(n);
let mut height = Vec::with_capacity(n);
let mut width = Vec::with_capacity(n);
let mut counter: i64 = 0;
let mut i = 0;
let mut k = 0; // index into `grids`, one per image run
while i < n {
if input_ids[i] == image_token_id {
let start = i;
while i < n && input_ids[i] == image_token_id {
i += 1;
}
let run = i - start;
let (grid_h, grid_w) = *grids.get(k).ok_or_else(|| {
anyhow::anyhow!(
"get_rope_index: image run #{k} (len {run}) has no matching grid \
({} grids supplied)",
grids.len()
)
})?;
k += 1;
if grid_h * grid_w != run {
anyhow::bail!(
"get_rope_index: image run #{} length {run} != grid {grid_h}×{grid_w} = {}",
k - 1,
grid_h * grid_w
);
}
let base = counter;
for hh in 0..grid_h {
for ww in 0..grid_w {
text.push(base); // grid_t = 1 → temporal axis const
height.push(base + hh as i64);
width.push(base + ww as i64);
}
}
counter = base + grid_h.max(grid_w) as i64;
} else {
text.push(counter);
height.push(counter);
width.push(counter);
counter += 1;
i += 1;
}
}
if k != grids.len() {
anyhow::bail!(
"get_rope_index: prompt has {k} image run(s) but {} grid(s) were supplied",
grids.len()
);
}
let delta = counter - n as i64;
Ok((text, height, width, delta))
}
/// `(text_pos, height_pos, width_pos, rope_delta)` returned by
/// [`get_rope_index`]; the three vectors combine into the `(3, seq)`
/// MRoPE position-id tensor.
pub(crate) type MRopeIndex = (Vec<i64>, Vec<i64>, Vec<i64>, i64);
/// Build the `(3, seq)` position-id tensor consumed by
/// [`RotaryEmbedding::mrope_cos_sin`] from the three axis vectors.
///
/// Built directly as **f32** (positions are small integers, exact in
/// f32 up to 2^24, well past any context length): the freqs matmul
/// needs float anyway, and this avoids an i64 tensor / i64→f32 cast
/// on the GPU.
pub(crate) fn mrope_position_tensor(
text: &[i64],
height: &[i64],
width: &[i64],
dev: &Device,
) -> candle_core::Result<Tensor> {
let seq = text.len();
let mut flat = Vec::with_capacity(3 * seq);
flat.extend(text.iter().map(|&x| x as f32));
flat.extend(height.iter().map(|&x| x as f32));
flat.extend(width.iter().map(|&x| x as f32));
Tensor::from_vec(flat, (3, seq), dev)
}
#[cfg(test)]
mod tests {
use super::*;
use candle_core::IndexOp;
/// A TextConfig stub with Qwen3.6's rope params (head_dim 256,
/// partial 0.25 → rotary_dim 64 → half 32; section [11,11,10]).
fn qwen36_cfg() -> TextConfig {
serde_json::from_value(serde_json::json!({
"hidden_size": 5120,
"num_hidden_layers": 1,
"num_attention_heads": 64,
"num_key_value_heads": 8,
"head_dim": 256,
"intermediate_size": 1,
"vocab_size": 10,
"rms_norm_eps": 1e-6,
"max_position_embeddings": 64,
"layer_types": ["full_attention"],
"rope_parameters": {
"rope_theta": 10000000.0,
"partial_rotary_factor": 0.25,
"mrope_section": [11, 11, 10],
"mrope_interleaved": true
}
}))
.expect("cfg")
}
#[test]
fn mrope_masks_partition_the_half_dim() {
let (mt, mh, mw) = mrope_masks(32, &[11, 11, 10], true);
// Each column belongs to exactly one axis.
for i in 0..32 {
let s = mt[i] + mh[i] + mw[i];
assert_eq!(s, 1.0, "column {i} covered {s} times");
}
assert_eq!(mt.iter().sum::<f32>(), 11.0);
assert_eq!(mh.iter().sum::<f32>(), 11.0);
assert_eq!(mw.iter().sum::<f32>(), 10.0);
// Interleave: temporal 0,3,…; height 1,4,…; width 2,5,…
assert_eq!(mt[0], 1.0);
assert_eq!(mh[1], 1.0);
assert_eq!(mw[2], 1.0);
assert_eq!(mt[3], 1.0);
}
/// The load-bearing invariant: when all three position axes are
/// equal (text), `mrope_cos_sin` must reproduce `plain_cos_sin`
/// (checked here to within 1e-6), i.e. M-RoPE is a no-op for text,
/// so text inference is unchanged.
#[test]
fn mrope_reduces_to_plain_for_equal_axes() {
let dev = Device::Cpu;
let rope = RotaryEmbedding::new(DType::F32, &qwen36_cfg(), &dev).unwrap();
// positions 5,6,7 on all three axes.
let base: Vec<i64> = vec![5, 6, 7];
let pos =
Tensor::from_vec([base.clone(), base.clone(), base].concat(), (3, 3), &dev).unwrap();
let (mc, ms) = rope.mrope_cos_sin(&pos).unwrap();
let (pc, ps) = rope.plain_cos_sin(5, 3).unwrap();
let dcos = (mc - pc).unwrap().abs().unwrap().max_all().unwrap();
let dsin = (ms - ps).unwrap().abs().unwrap().max_all().unwrap();
assert!(
dcos.to_scalar::<f32>().unwrap() < 1e-6,
"cos mismatch {dcos:?}"
);
assert!(
dsin.to_scalar::<f32>().unwrap() < 1e-6,
"sin mismatch {dsin:?}"
);
}
/// Hand-checked interleave: a width-axis column (index 2) must track
/// the WIDTH position, while a temporal column (index 0) tracks the
/// TEXT position, even when the axes differ.
#[test]
fn mrope_blends_axes_at_interleave_columns() {
let dev = Device::Cpu;
let rope = RotaryEmbedding::new(DType::F32, &qwen36_cfg(), &dev).unwrap();
let half = rope.inv_freq.dim(1).unwrap();
let inv: Vec<f32> = rope.inv_freq.i(0).unwrap().to_vec1().unwrap();
// One token: text=10, height=3, width=7 — all distinct.
let pos = Tensor::from_vec(vec![10i64, 3, 7], (3, 1), &dev).unwrap();
let (cos, _sin) = rope.mrope_cos_sin(&pos).unwrap();
let cos_row: Vec<f32> = cos.i(0).unwrap().to_vec1().unwrap();
assert_eq!(cos_row.len(), half);
// Column 0 (temporal) → text pos 10. Column 1 (height) → 3.
// Column 2 (width) → 7.
assert!((cos_row[0] - (10.0 * inv[0]).cos()).abs() < 1e-5);
assert!((cos_row[1] - (3.0 * inv[1]).cos()).abs() < 1e-5);
assert!((cos_row[2] - (7.0 * inv[2]).cos()).abs() < 1e-5);
assert!((cos_row[3] - (10.0 * inv[3]).cos()).abs() < 1e-5);
}
#[test]
fn get_rope_index_text_only_is_sequential() {
let (t, h, w, delta) = compute_mrope_index(&[1, 2, 3, 4], 99, &[]).unwrap();
assert_eq!(t, vec![0, 1, 2, 3]);
assert_eq!(h, vec![0, 1, 2, 3]);
assert_eq!(w, vec![0, 1, 2, 3]);
assert_eq!(delta, 0, "no image → delta 0 → plain decode positions");
}
#[test]
fn get_rope_index_text_image_text() {
// [text, image(2x2 run of 4), text]. image_token = 99, grid (2,2).
let ids = [1u32, 99, 99, 99, 99, 2];
let (t, h, w, delta) = compute_mrope_index(&ids, 99, &[(2, 2)]).unwrap();
// token 0: text → 0. image base=1, grid 2x2:
// t all = 1; h = base+row = [1,1,2,2]; w = base+col = [1,2,1,2].
// resume from base + max(2,2) = 3. trailing text → 3.
assert_eq!(t, vec![0, 1, 1, 1, 1, 3]);
assert_eq!(h, vec![0, 1, 1, 2, 2, 3]);
assert_eq!(w, vec![0, 1, 2, 1, 2, 3]);
// final counter = 4, seq_len = 6 → delta = -2 (the 4 image tokens
// advanced the counter by only 2).
assert_eq!(delta, -2);
// Decode after the prompt (offset = 6) → text position 6 + (-2) = 4.
assert_eq!(6 + delta, 4);
}
#[test]
fn get_rope_index_nonsquare_single_image() {
// text + image(2 rows × 3 cols = 6 tokens). grid (2,3).
let ids = [1u32, 99, 99, 99, 99, 99, 99];
let (t, h, w, delta) = compute_mrope_index(&ids, 99, &[(2, 3)]).unwrap();
// base = 1; row-major h = [0,0,0,1,1,1]+1, w = [0,1,2,0,1,2]+1.
assert_eq!(t, vec![0, 1, 1, 1, 1, 1, 1]);
assert_eq!(h, vec![0, 1, 1, 1, 2, 2, 2]);
assert_eq!(w, vec![0, 1, 2, 3, 1, 2, 3]);
// resume from base + max(2,3) = 4; seq_len 7, counter 4 → delta -3.
assert_eq!(delta, 4 - 7);
}
#[test]
fn get_rope_index_two_images_different_grids() {
// img(2x2)=4, text, img(1x3)=3. grids [(2,2),(1,3)].
let ids = [99, 99, 99, 99, 7, 99, 99, 99];
let (t, h, w, delta) = compute_mrope_index(&ids, 99, &[(2, 2), (1, 3)]).unwrap();
// img1 base=0 → t=0, h=[0,0,1,1], w=[0,1,0,1]; resume max(2,2)=2.
// text at counter 2. img2 base=3 → t=3, h=[3,3,3], w=[3,4,5];
// resume 3+max(1,3)=6.
assert_eq!(t, vec![0, 0, 0, 0, 2, 3, 3, 3]);
assert_eq!(h, vec![0, 0, 1, 1, 2, 3, 3, 3]);
assert_eq!(w, vec![0, 1, 0, 1, 2, 3, 4, 5]);
assert_eq!(delta, 6 - 8);
}
#[test]
fn get_rope_index_on_by_default() {
// With NEURON_MROPE unset (default ON), the runtime path returns
// the real interleaved-M-RoPE positions. (NEURON_MROPE=0 would fall
// back to identity; not asserted here since it depends on env.)
let (t, h, w, _delta) = get_rope_index(&[1, 99, 99, 99, 99, 2], 99, &[(2, 2)]).unwrap();
assert_eq!(t, vec![0, 1, 1, 1, 1, 3]);
assert_eq!(h, vec![0, 1, 1, 2, 2, 3]);
assert_eq!(w, vec![0, 1, 2, 1, 2, 3]);
}
#[test]
fn get_rope_index_grid_mismatches_error() {
// run length != grid product.
assert!(compute_mrope_index(&[99u32; 6], 99, &[(2, 2)]).is_err());
// too few grids for the number of image runs.
assert!(compute_mrope_index(&[99, 99, 7, 99], 99, &[(1, 2)]).is_err());
// too many grids.
assert!(compute_mrope_index(&[99, 99], 99, &[(1, 2), (1, 1)]).is_err());
}
#[test]
fn position_tensor_round_trips_through_mrope_cos_sin() {
// get_rope_index → (3,seq) tensor → mrope_cos_sin, and confirm an
// image token's height column tracks its grid row (not the text
// counter), i.e. the end-to-end position plumbing is wired right.
let dev = Device::Cpu;
let rope = RotaryEmbedding::new(DType::F32, &qwen36_cfg(), &dev).unwrap();
let ids = [1u32, 99, 99, 99, 99]; // text + 2x2 image
let (t, h, w, _d) = compute_mrope_index(&ids, 99, &[(2, 2)]).unwrap();
let pos = mrope_position_tensor(&t, &h, &w, &dev).unwrap();
assert_eq!(pos.dims(), &[3, 5]);
let (cos, _sin) = rope.mrope_cos_sin(&pos).unwrap();
assert_eq!(cos.dims(), &[5, rope.inv_freq.dim(1).unwrap()]);
let inv: Vec<f32> = rope.inv_freq.i(0).unwrap().to_vec1().unwrap();
// Last image token (index 4): grid (h=1, w=1) → base 1 → h=2, w=2.
// Height column (index 1) must track h-position 2, not text.
let last: Vec<f32> = cos.i(4).unwrap().to_vec1().unwrap();
assert!((last[1] - (2.0 * inv[1]).cos()).abs() < 1e-5);
}
#[test]
fn get_rope_index_196_is_14x14() {
let mut ids = vec![1u32]; // one text token
ids.extend(std::iter::repeat_n(99u32, 196));
let (t, h, w, _delta) = compute_mrope_index(&ids, 99, &[(14, 14)]).unwrap();
// image base = 1. Last image token (index 196) is grid (h=13,w=13).
assert_eq!(*t.last().unwrap(), 1, "grid_t=1 → temporal const at base");
assert_eq!(h[1], 1, "first image row at base");
assert_eq!(w[1], 1, "first image col at base");
assert_eq!(h[196], 1 + 13, "last image row = base + 13");
assert_eq!(w[196], 1 + 13, "last image col = base + 13");
}
}
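
The mask blend in `mrope_cos_sin` amounts to choosing one position axis per rotary column. The std-only sketch below restates that selection for the interleaved layout only. `axis_for_column` and `mrope_angles` are hypothetical names, not part of this change. The diff's tensor code multiplies by 0/1 masks and sums, where the sketch indexes directly; the two should agree because each column belongs to exactly one axis.

```rust
/// Axis that drives rotary column `i` under the interleaved layout:
/// 0 = text/temporal, 1 = height, 2 = width. (Hypothetical helper.)
pub fn axis_for_column(i: usize, section: [usize; 3]) -> usize {
    let nth = i / 3; // how many earlier columns share this residue class
    match i % 3 {
        1 if nth < section[1] => 1, // height: columns 1,4,7,… up to section[1] of them
        2 if nth < section[2] => 2, // width: columns 2,5,8,… up to section[2] of them
        _ => 0,                     // everything unclaimed stays on the text axis
    }
}

/// Rotation angle per column for one token with positions
/// `pos = [text, height, width]`; cos/sin of these feed the rotation.
pub fn mrope_angles(pos: [f32; 3], inv_freq: &[f32], section: [usize; 3]) -> Vec<f32> {
    inv_freq
        .iter()
        .enumerate()
        .map(|(i, f)| pos[axis_for_column(i, section)] * f)
        .collect()
}
```

With `section = [11, 11, 10]` over 32 columns this gives 11 text, 11 height and 10 width columns, the same partition `mrope_masks_partition_the_half_dim` asserts. When all three entries of `pos` are equal, every column sees the same position and the result is plain RoPE.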


@@ -0,0 +1,299 @@
//! Cache-state snapshots for prefix KV caching (#11).
//!
//! A snapshot captures everything `clear_kv_cache` would destroy, at
//! one consistent token boundary:
//!
//! - full-attention layers: the `ConcatKvCache` k/v tensors,
//! - linear-attention layers: the GatedDeltaNet `conv_state` +
//! `recurrent_state`,
//! - the model-level `rope_delta` position counter.
//!
//! The GatedDeltaNet recurrent state cannot be rewound to an earlier
//! token, so a snapshot is only reusable when its entire token
//! sequence is an exact prefix of an incoming prompt — matching policy
//! lives in `harness/prefix_cache.rs`; this module is just the state
//! capture.
//!
//! ## Copy semantics
//!
//! Attention k/v snapshots share storage with the live cache:
//! `ConcatKvCache::append` never mutates stored tensors in place (it
//! `cat`s into fresh allocations), so a shallow `Tensor` clone stays
//! valid after the live cache moves on. The GDN states are
//! **deep-copied** in both directions (`Tensor::copy`): the CUDA
//! delta-rule kernels update the recurrent-state buffer in place, and
//! `flatten`/`contiguous` on an already-contiguous tensor is a view —
//! a shared-storage snapshot would be corrupted by the next forward.
use candle_core::Tensor;
/// Per-layer captured state. Variant kind must match the layer's
/// `AttentionKind` on restore.
pub enum LayerKvSnapshot {
/// `ConcatKvCache` contents. `None` when the cache was empty
/// (a zero-token snapshot — valid but useless; the registry never
/// stores one).
Full(Option<(Tensor, Tensor)>),
/// GatedDeltaNet state. Either tensor is `None` before the first
/// forward touches it.
Linear {
conv_state: Option<Tensor>,
recurrent_state: Option<Tensor>,
},
}
/// One consistent cache snapshot of a `Qwen3_5Model` (or its TP
/// mirror `tp_qwen3_5::TpQwen3_5Model`, whose per-rank shard state
/// has the same shape) at a token boundary. Fields are `pub(crate)`
/// so the TP module can construct/consume the same type; holders
/// outside the harness only ever pass it back to `restore_kv_cache`.
pub struct KvCacheSnapshot {
pub(crate) layers: Vec<LayerKvSnapshot>,
pub(crate) rope_delta: i64,
}
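The module doc states that a snapshot is reusable only when its whole token sequence is an exact prefix of the incoming prompt. A minimal sketch of that check follows; `reusable_suffix` is a hypothetical helper, and the real matching policy lives in `harness/prefix_cache.rs`, which is not shown here and may differ.

```rust
/// Hypothetical helper, not the crate's matching policy. Returns the
/// suffix still to be prefilled when `snapshot_tokens` is an exact,
/// non-empty prefix of `prompt`; otherwise `None`.
fn reusable_suffix<'a>(snapshot_tokens: &[u32], prompt: &'a [u32]) -> Option<&'a [u32]> {
    if snapshot_tokens.is_empty() {
        // A zero-token snapshot saves no work.
        return None;
    }
    // Exact-prefix match only: the recurrent state cannot be rewound,
    // so a partial overlap is not usable.
    prompt.strip_prefix(snapshot_tokens)
}
```

A caller would likely also require a non-empty suffix, since at least one token must be forwarded to produce logits; that extra rule is an assumption and is left out of the sketch.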
impl KvCacheSnapshot {
/// Number of layer snapshots held (test/diagnostic helper).
pub fn layer_count(&self) -> usize {
self.layers.len()
}
/// Total bytes of tensor data held by this snapshot. Used for the
/// prefix-cache VRAM budget. Attention k/v shares storage with the
/// live cache at capture time, but the live cache is cleared or
/// replaced before the next request, so counting the full size is
/// the honest steady-state figure.
pub fn size_bytes(&self) -> u64 {
fn t_bytes(t: &Tensor) -> u64 {
(t.elem_count() * t.dtype().size_in_bytes()) as u64
}
self.layers
.iter()
.map(|l| match l {
LayerKvSnapshot::Full(Some((k, v))) => t_bytes(k) + t_bytes(v),
LayerKvSnapshot::Full(None) => 0,
LayerKvSnapshot::Linear {
conv_state,
recurrent_state,
} => {
conv_state.as_ref().map(t_bytes).unwrap_or(0)
+ recurrent_state.as_ref().map(t_bytes).unwrap_or(0)
}
})
.sum()
}
}
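The arithmetic behind `size_bytes` is element count times element width, summed over the held tensors. The sketch below works through it on plain shapes; `tensor_bytes` and `fits_budget` are hypothetical names, and the budget check is an assumed use of the figure, not the crate's eviction logic.

```rust
/// Bytes held by one tensor: element count times element width.
fn tensor_bytes(dims: &[usize], elem_bytes: usize) -> u64 {
    dims.iter().product::<usize>() as u64 * elem_bytes as u64
}

/// Hypothetical budget check: does adding `incoming` bytes to `held`
/// stay within `budget`? Overflow counts as not fitting.
fn fits_budget(held: u64, incoming: u64, budget: u64) -> bool {
    held.checked_add(incoming)
        .map_or(false, |total| total <= budget)
}
```

For example, an f32 key tensor of shape `(1, 1, 3, 8)` is 96 bytes, so a k/v pair is 192 bytes; a 2-byte-per-element tensor of shape `(1, 4, 1024, 128)` is exactly 1 MiB. These shapes are illustrative only.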
#[cfg(test)]
mod tests {
use super::super::{Qwen3_5Model, RopeParameters, TextConfig};
use candle_core::{DType, Device, Tensor};
use std::collections::HashMap;
/// Tiny two-layer config covering both attention kinds.
fn tiny_config() -> TextConfig {
TextConfig {
vocab_size: 32,
hidden_size: 16,
intermediate_size: 32,
num_hidden_layers: 2,
num_attention_heads: 2,
num_key_value_heads: 1,
head_dim: 8,
max_position_embeddings: 64,
rope_parameters: RopeParameters {
rope_theta: 10000.0,
partial_rotary_factor: 0.5,
rope_type: None,
mrope_section: Vec::new(),
mrope_interleaved: false,
},
rms_norm_eps: 1e-6,
tie_word_embeddings: true,
attn_output_gate: true,
layer_types: vec!["linear_attention".into(), "full_attention".into()],
full_attention_interval: Some(4),
hidden_act: "silu".into(),
linear_num_value_heads: 4,
linear_num_key_heads: 2,
linear_key_head_dim: 4,
linear_value_head_dim: 4,
linear_conv_kernel_dim: 4,
}
}
/// Build a Qwen3_5Model from random weights written to a temp
/// safetensors file — the same `ShardedVarBuilder` path the real
/// loader uses.
fn tiny_model(cfg: &TextConfig) -> Qwen3_5Model {
let dev = Device::Cpu;
let randn = |shape: &[usize]| Tensor::randn(0f32, 0.2f32, shape, &dev).unwrap();
let h = cfg.hidden_size;
let inter = cfg.intermediate_size;
let key_dim = cfg.linear_key_head_dim * cfg.linear_num_key_heads;
let value_dim = cfg.linear_value_head_dim * cfg.linear_num_value_heads;
let conv_dim = key_dim * 2 + value_dim;
let nv = cfg.linear_num_value_heads;
let hd = cfg.head_dim;
let q_out = cfg.num_attention_heads * hd * 2;
let kv_out = cfg.num_key_value_heads * hd;
let mut t: HashMap<String, Tensor> = HashMap::new();
let p = "model.language_model";
t.insert(
format!("{p}.embed_tokens.weight"),
randn(&[cfg.vocab_size, h]),
);
t.insert(format!("{p}.norm.weight"), randn(&[h]));
for (i, kind) in cfg.layer_types.iter().enumerate() {
let lp = format!("{p}.layers.{i}");
t.insert(format!("{lp}.input_layernorm.weight"), randn(&[h]));
t.insert(format!("{lp}.post_attention_layernorm.weight"), randn(&[h]));
t.insert(format!("{lp}.mlp.gate_proj.weight"), randn(&[inter, h]));
t.insert(format!("{lp}.mlp.up_proj.weight"), randn(&[inter, h]));
t.insert(format!("{lp}.mlp.down_proj.weight"), randn(&[h, inter]));
match kind.as_str() {
"linear_attention" => {
let ap = format!("{lp}.linear_attn");
t.insert(format!("{ap}.in_proj_qkv.weight"), randn(&[conv_dim, h]));
t.insert(format!("{ap}.in_proj_z.weight"), randn(&[value_dim, h]));
t.insert(format!("{ap}.in_proj_b.weight"), randn(&[nv, h]));
t.insert(format!("{ap}.in_proj_a.weight"), randn(&[nv, h]));
t.insert(format!("{ap}.out_proj.weight"), randn(&[h, value_dim]));
t.insert(
format!("{ap}.conv1d.weight"),
randn(&[conv_dim, 1, cfg.linear_conv_kernel_dim]),
);
t.insert(format!("{ap}.dt_bias"), randn(&[nv]));
t.insert(format!("{ap}.A_log"), randn(&[nv]));
t.insert(
format!("{ap}.norm.weight"),
randn(&[cfg.linear_value_head_dim]),
);
}
"full_attention" => {
let ap = format!("{lp}.self_attn");
t.insert(format!("{ap}.q_proj.weight"), randn(&[q_out, h]));
t.insert(format!("{ap}.k_proj.weight"), randn(&[kv_out, h]));
t.insert(format!("{ap}.v_proj.weight"), randn(&[kv_out, h]));
t.insert(
format!("{ap}.o_proj.weight"),
randn(&[h, cfg.num_attention_heads * hd]),
);
t.insert(format!("{ap}.q_norm.weight"), randn(&[hd]));
t.insert(format!("{ap}.k_norm.weight"), randn(&[hd]));
}
other => panic!("unexpected layer type {other}"),
}
}
let dir = tempfile::tempdir().expect("tempdir");
let path = dir.path().join("model.safetensors");
candle_core::safetensors::save(&t, &path).expect("save safetensors");
// SAFETY: mmap of a file this test just wrote and nothing else
// mutates — same justification as the real loader.
let vb = unsafe {
candle_nn::var_builder::ShardedSafeTensors::var_builder(
std::slice::from_ref(&path),
DType::F32,
&dev,
)
.expect("build ShardedVarBuilder")
};
Qwen3_5Model::load(cfg, &vb).expect("load tiny qwen3_5 model")
}
fn forward_tokens(model: &mut Qwen3_5Model, tokens: &[u32], offset: usize) -> Vec<f32> {
let input = Tensor::new(tokens, &Device::Cpu)
.unwrap()
.unsqueeze(0)
.unwrap();
let hidden = model.forward(&input, offset).unwrap();
// Last-position hidden row — what the lm_head would consume.
let (_, l, _) = hidden.dims3().unwrap();
hidden
.narrow(1, l - 1, 1)
.unwrap()
.flatten_all()
.unwrap()
.to_vec1()
.unwrap()
}
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
assert_eq!(a.len(), b.len());
a.iter()
.zip(b)
.map(|(x, y)| (x - y).abs())
.fold(0f32, f32::max)
}
/// The gold test for #11: prefill a prefix, snapshot, perturb the
/// live state with unrelated tokens, restore, prefill only the
/// suffix — the result must match a fresh full prefill. Exercises
/// attention KV, GDN conv/recurrent state, and offset bookkeeping
/// in one pass; the perturbation step would corrupt a
/// shared-storage (non-deep-copied) GDN snapshot.
#[test]
fn restore_then_suffix_matches_full_prefill() {
let cfg = tiny_config();
let mut model = tiny_model(&cfg);
let prefix: &[u32] = &[1, 2, 3];
let suffix: &[u32] = &[4, 5, 6];
let full: Vec<u32> = prefix.iter().chain(suffix).copied().collect();
model.clear_kv_cache();
let h_full = forward_tokens(&mut model, &full, 0);
model.clear_kv_cache();
forward_tokens(&mut model, prefix, 0);
let snap = model.snapshot_kv_cache().expect("snapshot");
assert_eq!(snap.layer_count(), 2);
assert!(snap.size_bytes() > 0);
// Advance the live state past the snapshot boundary — a
// different continuation, as a subsequent request would be.
forward_tokens(&mut model, &[9, 8], prefix.len());
model.restore_kv_cache(&snap).expect("restore");
let h_restored = forward_tokens(&mut model, suffix, prefix.len());
let diff = max_abs_diff(&h_full, &h_restored);
assert!(diff < 1e-4, "restored-prefix forward diverged: {diff}");
// The snapshot must survive restore + forward cycles (deep
// copy of the in-place-mutated GDN state): restore again and
// expect the identical result.
model.restore_kv_cache(&snap).expect("second restore");
let h_again = forward_tokens(&mut model, suffix, prefix.len());
let diff = max_abs_diff(&h_restored, &h_again);
assert!(diff < 1e-6, "second restore diverged: {diff}");
}
/// Restoring must fully replace the live state, not blend with it
/// — a divergent continuation after restore equals the same
/// continuation after a fresh prefill of the prefix.
#[test]
fn restore_replaces_live_state() {
let cfg = tiny_config();
let mut model = tiny_model(&cfg);
let prefix: &[u32] = &[7, 7, 2, 5];
let cont: &[u32] = &[11, 13];
model.clear_kv_cache();
forward_tokens(&mut model, prefix, 0);
let h_fresh = forward_tokens(&mut model, cont, prefix.len());
model.clear_kv_cache();
forward_tokens(&mut model, prefix, 0);
let snap = model.snapshot_kv_cache().expect("snapshot");
forward_tokens(&mut model, &[3, 1, 4, 1, 5], prefix.len());
model.restore_kv_cache(&snap).expect("restore");
let h_restored = forward_tokens(&mut model, cont, prefix.len());
let diff = max_abs_diff(&h_fresh, &h_restored);
assert!(diff < 1e-5, "restore did not replace live state: {diff}");
}
}
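The shape of the two tests above (prefill, snapshot, perturb, restore, continue) can be reproduced on a toy recurrence. `ToyModel` and `ToyState` are hypothetical stand-ins with a single scalar state; they model only the cache mechanics, not the real `Qwen3_5Model`.

```rust
/// Toy recurrent state: like a recurrent layer state, it cannot be rewound.
#[derive(Clone, Debug, PartialEq)]
struct ToyState {
    acc: f32,
    seen: usize,
}

/// Hypothetical stand-in for the model.
struct ToyModel {
    state: ToyState,
}

impl ToyModel {
    fn new() -> Self {
        ToyModel { state: ToyState { acc: 0.0, seen: 0 } }
    }
    fn clear(&mut self) {
        *self = ToyModel::new();
    }
    /// Fold tokens into the state in place; returns the last "hidden" value.
    fn forward(&mut self, tokens: &[u32]) -> f32 {
        for &tok in tokens {
            self.state.acc = 0.5 * self.state.acc + tok as f32;
            self.state.seen += 1;
        }
        self.state.acc
    }
    /// Deep copy of the live state at the current token boundary.
    fn snapshot(&self) -> ToyState {
        self.state.clone()
    }
    /// Replace (not blend) the live state with a copy of the snapshot.
    fn restore(&mut self, snap: &ToyState) {
        self.state = snap.clone();
    }
}
```

Prefilling `[1, 2, 3]`, snapshotting, forwarding unrelated tokens, restoring, and then forwarding `[4, 5, 6]` gives the same value as a fresh prefill of all six tokens, and the snapshot itself is unchanged afterwards.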


@@ -0,0 +1,843 @@
//! Qwen3.6 vision tower.
//!
//! 27 pre-norm ViT blocks with **LayerNorm** (with biases — not the
//! `(1+w)·x` RmsNorm the language model uses), fused QKV attention,
//! GELU-tanh MLP. Followed by a `merger` that LayerNorms each
//! 1152-dim vision token, spatially 2×2-merges them into 4608-dim
//! groups, and projects to the LM's 5120-dim hidden via
//! `linear_fc1 → GELU-tanh → linear_fc2`.
//!
//! Architecture spec sourced from beast's cached Qwen3.6-27B
//! safetensors header (Stage A0, see
//! `doc/vision-qwen3_6-spec.md`). All weight shapes confirmed
//! from the live `.safetensors` headers, not inferred.
//!
//! **Conv3d wrinkle.** The published `patch_embed.proj.weight` is 5D
//! `[1152, 3, 2, 16, 16]` — a 3D conv with kernel
//! `(t=2, h=16, w=16)`. Candle 0.10 has no Conv3d. For static images
//! we get away with a trick: when the temporal patch size is 2 and we
//! duplicate the still image along the temporal axis (`T = 2`,
//! frame_0 == frame_1), the Conv3d output equals a Conv2d run with
//! the *sum* of the two temporal weight slices:
//!
//! ```text
//! output = W_0 · frame_0 + W_1 · frame_1 + bias
//!        = (W_0 + W_1) · frame + bias          (static image)
//! ```
//!
//! So at load we sum-collapse the temporal axis and use a 4D
//! `Conv2d` kernel. Video support would have to do the real Conv3d
//! (different frames mean the trick fails) — tracked alongside the
//! dynamic-resolution work in issue #14.
//!
//! Forward signature (Stage A — no LM splice yet):
//!
//! ```text
//! fn forward(&self, image: &Tensor) -> Result<Tensor>
//! ```
//!
//! `image` is `(3, H, W)` f32, normalised by `preprocess::preprocess`.
//! Returns `(N_lm_tokens, out_hidden_size)` post-merger tokens ready
//! to splice into the LM's input embeddings at `<|image_pad|>`
//! positions. For Qwen3.6 at 448×448 → 28×28 patches → 14×14 = 196
//! LM tokens of dim 5120.
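The sum-collapse identity from the Conv3d note can be checked on a single patch and a single output channel, where the convolution reduces to dot products. This is a sketch in plain Rust with flattened slices; the function names are illustrative and none of it is candle code.

```rust
/// Dot product of a flattened kernel slice with a flattened patch.
fn dot(w: &[f32], x: &[f32]) -> f32 {
    w.iter().zip(x).map(|(a, b)| a * b).sum()
}

/// One output channel of a temporal-depth-2 Conv3d at a single patch.
fn conv3d_patch(w0: &[f32], w1: &[f32], frame0: &[f32], frame1: &[f32], bias: f32) -> f32 {
    dot(w0, frame0) + dot(w1, frame1) + bias
}

/// Sum the two temporal kernel slices into one 2D kernel.
fn fold_temporal(w0: &[f32], w1: &[f32]) -> Vec<f32> {
    w0.iter().zip(w1).map(|(a, b)| a + b).collect()
}

/// The same output channel computed as a Conv2d with the folded kernel.
fn conv2d_patch(folded: &[f32], frame: &[f32], bias: f32) -> f32 {
    dot(folded, frame) + bias
}
```

When both frames are the same image, `conv3d_patch` and `conv2d_patch` over the folded kernel agree; as soon as the frames differ, they do not, which is why video input needs the real Conv3d.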
use anyhow::{Context, Result};
use candle_core::{D, DType, Device, IndexOp, Module, Tensor};
use candle_nn::var_builder::ShardedVarBuilder;
use candle_nn::{Conv2d, Conv2dConfig, Embedding, LayerNorm, Linear};
use serde::Deserialize;
fn env_truthy(name: &str) -> bool {
std::env::var(name)
.map(|v| {
matches!(
v.trim().to_ascii_lowercase().as_str(),
"1" | "true" | "yes" | "on"
)
})
.unwrap_or(false)
}
/// Legacy escape hatch: when set, use the original Stage-A sequential
/// `pos_embed` lookup instead of the bilinear grid interpolation.
/// Default off (interpolation on) — for A/B comparison only.
fn vision_legacy_pos() -> bool {
env_truthy("NEURON_VISION_LEGACY_POS")
}
/// Legacy escape hatch: when set, skip the 2D vision rotary in the ViT
/// attention (the original Stage-A behaviour). Default off (rotary on)
/// — for A/B comparison only.
fn vision_legacy_rope() -> bool {
env_truthy("NEURON_VISION_LEGACY_ROPE")
}
/// Qwen3.6 vision tower hyperparameters. Mirrors the `vision_config`
/// block of `config.json`. Only the fields we actually need are
/// captured; serde tolerates the rest.
#[derive(Debug, Clone, Deserialize)]
pub struct VisionConfig {
/// Number of ViT blocks (`depth: 27` for Qwen3.6).
pub depth: usize,
/// Vision-token dimension throughout the tower (1152 for Qwen3.6).
pub hidden_size: usize,
/// MLP intermediate dim (4304).
pub intermediate_size: usize,
/// Attention head count (16). `head_dim = hidden_size / num_heads`.
pub num_heads: usize,
/// Number of slots in the learned position embedding (2304, a 48×48
/// grid for Qwen3.6). Caps the maximum image patch count.
pub num_position_embeddings: usize,
/// Spatial patch edge in pixels (16).
pub patch_size: usize,
/// Temporal kernel depth in the patch embed (2 for Qwen3.6 — we
/// collapse this into a single Conv2d for static-image inference;
/// see the module-level Conv3d wrinkle).
pub temporal_patch_size: usize,
/// Patches grouped per LM token by the merger (2 → 2×2 = 4
/// patches per LM token).
pub spatial_merge_size: usize,
/// Vision input channels (3, RGB).
pub in_channels: usize,
/// Merger output dim — matches the LM's `hidden_size` (5120 for
/// Qwen3.6). The merger projects from vision dim → LM dim.
pub out_hidden_size: usize,
}
const LAYER_NORM_EPS: f64 = 1e-6;
/// Number of LM tokens emitted by the merger per vision-token group.
const LM_TOKENS_PER_MERGE_GROUP: usize = 1;
/// One ViT block: pre-LN → attn → residual; pre-LN → MLP → residual.
struct VisionBlock {
norm1: LayerNorm,
qkv: Linear,
proj: Linear,
norm2: LayerNorm,
fc1: Linear,
fc2: Linear,
num_heads: usize,
head_dim: usize,
}
impl VisionBlock {
fn load(cfg: &VisionConfig, vb: &ShardedVarBuilder) -> Result<Self> {
let h = cfg.hidden_size;
let head_dim = h / cfg.num_heads;
let norm1 = layer_norm(vb.pp("norm1"), h)?;
let qkv = linear(vb.pp("attn.qkv"), h, 3 * h)?;
let proj = linear(vb.pp("attn.proj"), h, h)?;
let norm2 = layer_norm(vb.pp("norm2"), h)?;
let fc1 = linear(vb.pp("mlp.linear_fc1"), h, cfg.intermediate_size)?;
let fc2 = linear(vb.pp("mlp.linear_fc2"), cfg.intermediate_size, h)?;
Ok(Self {
norm1,
qkv,
proj,
norm2,
fc1,
fc2,
num_heads: cfg.num_heads,
head_dim,
})
}
/// `x`: `(N, hidden_size)`, un-batched. `rotary`: optional
/// `(cos, sin)`, each `(N, head_dim/2)`, holding the 2D vision rotary
/// applied to q/k. Returns a tensor of the same shape as `x`.
fn forward(&self, x: &Tensor, rotary: Option<&(Tensor, Tensor)>) -> Result<Tensor> {
let attn_in = self.norm1.forward(x)?;
let attn_out = self.attention(&attn_in, rotary)?;
let x = x.add(&attn_out)?;
let mlp_in = self.norm2.forward(&x)?;
let mlp_out = self.fc2.forward(&gelu_tanh(&self.fc1.forward(&mlp_in)?)?)?;
x.add(&mlp_out).map_err(Into::into)
}
/// Multi-head self-attention over the patch sequence. No causal
/// mask — every patch attends to every other patch. When `rotary` is
/// given, the 2D vision rotary (row/col position) is applied to q, k
/// before the scores, matching HF `apply_rotary_pos_emb_vision`
/// (`rope_slow` is the same rotate-half form).
fn attention(&self, x: &Tensor, rotary: Option<&(Tensor, Tensor)>) -> Result<Tensor> {
let (n, hidden) = x.dims2()?;
// qkv: (N, 3*hidden). Split into Q, K, V each (N, hidden).
let qkv = self.qkv.forward(x)?;
let qkv = qkv.reshape((n, 3, self.num_heads, self.head_dim))?;
// Transpose to (3, num_heads, N, head_dim) for per-head views.
let qkv = qkv.permute((1, 2, 0, 3))?.contiguous()?;
let q = qkv.i(0)?;
let k = qkv.i(1)?;
let v = qkv.i(2)?;
// 2D vision rotary on q, k (full head_dim; rotate-half form).
let (q, k) = match rotary {
Some((cos, sin)) => {
let q = candle_nn::rotary_emb::rope_slow(&q.unsqueeze(0)?, cos, sin)?.squeeze(0)?;
let k = candle_nn::rotary_emb::rope_slow(&k.unsqueeze(0)?, cos, sin)?.squeeze(0)?;
(q, k)
}
None => (q, k),
};
let scale = 1.0 / (self.head_dim as f64).sqrt();
// (num_heads, N, head_dim) @ (num_heads, head_dim, N) -> (num_heads, N, N)
let scores = q.matmul(&k.transpose(D::Minus2, D::Minus1)?)?;
let scores = (scores * scale)?;
let probs = candle_nn::ops::softmax_last_dim(&scores)?;
// (num_heads, N, N) @ (num_heads, N, head_dim) -> (num_heads, N, head_dim)
let out = probs.matmul(&v)?;
// Merge heads back: (num_heads, N, head_dim) -> (N, num_heads, head_dim) -> (N, hidden).
let out = out.permute((1, 0, 2))?.contiguous()?.reshape((n, hidden))?;
self.proj.forward(&out).map_err(Into::into)
}
}
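For reference, the unmasked attention computed per head in `VisionBlock::attention` reduces to the following single-head sketch over plain row vectors. It is illustrative only: it leaves out the fused QKV projection, the head split, and the rotary step, and `attention_sketch` is a hypothetical name.

```rust
/// Single-head, unmasked scaled dot-product attention over row vectors.
/// `q`, `k`: n rows of width d; `v`: n rows of any width.
fn attention_sketch(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let d = q.first().map_or(1, |row| row.len()).max(1);
    let scale = 1.0 / (d as f32).sqrt();
    let width = v.first().map_or(0, |row| row.len());
    q.iter()
        .map(|qi| {
            // Scaled scores of this query against every key.
            let scores: Vec<f32> = k
                .iter()
                .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale)
                .collect();
            // Numerically stable softmax over the key axis.
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
            let total: f32 = exps.iter().sum();
            // Probability-weighted sum of the value rows.
            let mut out = vec![0.0f32; width];
            for (p, vj) in exps.iter().zip(v) {
                for (o, x) in out.iter_mut().zip(vj) {
                    *o += (p / total) * x;
                }
            }
            out
        })
        .collect()
}
```

With all-zero queries every score is zero, the softmax is uniform, and each output row is the mean of the value rows, which makes a convenient sanity check.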
/// `merger`: LayerNorm per token → spatial 2×2 merge (concat 4
/// adjacent tokens into one 4608-dim vector) → fc1 → GELU-tanh →
/// fc2. Output dim is the LM's hidden_size.
struct VisionMerger {
norm: LayerNorm,
fc1: Linear,
fc2: Linear,
merge_input_dim: usize,
spatial_merge_size: usize,
}
impl VisionMerger {
fn load(cfg: &VisionConfig, vb: &ShardedVarBuilder) -> Result<Self> {
let h = cfg.hidden_size;
let merge = cfg.spatial_merge_size;
let merge_input_dim = h * merge * merge;
let norm = layer_norm(vb.pp("norm"), h)?;
let fc1 = linear(vb.pp("linear_fc1"), merge_input_dim, merge_input_dim)?;
let fc2 = linear(vb.pp("linear_fc2"), merge_input_dim, cfg.out_hidden_size)?;
Ok(Self {
norm,
fc1,
fc2,
merge_input_dim,
spatial_merge_size: merge,
})
}
/// `tokens`: `(grid_h, grid_w, hidden_size)`. The merger reshapes
/// each `merge×merge` block of adjacent patches into a single
/// concatenated vector, then projects.
///
/// `grid_h` and `grid_w` must both be multiples of
/// `spatial_merge_size`. Returns
/// `(grid_h/merge × grid_w/merge, out_hidden_size)`.
fn forward(&self, tokens: &Tensor) -> Result<Tensor> {
let (gh, gw, h) = tokens.dims3()?;
let m = self.spatial_merge_size;
anyhow::ensure!(
gh.is_multiple_of(m) && gw.is_multiple_of(m),
"merger expects spatial dims divisible by merge_size={m}; got ({gh}, {gw})"
);
let tokens = self.norm.forward(tokens)?;
// (gh, gw, h) -> (gh/m, m, gw/m, m, h) -> (gh/m, gw/m, m, m, h)
// -> flatten last three -> (gh/m, gw/m, m*m*h) -> (N_lm, merge_input_dim)
let out_h = gh / m;
let out_w = gw / m;
let merged = tokens
.reshape((out_h, m, out_w, m, h))?
.permute((0, 2, 1, 3, 4))?
.contiguous()?
.reshape((out_h * out_w, self.merge_input_dim))?;
let hidden = self.fc2.forward(&gelu_tanh(&self.fc1.forward(&merged)?)?)?;
Ok(hidden)
}
}
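The reshape and permute in `VisionMerger::forward` amount to gathering each `m×m` block of a row-major grid into one concatenated vector. The index mapping can be sketched without tensors; `merge_blocks` is a hypothetical helper that leaves out the LayerNorm and the two projections.

```rust
/// Group a row-major `gh x gw` grid of tokens into `m x m` blocks and
/// concatenate each block's tokens (row within block, then column).
/// Returns `None` when the grid is not divisible by `m`.
fn merge_blocks(tokens: &[Vec<f32>], gh: usize, gw: usize, m: usize) -> Option<Vec<Vec<f32>>> {
    if m == 0 || gh % m != 0 || gw % m != 0 || tokens.len() != gh * gw {
        return None;
    }
    let mut merged = Vec::with_capacity((gh / m) * (gw / m));
    for block_row in 0..gh / m {
        for block_col in 0..gw / m {
            let mut group = Vec::new();
            for mi in 0..m {
                for mj in 0..m {
                    // Same element order as reshape -> permute(0, 2, 1, 3, 4).
                    let (row, col) = (block_row * m + mi, block_col * m + mj);
                    group.extend_from_slice(&tokens[row * gw + col]);
                }
            }
            merged.push(group);
        }
    }
    Some(merged)
}
```

On a 4×4 grid of one-element tokens numbered 0 to 15 with `m = 2`, the first merged group is `[0, 1, 4, 5]` and the last is `[10, 11, 14, 15]`.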
/// 2D rotary position embedding for the vision tower. Each patch's
/// `head_dim` rotates by its `(row, col)` grid coordinates: the first
/// half of the rotary freqs are driven by the row position, the second
/// half by the column. Mirrors HF `Qwen3VLVisionRotaryEmbedding` +
/// `rot_pos_emb` (θ = 10000, `dim = head_dim/2`).
struct VisionRotaryEmbedding {
/// `(half,)` f32, `half = head_dim/4` freqs per spatial axis.
inv_freq: Vec<f32>,
}
impl VisionRotaryEmbedding {
fn new(head_dim: usize) -> Self {
// HF: Qwen3VLVisionRotaryEmbedding(head_dim // 2), theta 10000.
let dim = head_dim / 2;
let theta = 10000f32;
let inv_freq = (0..dim)
.step_by(2)
.map(|i| 1f32 / theta.powf(i as f32 / dim as f32))
.collect();
Self { inv_freq }
}
/// cos/sin for a `gh×gw` patch grid in **row-major** order. Returns
/// `(cos, sin)` each `(gh*gw, head_dim/2)`: per patch, the row-axis
/// freqs `row·inv_freq` followed by the col-axis freqs `col·inv_freq`
/// (then `rope_slow` duplicates them across the full head_dim).
fn cos_sin(
&self,
gh: usize,
gw: usize,
dev: &Device,
dtype: DType,
) -> candle_core::Result<(Tensor, Tensor)> {
let half = self.inv_freq.len();
let n = gh * gw;
let mut data = Vec::with_capacity(n * 2 * half);
for hi in 0..gh {
for wi in 0..gw {
for &f in &self.inv_freq {
data.push(hi as f32 * f);
}
for &f in &self.inv_freq {
data.push(wi as f32 * f);
}
}
}
let freqs = Tensor::from_vec(data, (n, 2 * half), dev)?;
let cos = freqs.cos()?.to_dtype(dtype)?;
let sin = freqs.sin()?.to_dtype(dtype)?;
Ok((cos, sin))
}
}
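The angle layout produced by `cos_sin` can be sketched with plain vectors, before the cos/sin step. The helper names below are illustrative; the frequencies follow the same `θ = 10000`, `dim = head_dim/2` rule as `VisionRotaryEmbedding::new`.

```rust
/// Inverse frequencies for one spatial axis: `head_dim / 4` entries.
fn inv_freq(head_dim: usize) -> Vec<f32> {
    let dim = head_dim / 2;
    (0..dim)
        .step_by(2)
        .map(|i| 1.0 / 10000f32.powf(i as f32 / dim as f32))
        .collect()
}

/// Rotation angles for a row-major `gh x gw` grid: per patch, the row
/// angles followed by the column angles (`head_dim / 2` values in total).
fn patch_angles(gh: usize, gw: usize, head_dim: usize) -> Vec<Vec<f32>> {
    let freqs = inv_freq(head_dim);
    let mut out = Vec::with_capacity(gh * gw);
    for row in 0..gh {
        for col in 0..gw {
            let mut angles: Vec<f32> = freqs.iter().map(|f| row as f32 * f).collect();
            angles.extend(freqs.iter().map(|f| col as f32 * f));
            out.push(angles);
        }
    }
    out
}
```

For `head_dim = 8` there are two frequencies per axis (1 and about 0.01), so the patch at row 0, column 1 of a 2×2 grid gets angles of roughly `[0, 0, 1, 0.01]`: the row half is zero and the column half carries the position.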
/// The vision tower itself.
pub struct VisionTower {
/// Sum-collapsed temporal kernel (Conv2d, see module doc).
patch_embed: Conv2d,
pos_embed: Embedding,
rotary: VisionRotaryEmbedding,
blocks: Vec<VisionBlock>,
merger: VisionMerger,
config: VisionConfig,
dtype: DType,
device: Device,
}
impl VisionTower {
/// Load from a `ShardedVarBuilder` rooted at the safetensors
/// `model.visual.` prefix. Caller is responsible for the `pp` —
/// see `Qwen3_5ForCausalLM::new` (Stage A4).
pub fn load(cfg: VisionConfig, vb: ShardedVarBuilder) -> Result<Self> {
let dtype = vb.dtype();
let device = vb.device().clone();
// patch_embed.proj is published as 5D Conv3d weight; we
// sum-collapse the temporal axis (size = temporal_patch_size)
// to get a 4D Conv2d kernel. This is exact for the static-
// image case where T = temporal_patch_size frames are
// identical (i.e. the input was duplicated along T).
let raw_weight = vb
.pp("patch_embed.proj")
.get(
(
cfg.hidden_size,
cfg.in_channels,
cfg.temporal_patch_size,
cfg.patch_size,
cfg.patch_size,
),
"weight",
)
.context("load model.visual.patch_embed.proj.weight (5D Conv3d kernel)")?;
// Sum along the temporal axis (dim 2) — see module doc-comment.
let folded = raw_weight.sum(2)?; // -> (hidden, in_channels, patch, patch)
let proj_bias = vb
.pp("patch_embed.proj")
.get(cfg.hidden_size, "bias")
.context("load model.visual.patch_embed.proj.bias")?;
let conv_cfg = Conv2dConfig {
stride: cfg.patch_size,
..Default::default()
};
let patch_embed = Conv2d::new(folded, Some(proj_bias), conv_cfg);
let pos_embed_weight = vb
.pp("pos_embed")
.get((cfg.num_position_embeddings, cfg.hidden_size), "weight")
.context("load model.visual.pos_embed.weight")?;
let pos_embed = Embedding::new(pos_embed_weight, cfg.hidden_size);
let rotary = VisionRotaryEmbedding::new(cfg.hidden_size / cfg.num_heads);
let blocks_vb = vb.pp("blocks");
let mut blocks = Vec::with_capacity(cfg.depth);
for i in 0..cfg.depth {
blocks.push(
VisionBlock::load(&cfg, &blocks_vb.pp(i))
.with_context(|| format!("load vision block {i}"))?,
);
}
let merger = VisionMerger::load(&cfg, &vb.pp("merger")).context("load vision merger")?;
Ok(Self {
patch_embed,
pos_embed,
rotary,
blocks,
merger,
config: cfg,
dtype,
device,
})
}
pub fn config(&self) -> &VisionConfig {
&self.config
}
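/// Number of LM tokens this tower emits for an `(H, W)` pixel
/// image after the merger. Equal to
/// `(H / patch_size / spatial_merge_size) * (W / patch_size / spatial_merge_size)`,
/// using integer division, so dims that are not exact multiples are
/// truncated rather than rejected.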
pub fn lm_tokens_for(&self, h: u32, w: u32) -> usize {
let m = self.config.spatial_merge_size;
let patch = self.config.patch_size;
let gh = (h as usize) / patch / m;
let gw = (w as usize) / patch / m;
gh * gw * LM_TOKENS_PER_MERGE_GROUP
}
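As a worked instance of the token count, the sketch below reproduces the arithmetic as a free function and adds a stricter variant. Both names are hypothetical; the checked variant is an illustration of rejecting non-multiple sizes, not something the tower provides.

```rust
/// LM tokens for an `h x w` pixel image: patches per side, then merge
/// groups per side. Integer division truncates, as in `lm_tokens_for`.
fn lm_tokens(h: usize, w: usize, patch: usize, merge: usize) -> usize {
    (h / patch / merge) * (w / patch / merge)
}

/// Hypothetical stricter variant: `None` unless both dims are exact
/// multiples of `patch * merge`.
fn lm_tokens_checked(h: usize, w: usize, patch: usize, merge: usize) -> Option<usize> {
    let unit = patch.checked_mul(merge).filter(|&u| u != 0)?;
    if h % unit != 0 || w % unit != 0 {
        return None;
    }
    Some((h / unit) * (w / unit))
}
```

At 448×448 with `patch = 16` and `merge = 2` this gives 28×28 patches and 14×14 = 196 LM tokens, the figure quoted in the module doc.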
/// Bilinearly interpolate the learned `pos_embed` grid (a
/// `num_grid_per_side × num_grid_per_side` table, 48×48 for Qwen3.6)
/// onto the actual `gh × gw` patch grid, in **row-major** patch
/// order. Port of the HF `fast_pos_embed_interpolate`: for each patch
/// at fractional grid coord `(linspace(0, ngrid-1, gh)[hi],
/// linspace(0, ngrid-1, gw)[wi])`, blend the 4 surrounding grid
/// entries by bilinear weights. Returns `(gh*gw, hidden)` in
/// `self.dtype`.
fn interpolated_pos_embed(&self, gh: usize, gw: usize) -> Result<Tensor> {
let ngrid = (self.config.num_position_embeddings as f64).sqrt().round() as usize;
anyhow::ensure!(
ngrid * ngrid == self.config.num_position_embeddings,
"num_position_embeddings {} is not a perfect square",
self.config.num_position_embeddings
);
// Evenly-spaced fractional indices into the [0, ngrid-1] grid.
let lin = |n: usize| -> Vec<f64> {
if n <= 1 {
vec![0.0]
} else {
let step = (ngrid - 1) as f64 / (n - 1) as f64;
(0..n).map(|i| i as f64 * step).collect()
}
};
let hs = lin(gh);
let ws = lin(gw);
let n = gh * gw;
// Four corner index sets + bilinear weight sets, row-major.
let mut idx: [Vec<u32>; 4] = [
Vec::with_capacity(n),
Vec::with_capacity(n),
Vec::with_capacity(n),
Vec::with_capacity(n),
];
let mut wts: [Vec<f32>; 4] = [
Vec::with_capacity(n),
Vec::with_capacity(n),
Vec::with_capacity(n),
Vec::with_capacity(n),
];
for &hv in &hs {
let hf = hv as usize; // floor (hv >= 0)
let hc = (hf + 1).min(ngrid - 1);
let dh = (hv - hf as f64) as f32;
for &wv in &ws {
let wf = wv as usize;
let wc = (wf + 1).min(ngrid - 1);
let dw = (wv - wf as f64) as f32;
idx[0].push((hf * ngrid + wf) as u32);
wts[0].push((1.0 - dh) * (1.0 - dw));
idx[1].push((hf * ngrid + wc) as u32);
wts[1].push((1.0 - dh) * dw);
idx[2].push((hc * ngrid + wf) as u32);
wts[2].push(dh * (1.0 - dw));
idx[3].push((hc * ngrid + wc) as u32);
wts[3].push(dh * dw);
}
}
// Blend in f32 and cast once at the end — the reference keeps
// the bilinear weights f32 against bf16 table rows; rounding
// the weights to bf16 first costs a visible slice of fixture
// parity (#15).
let mut acc: Option<Tensor> = None;
for corner in 0..4 {
let idx_t = Tensor::from_vec(std::mem::take(&mut idx[corner]), (n,), &self.device)?;
let emb = self
.pos_embed
.forward(&idx_t)?
.to_dtype(DType::F32)?; // (n, hidden)
let wt = Tensor::from_vec(std::mem::take(&mut wts[corner]), (n, 1), &self.device)?;
let term = emb.broadcast_mul(&wt)?;
acc = Some(match acc {
Some(a) => a.add(&term)?,
None => term,
});
}
acc.expect("4 corners accumulated")
.to_dtype(self.dtype)
.map_err(Into::into)
}
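The corner-and-weight scheme used by `interpolated_pos_embed` is ordinary bilinear resampling with endpoints aligned. A scalar version makes it easy to check by hand; this sketch uses one `f32` per grid node instead of a `hidden`-wide row, and the function names are illustrative.

```rust
/// Evenly spaced fractional coordinates over [0, ngrid - 1].
fn linspace(ngrid: usize, n: usize) -> Vec<f64> {
    if n <= 1 || ngrid <= 1 {
        return vec![0.0; n];
    }
    let step = (ngrid - 1) as f64 / (n - 1) as f64;
    (0..n).map(|i| i as f64 * step).collect()
}

/// Resample a row-major `ngrid x ngrid` scalar table onto `gh x gw`.
fn bilinear_resample(table: &[f32], ngrid: usize, gh: usize, gw: usize) -> Vec<f32> {
    assert_eq!(table.len(), ngrid * ngrid, "table must be ngrid x ngrid");
    if ngrid == 0 {
        return Vec::new();
    }
    let ws = linspace(ngrid, gw);
    let mut out = Vec::with_capacity(gh * gw);
    for &hv in &linspace(ngrid, gh) {
        let hf = hv as usize; // floor, since hv >= 0
        let hc = (hf + 1).min(ngrid - 1);
        let dh = (hv - hf as f64) as f32;
        for &wv in &ws {
            let wf = wv as usize;
            let wc = (wf + 1).min(ngrid - 1);
            let dw = (wv - wf as f64) as f32;
            // Blend the four surrounding nodes by their bilinear weights.
            out.push(
                table[hf * ngrid + wf] * (1.0 - dh) * (1.0 - dw)
                    + table[hf * ngrid + wc] * (1.0 - dh) * dw
                    + table[hc * ngrid + wf] * dh * (1.0 - dw)
                    + table[hc * ngrid + wc] * dh * dw,
            );
        }
    }
    out
}
```

Resampling the 2×2 table `[0, 1, 2, 3]` to 3×3 places the new centre at 1.5, the average of the four nodes, and resampling at the native size returns the table unchanged.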
/// Encode one image.
///
/// `image`: row-major `(3, H, W)` f32 tensor on `self.device`,
/// already normalised by `preprocess::preprocess`. Both `H` and
/// `W` must be multiples of `patch_size * spatial_merge_size`.
///
/// Returns `(N_lm, out_hidden_size)` — LM-side image tokens
/// ready to splice into the language model's input embeddings.
pub fn forward(&self, image: &Tensor) -> Result<Tensor> {
let (c, h, w) = image.dims3()?;
anyhow::ensure!(
c == self.config.in_channels,
"image must have {} channels, got {c}",
self.config.in_channels
);
let patch = self.config.patch_size;
anyhow::ensure!(
h.is_multiple_of(patch) && w.is_multiple_of(patch),
"image dims must be multiples of patch_size={patch}; got ({h}, {w})"
);
let gh = h / patch;
let gw = w / patch;
let n_patches = gh * gw;
anyhow::ensure!(
n_patches <= self.config.num_position_embeddings,
"patch count {n_patches} exceeds pos_embed budget {}",
self.config.num_position_embeddings
);
// Add batch axis for conv: (1, 3, H, W) → (1, hidden, gh, gw)
// → (hidden, gh, gw) → permute to (gh, gw, hidden) → flatten to (N, hidden)
let x = image.unsqueeze(0)?.to_dtype(self.dtype)?;
let x = self.patch_embed.forward(&x)?;
let x = x.squeeze(0)?;
let x = x.permute((1, 2, 0))?.contiguous()?;
let x = x.reshape((n_patches, self.config.hidden_size))?;
// Learned absolute position embeddings. The `pos_embed` table is
// a `num_position_embeddings = num_grid_per_side²` learned grid
// (48×48 for Qwen3.6); for a `gh×gw` patch grid the reference
// (`fast_pos_embed_interpolate`) bilinearly interpolates that
// grid to `gh×gw`. The legacy path (a naive sequential lookup of
// the first `n_patches` rows) mis-maps the grid stride and
// scrambles spatial structure — kept only behind
// `NEURON_VISION_LEGACY_POS=1` for A/B comparison.
let pos = if vision_legacy_pos() {
let positions = Tensor::arange(0u32, n_patches as u32, &self.device)?;
self.pos_embed.forward(&positions)?
} else {
self.interpolated_pos_embed(gh, gw)?
};
let mut x = x.add(&pos)?;
// 2D vision rotary (row/col per patch), computed once and applied
// in every block's attention. Legacy escape hatch skips it.
let rotary = if vision_legacy_rope() {
None
} else {
Some(self.rotary.cos_sin(gh, gw, &self.device, self.dtype)?)
};
let rotary_ref = rotary.as_ref();
for (i, block) in self.blocks.iter().enumerate() {
x = block
.forward(&x, rotary_ref)
.with_context(|| format!("vision block {i}"))?;
}
// (n_patches, hidden) → (gh, gw, hidden) for the merger.
let x = x.reshape((gh, gw, self.config.hidden_size))?;
self.merger.forward(&x)
}
}
/// Manually load a candle_nn `LayerNorm` from a `ShardedVarBuilder`.
/// candle_nn's `layer_norm` builder takes `candle_nn::VarBuilder`, not
/// `ShardedVarBuilder`, so the arch modules in this crate uniformly
/// load the tensors by hand and construct the struct directly (see
/// `full_attn::load_linear_no_bias`). We follow suit here.
fn layer_norm(vb: ShardedVarBuilder, size: usize) -> Result<LayerNorm> {
let weight = vb
.get(size, "weight")
.with_context(|| format!("load LayerNorm.weight at '{}'", vb.prefix()))?;
let bias = vb
.get(size, "bias")
.with_context(|| format!("load LayerNorm.bias at '{}'", vb.prefix()))?;
Ok(LayerNorm::new(weight, bias, LAYER_NORM_EPS))
}
/// Manually load a candle_nn Linear (with bias) from a
/// ShardedVarBuilder. Same rationale as `layer_norm` above.
fn linear(vb: ShardedVarBuilder, in_dim: usize, out_dim: usize) -> Result<Linear> {
let weight = vb
.get((out_dim, in_dim), "weight")
.with_context(|| format!("load Linear.weight at '{}'", vb.prefix()))?;
let bias = vb
.get(out_dim, "bias")
.with_context(|| format!("load Linear.bias at '{}'", vb.prefix()))?;
Ok(Linear::new(weight, Some(bias)))
}
/// PyTorch's `gelu_pytorch_tanh` approximation, which is what the
/// Qwen3.6 vision tower's `hidden_act` specifies. candle's
/// `Tensor::gelu` is, as far as we can tell, also a tanh-based
/// approximation (`Tensor::gelu_erf` being the exact erf form); the
/// formula is written out explicitly here so the constants are pinned:
///
/// ```text
/// gelu_tanh(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
/// ```
fn gelu_tanh(x: &Tensor) -> Result<Tensor> {
// sqrt(2 / pi) = 0.7978845608028654
const COEFF: f64 = 0.7978845608028654;
const KAPPA: f64 = 0.044715;
let x3 = x.powf(3.0)?;
let inner = (x + (x3 * KAPPA)?)?;
let inner = (inner * COEFF)?;
let t = inner.tanh()?;
let one_plus_t = (t + 1.0)?;
let out = (x * 0.5)?;
let out = out.broadcast_mul(&one_plus_t)?;
Ok(out)
}
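A scalar version of the same formula is handy for spot-checking the constants. This is a sketch in `f64` with a hypothetical name, independent of candle.

```rust
/// Scalar tanh-approximated GELU:
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
fn gelu_tanh_scalar(x: f64) -> f64 {
    let coeff = (2.0 / std::f64::consts::PI).sqrt();
    0.5 * x * (1.0 + (coeff * (x + 0.044715 * x.powi(3))).tanh())
}
```

Two properties follow directly from the formula: `gelu_tanh_scalar(0.0)` is 0, and because `tanh` is odd, `gelu_tanh_scalar(x) - gelu_tanh_scalar(-x)` equals `x` for any `x`.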
#[cfg(test)]
mod tests {
use super::*;
use candle_core::{DType, Device};
/// Build a tiny VisionConfig usable on CPU with random weights.
/// Keeps the Qwen3.6 shape relations (a multi-block stack, `hidden_size`
/// divisible by `num_heads`, `intermediate_size > hidden_size`) but with
/// small dims so tests run in milliseconds.
fn tiny_config() -> VisionConfig {
VisionConfig {
depth: 2,
hidden_size: 32,
intermediate_size: 64,
num_heads: 4,
num_position_embeddings: 64,
patch_size: 4,
temporal_patch_size: 2,
spatial_merge_size: 2,
in_channels: 3,
out_hidden_size: 48,
}
}
/// Hand-construct a VisionTower with random weights. This is the
/// same trick `linear_attn::tests::forward_smoke_with_tiny_dimensions`
/// uses — bypass the safetensors-backed `ShardedVarBuilder` path
/// (which can't be built from in-memory tensors) and assemble the
/// struct fields directly. The real `VisionTower::load` is
/// exercised by the cuda-integration smoke test in Stage A6.
fn tiny_tower(cfg: &VisionConfig) -> VisionTower {
let device = Device::Cpu;
let dtype = DType::F32;
let zeros = |shape: &[usize]| Tensor::zeros(shape, dtype, &device).unwrap();
let ones = |shape: &[usize]| Tensor::ones(shape, dtype, &device).unwrap();
let randn = |shape: &[usize]| Tensor::randn(0_f32, 0.02, shape, &device).unwrap();
let patch_embed = Conv2d::new(
randn(&[
cfg.hidden_size,
cfg.in_channels,
cfg.patch_size,
cfg.patch_size,
]),
Some(zeros(&[cfg.hidden_size])),
Conv2dConfig {
stride: cfg.patch_size,
..Default::default()
},
);
let pos_embed = Embedding::new(
randn(&[cfg.num_position_embeddings, cfg.hidden_size]),
cfg.hidden_size,
);
let mut blocks = Vec::with_capacity(cfg.depth);
for _ in 0..cfg.depth {
let head_dim = cfg.hidden_size / cfg.num_heads;
blocks.push(VisionBlock {
norm1: LayerNorm::new(
ones(&[cfg.hidden_size]),
zeros(&[cfg.hidden_size]),
LAYER_NORM_EPS,
),
qkv: Linear::new(
randn(&[3 * cfg.hidden_size, cfg.hidden_size]),
Some(zeros(&[3 * cfg.hidden_size])),
),
proj: Linear::new(
randn(&[cfg.hidden_size, cfg.hidden_size]),
Some(zeros(&[cfg.hidden_size])),
),
norm2: LayerNorm::new(
ones(&[cfg.hidden_size]),
zeros(&[cfg.hidden_size]),
LAYER_NORM_EPS,
),
fc1: Linear::new(
randn(&[cfg.intermediate_size, cfg.hidden_size]),
Some(zeros(&[cfg.intermediate_size])),
),
fc2: Linear::new(
randn(&[cfg.hidden_size, cfg.intermediate_size]),
Some(zeros(&[cfg.hidden_size])),
),
num_heads: cfg.num_heads,
head_dim,
});
}
let merge_input_dim = cfg.hidden_size * cfg.spatial_merge_size * cfg.spatial_merge_size;
let merger = VisionMerger {
norm: LayerNorm::new(
ones(&[cfg.hidden_size]),
zeros(&[cfg.hidden_size]),
LAYER_NORM_EPS,
),
fc1: Linear::new(
randn(&[merge_input_dim, merge_input_dim]),
Some(zeros(&[merge_input_dim])),
),
fc2: Linear::new(
randn(&[cfg.out_hidden_size, merge_input_dim]),
Some(zeros(&[cfg.out_hidden_size])),
),
merge_input_dim,
spatial_merge_size: cfg.spatial_merge_size,
};
let rotary = VisionRotaryEmbedding::new(cfg.hidden_size / cfg.num_heads);
VisionTower {
patch_embed,
pos_embed,
rotary,
blocks,
merger,
config: cfg.clone(),
dtype,
device,
}
}
#[test]
fn forward_with_random_weights_produces_finite_output() {
let cfg = tiny_config();
let tower = tiny_tower(&cfg);
// 16×16 image at patch_size=4 → 4×4 patches → after 2×2
// merge → 2×2 = 4 LM tokens of dim out_hidden_size.
let image = Tensor::randn(0_f32, 1.0, (3, 16, 16), &Device::Cpu).unwrap();
let out = tower.forward(&image).expect("forward");
let (n_lm, hidden) = out.dims2().unwrap();
assert_eq!(n_lm, 4);
assert_eq!(hidden, cfg.out_hidden_size);
// No NaN/Inf
let values: Vec<f32> = out.flatten_all().unwrap().to_vec1().unwrap();
assert!(
values.iter().all(|v| v.is_finite()),
"forward must produce finite values"
);
}
#[test]
fn interpolated_pos_embed_reduces_to_sequential_at_native_grid() {
// When the patch grid equals the pos_embed grid (gh=gw=ngrid),
// linspace(0,ngrid-1,ngrid) is the integer ladder, so every patch
// lands exactly on a grid node (dh=dw=0, corner-0 weight 1) and
// the bilinear result is the raw pos_embed rows in row-major
// order — i.e. identical to the legacy sequential lookup.
let cfg = tiny_config();
let tower = tiny_tower(&cfg);
let ngrid = (cfg.num_position_embeddings as f64).sqrt() as usize; // 8
let interp = tower.interpolated_pos_embed(ngrid, ngrid).unwrap();
let seq = tower
.pos_embed
.forward(&Tensor::arange(0u32, (ngrid * ngrid) as u32, &Device::Cpu).unwrap())
.unwrap();
let a: Vec<f32> = interp.flatten_all().unwrap().to_vec1().unwrap();
let b: Vec<f32> = seq.flatten_all().unwrap().to_vec1().unwrap();
assert_eq!(a.len(), b.len());
for (x, y) in a.iter().zip(b.iter()) {
assert!((x - y).abs() < 1e-5, "interp {x} vs seq {y}");
}
}
#[test]
fn vision_rotary_row_col_structure() {
// head_dim 8 → rotary dim 4 → inv_freq over [0,2] → 2 freqs/axis.
let rot = VisionRotaryEmbedding::new(8);
assert_eq!(rot.inv_freq.len(), 2);
let (cos, sin) = rot.cos_sin(2, 2, &Device::Cpu, DType::F32).unwrap();
assert_eq!(cos.dims(), &[4, 4]); // 4 patches, head_dim/2 = 4 cols
// Patch (0,0): all freqs 0 → cos 1, sin 0.
let s0: Vec<f32> = sin.i(0).unwrap().to_vec1().unwrap();
assert!(s0.iter().all(|&s| s.abs() < 1e-6));
// Patch index 2 = grid (1,0): row=1 drives the first half, col=0
// leaves the second half at zero.
let s2: Vec<f32> = sin.i(2).unwrap().to_vec1().unwrap();
assert!(s2[0].abs() > 1e-6, "row half must be non-zero");
assert!(
s2[2].abs() < 1e-6 && s2[3].abs() < 1e-6,
"col half must be zero"
);
}
#[test]
fn lm_token_count_matches_grid() {
let cfg = tiny_config();
let tower = tiny_tower(&cfg);
// 16x16 image → 4x4 patches → 2x2 = 4 LM tokens
assert_eq!(tower.lm_tokens_for(16, 16), 4);
// 32x32 image → 8x8 patches → 4x4 = 16 LM tokens
assert_eq!(tower.lm_tokens_for(32, 32), 16);
}
#[test]
fn rejects_image_with_dims_not_multiple_of_patch() {
let cfg = tiny_config();
let tower = tiny_tower(&cfg);
let image = Tensor::randn(0_f32, 1.0, (3, 17, 17), &Device::Cpu).unwrap();
let err = tower.forward(&image).unwrap_err();
assert!(format!("{err:#}").contains("patch_size"));
}
#[test]
fn rejects_image_with_wrong_channel_count() {
let cfg = tiny_config();
let tower = tiny_tower(&cfg);
let image = Tensor::randn(0_f32, 1.0, (4, 16, 16), &Device::Cpu).unwrap();
let err = tower.forward(&image).unwrap_err();
assert!(format!("{err:#}").contains("channels"));
}
#[test]
fn gelu_tanh_matches_known_values() {
// Reference values for gelu_pytorch_tanh from PyTorch:
// gelu_tanh(0.0) = 0.0
// gelu_tanh(1.0) ≈ 0.8411920071
// gelu_tanh(-1.0) ≈ -0.1588079929
let x = Tensor::new(&[0.0_f32, 1.0, -1.0], &Device::Cpu).unwrap();
let y = gelu_tanh(&x).unwrap();
let v: Vec<f32> = y.to_vec1().unwrap();
assert!((v[0]).abs() < 1e-6, "gelu_tanh(0) ≈ 0, got {}", v[0]);
assert!(
(v[1] - 0.841_192_f32).abs() < 1e-5,
"gelu_tanh(1) ≈ 0.84119, got {}",
v[1]
);
assert!(
(v[2] - -0.158_808_f32).abs() < 1e-5,
"gelu_tanh(-1) ≈ -0.15881, got {}",
v[2]
);
}
}
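The token-count arithmetic that the tests above assert (16×16 image, `patch_size` 4, 2×2 merge, 4 LM tokens) can be sketched as a standalone function. This is a minimal illustration, not the tower's actual `lm_tokens_for`: the free-function signature, the `Option` return, and the rule that the patch grid must divide evenly by the merge size are all assumptions made for this example.

```rust
/// Hypothetical helper (not the tower's real method): the number of LM
/// tokens an image yields, assuming square patches and a square merge window.
fn lm_tokens_for(height: usize, width: usize, patch_size: usize, merge: usize) -> Option<usize> {
    if patch_size == 0 || merge == 0 {
        return None;
    }
    // Image sides must divide evenly into patches.
    if height % patch_size != 0 || width % patch_size != 0 {
        return None;
    }
    let (grid_h, grid_w) = (height / patch_size, width / patch_size);
    // Assumption: the patch grid must also divide evenly into merge windows.
    if grid_h % merge != 0 || grid_w % merge != 0 {
        return None;
    }
    // Each merge x merge window of patches collapses into one LM token.
    Some((grid_h / merge) * (grid_w / merge))
}
```

With `patch_size = 4` and `merge = 2`, a 32×32 image gives an 8×8 patch grid and 16 tokens, and a 17×17 image is rejected because 17 is not a multiple of the patch size.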

File diff suppressed because it is too large

@@ -43,7 +43,7 @@
use anyhow::{Context, Result};
use cortex_core::openai::{ChatMessage, MessageContent};
use minijinja::Environment;
use minijinja::{Environment, Error as MjError, ErrorKind as MjErrorKind, Value as MjValue};
use serde_json::Value;
use std::path::Path;
@@ -65,12 +65,55 @@ pub fn chat_templates_enabled() -> bool {
}
}
/// Convenience: probe for `tokenizer_config.json` in the same
/// directory the tokenizer was loaded from. Both files come from
/// the same HuggingFace snapshot in the hf-hub cache, so the
/// sibling path is reliable.
/// Probe for the model's chat template in the same directory the
/// tokenizer was loaded from, following HuggingFace `transformers`
/// precedence: a standalone `chat_template.jinja` (then
/// `chat_template.json`) wins over the `chat_template` field in
/// `tokenizer_config.json`.
///
/// This matters for multimodal models: Qwen3-VL / Qwen3.6 ship their
/// vision-aware template (the one that emits
/// `<|vision_start|><|image_pad|><|vision_end|>` per image) **only** in
/// `chat_template.jinja`, and may not ship a `tokenizer_config.json` at
/// all. Reading `tokenizer_config.json` alone returned `None`, which
/// dropped image content into the text-only `format_qwen3_prompt`
/// fallback — so image requests rendered zero `<|image_pad|>` tokens
/// and the vision path bailed on the count mismatch.
pub fn load_chat_template_alongside(tokenizer_json_path: &Path) -> Option<String> {
let parent = tokenizer_json_path.parent()?;
// 1. Standalone Jinja file — raw template text, highest priority.
let jinja_path = parent.join("chat_template.jinja");
match std::fs::read_to_string(&jinja_path) {
Ok(text) if !text.trim().is_empty() => {
tracing::info!(
path = %jinja_path.display(),
"chat_template: loaded standalone chat_template.jinja"
);
return Some(text);
}
Ok(_) => {
tracing::warn!(
path = %jinja_path.display(),
"chat_template: chat_template.jinja present but empty; trying other sources"
);
}
Err(_) => {} // absent — fall through, common case
}
// 2. Standalone JSON file — `{"chat_template": "..."}` form.
let json_path = parent.join("chat_template.json");
if json_path.exists()
&& let Some(t) = load_chat_template_from(&json_path)
{
tracing::info!(
path = %json_path.display(),
"chat_template: loaded standalone chat_template.json"
);
return Some(t);
}
// 3. The `chat_template` field inside tokenizer_config.json.
let config_path = parent.join("tokenizer_config.json");
load_chat_template_from(&config_path)
}
@@ -148,6 +191,25 @@ pub fn render_chat_template(
kwargs: &Value,
) -> Result<String> {
let mut env = Environment::new();
// HF chat templates are authored against Python's Jinja2 with its
// string semantics. Bridge the two so real model templates render:
//
// - `pycompat::unknown_method_callback` supplies Python str/list/dict
// methods minijinja lacks natively (`startswith`, `endswith`,
// `split`, `rstrip`, `lstrip`, …) — the Qwen3.6 template uses
// several in its think-block and tool-response handling.
// - `raise_exception` is the global HF templates call to reject
// malformed inputs (e.g. an image in a system message). Map it to
// a render error so the caller falls back / surfaces it.
env.set_unknown_method_callback(minijinja_contrib::pycompat::unknown_method_callback);
env.add_function(
"raise_exception",
|msg: String| -> Result<MjValue, MjError> {
Err(MjError::new(MjErrorKind::InvalidOperation, msg))
},
);
// Compile the template against a fixed name so error messages
// surface "chat_template" rather than `<template>`.
env.add_template("chat_template", template)
@@ -159,7 +221,7 @@ pub fn render_chat_template(
// becomes a string; Parts becomes an array of content blocks.
// The HF templates handle both shapes via `content is string`
// checks or content-array iteration.
let messages_json: Vec<Value> = messages
let mut messages_json: Vec<Value> = messages
.iter()
.map(|m| {
let content_value = match &m.content {
@@ -181,6 +243,12 @@ pub fn render_chat_template(
})
.collect();
// OpenAI clients (opencode, the OpenAI SDK) carry tool-call
// `arguments` as a JSON *string*; Qwen3.6's template iterates it as a
// dict, so normalise string args to objects before rendering. Without
// this, `chat_template:120` errors "cannot convert value into pairs".
normalize_tool_call_arguments(&mut messages_json);
// Build the kwargs context. Add base bindings the template
// expects (`messages`, `add_generation_prompt`, `tools`) plus
// anything the caller passed in `chat_template_kwargs`. Caller
@@ -205,11 +273,150 @@ pub fn render_chat_template(
.context("render chat_template")
}
/// Normalize OpenAI-style tool-call `arguments` from JSON strings to
/// objects, in place, across all messages.
///
/// The OpenAI wire format carries `tool_calls[].function.arguments` as a
/// JSON *string*; HF chat templates (Qwen3.6 at `chat_template:120`)
/// iterate it as a dict (`arguments | items`), which throws "cannot
/// convert value into pairs" on a string. Parsing string args into the
/// object the template expects lets OpenAI and Anthropic clients both
/// render. A string that doesn't parse is left untouched — the render
/// then fails loudly rather than silently (see
/// `InferenceError::TemplateRenderFailed`).
fn normalize_tool_call_arguments(messages: &mut [Value]) {
for msg in messages {
let Some(tool_calls) = msg.get_mut("tool_calls").and_then(Value::as_array_mut) else {
continue;
};
for tc in tool_calls {
let Some(func) = tc.get_mut("function").and_then(Value::as_object_mut) else {
continue;
};
let parsed = match func.get("arguments") {
Some(Value::String(s)) => serde_json::from_str::<Value>(s).ok(),
_ => None,
};
if let Some(p) = parsed {
func.insert("arguments".into(), p);
}
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use serde_json::json;
/// Reproduces the Qwen3.6 vision template's image-insertion
/// condition against the OpenAI `image_url` content-part shape our
/// renderer forwards. Confirms minijinja's `'image_url' in item`
/// matches a serde_json object that carries that key — i.e. the
/// template *can* emit `<|image_pad|>` for our parts.
#[test]
fn image_url_part_renders_image_pad() {
// Condition copied from doc/vision-qwen3_6-spec.md (lines 8-18
// of the real chat_template.jinja).
let template = "{%- for message in messages -%}\
{%- if message.content is string -%}\
{{ message.content }}\
{%- else -%}\
{%- for item in message.content -%}\
{%- if 'image' in item or 'image_url' in item or item.type == 'image' -%}\
<|vision_start|><|image_pad|><|vision_end|>\
{%- elif item.type == 'text' -%}\
{{ item.text }}\
{%- endif -%}\
{%- endfor -%}\
{%- endif -%}\
{%- endfor -%}";
let messages = vec![ChatMessage {
role: "user".into(),
content: MessageContent::Parts(vec![
json!({"type": "text", "text": "what is this?"}),
json!({"type": "image_url", "image_url": {"url": "data:image/png;base64,AAA="}}),
]),
extra: Value::Object(Default::default()),
}];
let out = render_chat_template(template, &messages, &Value::Null, &Value::Null)
.expect("render should succeed");
assert!(
out.contains("<|image_pad|>"),
"expected the image_url part to emit <|image_pad|>; rendered: {out:?}"
);
}
/// `chat_template.jinja` must win over `tokenizer_config.json`'s
/// `chat_template` field — the transformers precedence Qwen3.6
/// relies on (its vision template ships only in the `.jinja` file).
#[test]
fn standalone_jinja_template_takes_precedence() {
let dir = std::env::temp_dir().join(format!(
"neuron_ct_precedence_{}_{}",
std::process::id(),
line!()
));
std::fs::create_dir_all(&dir).unwrap();
std::fs::write(dir.join("chat_template.jinja"), "FROM_JINJA").unwrap();
std::fs::write(
dir.join("tokenizer_config.json"),
r#"{"chat_template": "FROM_CONFIG"}"#,
)
.unwrap();
// tokenizer_json_path is the sibling the loader takes a parent of.
let got = load_chat_template_alongside(&dir.join("tokenizer.json"));
std::fs::remove_dir_all(&dir).ok();
assert_eq!(got.as_deref(), Some("FROM_JINJA"));
}
/// With no standalone file, fall back to the tokenizer_config.json
/// field — the text-only path stays unchanged.
#[test]
fn falls_back_to_tokenizer_config_when_no_standalone() {
let dir = std::env::temp_dir().join(format!(
"neuron_ct_fallback_{}_{}",
std::process::id(),
line!()
));
std::fs::create_dir_all(&dir).unwrap();
std::fs::write(
dir.join("tokenizer_config.json"),
r#"{"chat_template": "FROM_CONFIG"}"#,
)
.unwrap();
let got = load_chat_template_alongside(&dir.join("tokenizer.json"));
std::fs::remove_dir_all(&dir).ok();
assert_eq!(got.as_deref(), Some("FROM_CONFIG"));
}
/// The *actual* Qwen3.6-27B `chat_template.jinja` (verbatim from
/// beast's HF cache) must render in minijinja and emit exactly one
/// `<|image_pad|>` for a text+image user turn. This is the real
/// end-to-end check the unit tests above only approximate — it
/// catches any minijinja incompatibility (namespace, macros,
/// reverse slice, string methods) before it reaches production.
#[test]
fn real_qwen3_6_template_renders_one_image_pad() {
let template = include_str!("testdata/qwen3_6_chat_template.jinja");
let messages = vec![ChatMessage {
role: "user".into(),
content: MessageContent::Parts(vec![
json!({"type": "text", "text": "what is this?"}),
json!({"type": "image_url", "image_url": {"url": "data:image/png;base64,AAA="}}),
]),
extra: Value::Object(Default::default()),
}];
let out = render_chat_template(template, &messages, &Value::Null, &Value::Null)
.expect("real Qwen3.6 template should render in minijinja");
let pads = out.matches("<|image_pad|>").count();
assert_eq!(
pads, 1,
"expected exactly one <|image_pad|>; rendered:\n{out}"
);
assert!(out.contains("<|vision_start|>") && out.contains("<|vision_end|>"));
}
fn user_msg(text: &str) -> ChatMessage {
ChatMessage {
role: "user".into(),
@@ -389,4 +596,40 @@ THINK_OK\
let rendered = render_chat_template(template, &[msg], &Value::Null, &Value::Null).unwrap();
assert_eq!(rendered, "t1");
}
#[test]
fn normalizes_openai_string_tool_call_arguments_to_object() {
// The opencode / OpenAI-SDK shape: arguments as a JSON string.
let mut messages = vec![json!({
"role": "assistant",
"tool_calls": [{
"id": "c1", "type": "function",
"function": {"name": "Read", "arguments": "{\"path\":\"/x\"}"}
}]
})];
normalize_tool_call_arguments(&mut messages);
assert_eq!(
messages[0]["tool_calls"][0]["function"]["arguments"],
json!({"path": "/x"}),
"string args must become the object the template iterates"
);
}
#[test]
fn leaves_object_args_and_non_tool_messages_untouched() {
let mut messages = vec![
json!({"role": "user", "content": "hi"}),
json!({"role": "assistant", "tool_calls": [
{"function": {"name": "f", "arguments": {"a": 1}}}
]}),
];
normalize_tool_call_arguments(&mut messages);
// Already-object args pass through unchanged (Anthropic path).
assert_eq!(
messages[1]["tool_calls"][0]["function"]["arguments"],
json!({"a": 1})
);
// Ordinary messages are not disturbed.
assert_eq!(messages[0]["content"], "hi");
}
}
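The lookup order described for `load_chat_template_alongside` can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `TemplateSource` and `pick_template_source` are hypothetical names, and the sketch only reports which file would be consulted. Unlike the real loader, it does not parse JSON, so it cannot fall through when `chat_template.json` exists but has no usable `chat_template` field.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Which file a template would be read from (illustrative type, not in the diff).
#[derive(Debug, PartialEq)]
enum TemplateSource {
    StandaloneJinja(PathBuf),
    StandaloneJson(PathBuf),
    TokenizerConfig(PathBuf),
}

/// Hypothetical helper: apply the precedence order to the tokenizer's directory.
fn pick_template_source(tokenizer_json_path: &Path) -> Option<TemplateSource> {
    let parent = tokenizer_json_path.parent()?;

    // 1. Standalone Jinja file wins, unless it is empty or whitespace-only.
    let jinja = parent.join("chat_template.jinja");
    if let Ok(text) = fs::read_to_string(&jinja) {
        if !text.trim().is_empty() {
            return Some(TemplateSource::StandaloneJinja(jinja));
        }
    }

    // 2. Standalone JSON file (presence check only in this sketch).
    let json = parent.join("chat_template.json");
    if json.is_file() {
        return Some(TemplateSource::StandaloneJson(json));
    }

    // 3. The field inside tokenizer_config.json is the last resort.
    let config = parent.join("tokenizer_config.json");
    config.is_file().then(|| TemplateSource::TokenizerConfig(config))
}
```

Returning the chosen source rather than the template text keeps the precedence rule testable without a JSON dependency; a caller would still need to read and parse the selected file.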


@@ -0,0 +1,366 @@
//! Self-derived context/token limits (#67).
//!
//! The correct `limit{context,input,output}` for a deployment is not a
//! static fact an operator should memorise — it's a computed function of
//! things the neuron already knows better than any operator:
//!
//! - **model architecture** — `max_position_embeddings` and the
//! KV-cost-per-token implied by the attention layout;
//! - **live free VRAM** on the tightest card the model occupies, after
//! weights and an activation reserve;
//! - the **coherence/throughput trade-off** — "biggest that fits VRAM"
//! is not "biggest that's usable": with no cross-request KV reuse every
//! turn re-prefills the whole context, so there's a usable ceiling
//! below the VRAM ceiling (it rises as prefix caching / #11 lands).
//!
//! This module is the arch-agnostic physics + policy. Each arch's load
//! path builds a [`ContextProfile`] (the physics) via
//! [`kv_bytes_per_token`]; [`derive_limit`] applies the policy against
//! live VRAM + a self-measured prefill rate + [`ContextLimitConfig`].
//! qwen3_5 is the only arch wired today; a future standard
//! full-attention model is the simpler case (`n_full_attn_layers =
//! n_layers`) and drops in by constructing a `ContextProfile`.
use std::path::Path;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;
use cortex_core::harness::ModelLimit;
use crate::config::ContextLimitConfig;
/// EMA smoothing factor for the prefill-rate sample. Low enough that one
/// anomalous turn (a contended GPU, a cold cache) doesn't swing the
/// advertised limit, high enough to track a real shift (e.g. prefix
/// caching, #11, dropping effective prefill cost) within a few turns.
const PREFILL_EMA_ALPHA: f64 = 0.3;
/// Self-measured prefill throughput for one loaded model, as an
/// exponential moving average of tokens/sec (#67). Updated at the end of
/// each streaming request's prefill phase, read when deriving the
/// throughput ceiling. Lock-free: prefill is serialised per model (the
/// `inference_lock`), and the limit reader only needs a recent value.
/// Stores the f64 rate as raw bits; `0` means "no sample yet" → callers
/// fall back to the configured bootstrap estimate.
#[derive(Debug)]
pub struct PrefillRateEma {
bits: AtomicU64,
}
impl PrefillRateEma {
pub const fn new() -> Self {
Self {
bits: AtomicU64::new(0),
}
}
/// Fold one prefill measurement (`prompt_tokens` processed in
/// `elapsed`) into the EMA. No-op for degenerate inputs so a probe
/// request or a clock blip can't poison the average.
pub fn record(&self, prompt_tokens: usize, elapsed: Duration) {
let secs = elapsed.as_secs_f64();
if prompt_tokens == 0 || secs <= 0.0 {
return;
}
let sample = prompt_tokens as f64 / secs;
if !sample.is_finite() || sample <= 0.0 {
return;
}
let prev = f64::from_bits(self.bits.load(Ordering::Acquire));
let next = if prev > 0.0 {
PREFILL_EMA_ALPHA * sample + (1.0 - PREFILL_EMA_ALPHA) * prev
} else {
sample
};
self.bits.store(next.to_bits(), Ordering::Release);
}
/// The current measured rate (tokens/sec), or `None` before the
/// first sample lands.
pub fn get(&self) -> Option<f64> {
let v = f64::from_bits(self.bits.load(Ordering::Acquire));
(v.is_finite() && v > 0.0).then_some(v)
}
}
impl Default for PrefillRateEma {
fn default() -> Self {
Self::new()
}
}
/// Bytes per element of the KV cache. qwen3_5 keeps K/V in the model's
/// f16/bf16 compute dtype regardless of weight quantisation (ISQ
/// quantises weights, not the cache), so this is 2 for every supported
/// load. Matches the per-rank logging math in the TP load paths.
pub const KV_CACHE_DTYPE_BYTES: usize = 2;
/// Bytes of KV cache one token adds **per card**, counting only the
/// full-attention layers (linear/recurrent layers carry fixed-size
/// state, not a growing cache). Sharded across the TP world: per-rank
/// KV-head count is `n_kv_heads / world_size`.
///
/// `2 ×` accounts for K and V. Shared by the limit derivation here and
/// the per-rank load-time logging in the TP paths (and, in future, by
/// #65's length-aware pre-flight guard).
pub fn kv_bytes_per_token(
n_full_attn_layers: usize,
n_kv_heads: usize,
head_dim: usize,
dtype_bytes: usize,
world_size: u32,
) -> u64 {
let per_rank_kv_heads = (n_kv_heads / world_size.max(1) as usize).max(1);
(2 * n_full_attn_layers * per_rank_kv_heads * head_dim * dtype_bytes) as u64
}
/// Per-model physics needed to derive a context limit, captured at load
/// time (the arch config is consumed during model construction, so the
/// relevant numbers are snapshotted into this struct). Arch-agnostic:
/// the hybrid qwen3_5 case counts only its full-attention layers; a
/// standard transformer would pass `n_full_attn_layers = n_layers`.
#[derive(Debug, Clone, Copy)]
pub struct ContextProfile {
/// The model's native context ceiling (quality wall).
pub max_position_embeddings: usize,
/// KV bytes added per token, per card — from [`kv_bytes_per_token`].
pub kv_bytes_per_token_per_card: u64,
/// Tensor-parallel world size the model is loaded with (1 = single GPU).
pub world_size: u32,
}
/// Build a [`ContextProfile`] from a qwen3_5 `config.json` on disk
/// (mirrors `VisionMeta::from_config_path`). Returns `None` for any other
/// `model_type` or an unparseable config — those arches fall back to the
/// static prompt cap with no advertised limit. `world_size` is the TP
/// degree the model is loaded with (1 = single GPU).
///
/// KV grows only on full-attention layers; `layer_types` is authoritative
/// (every entry is `"full_attention"` or `"linear_attention"`), with the
/// `full_attention_interval` hint as a fallback when the array is absent.
pub fn profile_from_qwen3_5_config(config_path: &Path, world_size: u32) -> Option<ContextProfile> {
let text = std::fs::read_to_string(config_path).ok()?;
let model_type = serde_json::from_str::<serde_json::Value>(&text)
.ok()?
.get("model_type")?
.as_str()?
.to_owned();
if model_type != super::arch::qwen3_5::MODEL_TYPE {
return None;
}
let cfg: super::arch::qwen3_5::Config = serde_json::from_str(&text).ok()?;
let tc = &cfg.text_config;
let n_full_attn_layers = {
let counted = tc
.layer_types
.iter()
.filter(|t| t.as_str() == "full_attention")
.count();
if counted > 0 {
counted
} else {
// layer_types absent — derive from the interval hint.
let interval = tc.full_attention_interval.unwrap_or(4).max(1);
tc.num_hidden_layers / interval
}
};
let kv_bytes_per_token_per_card = kv_bytes_per_token(
n_full_attn_layers,
tc.num_key_value_heads,
tc.head_dim,
KV_CACHE_DTYPE_BYTES,
world_size,
);
Some(ContextProfile {
max_position_embeddings: tc.max_position_embeddings,
kv_bytes_per_token_per_card,
world_size,
})
}
/// Round a token count down to a clean boundary so the advertised limit
/// doesn't jitter by a handful of tokens as live VRAM / the throughput
/// EMA wobble between polls.
fn round_down(tokens: usize, granularity: usize) -> usize {
if granularity == 0 {
return tokens;
}
(tokens / granularity) * granularity
}
const CONTEXT_GRANULARITY: usize = 1024;
/// Derive `limit{context,input,output}` for a loaded model.
///
/// ```text
/// output = output_reserve_tokens
/// vram_ceiling = (free_tightest - activation_headroom - min_free_floor) / kv_bytes_per_token_per_card
/// throughput_ceiling = target_prefill_latency_secs × prefill_tok_per_sec
/// context = min(max_position_embeddings, vram_ceiling, throughput_ceiling) [clamped by `hard_ceiling` if set]
/// input = context - output
/// ```
///
/// `free_tightest_mb` is the minimum free VRAM (MiB) across the model's
/// devices — the tightest card, which on a TP model is often a
/// non-leader rank. `prefill_tok_per_sec` is the model's self-measured
/// prefill rate (or a bootstrap estimate before the first sample).
/// `hard_ceiling` is an optional clamp-only backstop
/// (`NEURON_MAX_PROMPT_TOKENS` or a catalogue override); `None` = no clamp.
///
/// `reasoning`: `input = context - output` keeps a generation reserve
/// below the wall; `output` (the reserve) is a *sub-budget* of context,
/// matching opencode's compaction model.
pub fn derive_limit(
profile: &ContextProfile,
free_tightest_mb: u64,
prefill_tok_per_sec: f64,
hard_ceiling: Option<usize>,
cfg: &ContextLimitConfig,
) -> ModelLimit {
let output = cfg.output_reserve_tokens;
// VRAM ceiling — what actually fits, from live free VRAM. A zero
// `free_tightest_mb` is the "unknown / no-context sentinel" (CPU
// build, or a failed per-rank query) → VRAM imposes no ceiling, the
// other terms bind, rather than collapsing the limit to zero.
let vram_ceiling = if free_tightest_mb == 0 {
usize::MAX
} else {
let reserved_mb = cfg
.activation_headroom_mb
.saturating_add(cfg.min_free_floor_mb);
let avail_bytes = free_tightest_mb
.saturating_sub(reserved_mb)
.saturating_mul(1024 * 1024);
// `checked_div` yields `None` for a degenerate zero-KV profile
// (e.g. no full-attention layers) → VRAM imposes no ceiling.
avail_bytes
.checked_div(profile.kv_bytes_per_token_per_card)
.map_or(usize::MAX, |t| t as usize)
};
// Throughput ceiling — usable, not just fittable. Fall back to the
// bootstrap estimate until the model has measured its own rate.
let tok_per_sec = if prefill_tok_per_sec.is_finite() && prefill_tok_per_sec > 0.0 {
prefill_tok_per_sec
} else {
cfg.bootstrap_prefill_tok_per_sec
};
let throughput_ceiling = (cfg.target_prefill_latency_secs * tok_per_sec).max(0.0) as usize;
let mut context = profile
.max_position_embeddings
.min(vram_ceiling)
.min(throughput_ceiling);
if let Some(clamp) = hard_ceiling {
context = context.min(clamp);
}
context = round_down(context, CONTEXT_GRANULARITY);
let input = context.saturating_sub(output);
ModelLimit {
context,
input: Some(input),
output,
}
}
#[cfg(test)]
mod tests {
use super::*;
/// beast Qwen3.6-27B: 16 full-attn layers, 4 kv heads, head_dim 256,
/// f16 (2 B), TP=2 → 64 KiB/token total, 32 KiB/token/card.
fn beast_profile() -> ContextProfile {
let kv = kv_bytes_per_token(16, 4, 256, 2, 2);
ContextProfile {
max_position_embeddings: 262144,
kv_bytes_per_token_per_card: kv,
world_size: 2,
}
}
#[test]
fn kv_bytes_matches_hand_derivation() {
// 2 × 16 × (4/2) × 256 × 2 = 32 KiB per card.
assert_eq!(kv_bytes_per_token(16, 4, 256, 2, 2), 32 * 1024);
// Single-GPU (world=1) doubles the per-card cost: 64 KiB.
assert_eq!(kv_bytes_per_token(16, 4, 256, 2, 1), 64 * 1024);
}
#[test]
fn throughput_ceiling_binds_pre_prefix_cache() {
// ~850 tok/s × 120 s ≈ 102k → the coherence wall binds below the
// VRAM ceiling on beast pre-#11. VRAM (~9.2 GB free) allows far
// more, max_position_embeddings is 262144, so throughput wins.
let cfg = ContextLimitConfig::default();
let limit = derive_limit(&beast_profile(), 9254, 850.0, None, &cfg);
// 120 × 850 = 102000 → rounded down to a multiple of 1024 → 101376.
assert_eq!(limit.context, 101376);
assert_eq!(limit.output, 8192);
assert_eq!(limit.input, Some(101376 - 8192));
assert!(limit.input.unwrap() < limit.context);
}
#[test]
fn faster_prefill_raises_the_limit() {
// Prefix caching (#11) speeds effective prefill → ceiling rises,
// eventually pinned by VRAM / max_position_embeddings.
let cfg = ContextLimitConfig::default();
let slow = derive_limit(&beast_profile(), 9254, 850.0, None, &cfg);
let fast = derive_limit(&beast_profile(), 9254, 8500.0, None, &cfg);
assert!(fast.context > slow.context);
}
#[test]
fn tighter_vram_lowers_the_limit() {
// Same model, less free VRAM → VRAM ceiling binds below throughput.
let cfg = ContextLimitConfig::default();
let roomy = derive_limit(&beast_profile(), 9254, 8500.0, None, &cfg);
let tight = derive_limit(&beast_profile(), 2600, 8500.0, None, &cfg);
assert!(tight.context < roomy.context);
}
#[test]
fn hard_ceiling_clamps_only_downward() {
let cfg = ContextLimitConfig::default();
// A backstop below the derived value clamps it.
let clamped = derive_limit(&beast_profile(), 9254, 8500.0, Some(49152), &cfg);
assert_eq!(clamped.context, 49152);
// A backstop above the derived value is a no-op.
let unclamped = derive_limit(&beast_profile(), 9254, 850.0, Some(200000), &cfg);
assert_eq!(unclamped.context, 101376);
}
#[test]
fn prefill_ema_tracks_and_ignores_degenerate_samples() {
let ema = PrefillRateEma::new();
assert_eq!(ema.get(), None);
// First real sample seeds the average exactly.
ema.record(1000, Duration::from_secs(1));
assert_eq!(ema.get(), Some(1000.0));
// Degenerate inputs are ignored (no poisoning).
ema.record(0, Duration::from_secs(1));
ema.record(1000, Duration::from_secs(0));
assert_eq!(ema.get(), Some(1000.0));
// A faster sample pulls the EMA up but is smoothed (alpha 0.3):
// 0.3*2000 + 0.7*1000 = 1300.
ema.record(2000, Duration::from_secs(1));
assert!((ema.get().unwrap() - 1300.0).abs() < 1e-6);
}
#[test]
fn zero_kv_cost_falls_back_to_other_ceilings() {
// A degenerate profile (no full-attn layers) must not divide by
// zero — VRAM ceiling becomes unbounded, others still apply.
let profile = ContextProfile {
max_position_embeddings: 32768,
kv_bytes_per_token_per_card: 0,
world_size: 1,
};
let cfg = ContextLimitConfig::default();
let limit = derive_limit(&profile, 8000, 8500.0, None, &cfg);
// max_position_embeddings (32768) binds below throughput (~1.02M).
assert_eq!(limit.context, 32768);
}
}
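The limit derivation in this file is plain arithmetic, so a worked version may help check the numbers in the tests by hand. The sketch below mirrors the formulas from the doc comment on `derive_limit`, but `Policy` and `derive_context` are hypothetical names, and the two VRAM reserve values used in the note that follows are placeholders picked for illustration, not the project's real defaults.

```rust
/// Illustrative policy knobs. Field names follow the diff; the struct itself
/// is a stand-in for `ContextLimitConfig`, whose defaults are not shown here.
struct Policy {
    output_reserve_tokens: usize,
    activation_headroom_mb: u64,
    min_free_floor_mb: u64,
    target_prefill_latency_secs: f64,
}

/// KV bytes one token adds per card: 2 (K and V) × layers × per-rank heads × head_dim × dtype.
fn kv_bytes_per_token(layers: usize, kv_heads: usize, head_dim: usize, dtype_bytes: usize, world: usize) -> u64 {
    let per_rank_heads = (kv_heads / world.max(1)).max(1);
    (2 * layers * per_rank_heads * head_dim * dtype_bytes) as u64
}

/// Returns `(context, input)`; `free_mb == 0` is treated as "VRAM unknown".
fn derive_context(max_pos: usize, kv_bytes: u64, free_mb: u64, tok_per_sec: f64, p: &Policy) -> (usize, usize) {
    let vram_ceiling = if free_mb == 0 {
        usize::MAX
    } else {
        let reserved = p.activation_headroom_mb.saturating_add(p.min_free_floor_mb);
        let avail_bytes = free_mb.saturating_sub(reserved).saturating_mul(1024 * 1024);
        // A zero KV cost means VRAM imposes no ceiling.
        avail_bytes.checked_div(kv_bytes).map_or(usize::MAX, |t| t as usize)
    };
    let throughput_ceiling = (p.target_prefill_latency_secs * tok_per_sec).max(0.0) as usize;
    // Take the tightest ceiling, then round down to a multiple of 1024 tokens.
    let context = max_pos.min(vram_ceiling).min(throughput_ceiling) / 1024 * 1024;
    (context, context.saturating_sub(p.output_reserve_tokens))
}
```

Under the placeholder reserves of 1024 MiB and 512 MiB, 9254 MiB free and 32 KiB per token leave room for 246976 tokens, so at 850 tok/s and a 120 s target the throughput ceiling (102000, rounded to 101376) binds; at 8500 tok/s the VRAM ceiling binds instead.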

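The comment on `PREFILL_EMA_ALPHA` says the average should track a real shift "within a few turns". With alpha 0.3, each new sample closes 30% of the remaining gap, so after n samples at a new rate the gap shrinks by a factor of 0.7^n. The sketch below demonstrates this; `ema_fold` and `ema_after` are hypothetical helpers written for this example and leave out the atomic storage used by `PrefillRateEma`.

```rust
/// Smoothing factor, matching the value in the diff.
const ALPHA: f64 = 0.3;

/// Fold one sample into an optional running average.
/// Non-finite or non-positive samples are ignored, as in `record`.
fn ema_fold(prev: Option<f64>, sample: f64) -> Option<f64> {
    if !sample.is_finite() || sample <= 0.0 {
        return prev;
    }
    Some(match prev {
        Some(p) => ALPHA * sample + (1.0 - ALPHA) * p,
        // The first valid sample seeds the average exactly.
        None => sample,
    })
}

/// Seed the average with `seed`, then feed `turns` samples of `new_rate`.
fn ema_after(seed: f64, new_rate: f64, turns: u32) -> f64 {
    let mut ema = ema_fold(None, seed);
    for _ in 0..turns {
        ema = ema_fold(ema, new_rate);
    }
    ema.unwrap_or(0.0)
}
```

Starting from 1000 tok/s and moving to 2000 tok/s, one sample gives 1300 and three samples give 1657, which is about two thirds of the way to the new rate.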

@@ -13,13 +13,15 @@
//! ARCH model state in this state slab will gain a companion
//! `tp_models: HashMap<TpHandle, Box<TpLeaderModel>>`.
use crate::harness::arch::qwen3_5::snapshot::KvCacheSnapshot;
use crate::harness::candle::ModelArch;
#[cfg(feature = "cuda")]
use crate::harness::device_worker::jobs::TpHandle;
use crate::harness::device_worker::jobs::{ArchHandle, Job};
use crate::harness::device_worker::jobs::{ArchHandle, ImageInput, Job, KvSnapshotId};
#[cfg(feature = "cuda")]
use crate::harness::tp::TpLeaderModel;
use crate::harness::tp::nccl_state::NcclState;
use anyhow::Context as _;
use std::collections::HashMap;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
@@ -45,6 +47,14 @@ struct DeviceWorkerState {
/// increments and returns the new value. Wraps at u64::MAX after
/// ~10^19 model loads — not a practical concern.
next_handle: u64,
/// Prefix-cache snapshots (#11), keyed by the owning model's
/// handle plus a per-worker snapshot counter. Kept beside the
/// model slab (not inside it) so every existing `get_mut` on
/// `models` stays untouched; `DropArch` prunes this map (via `retain`) so
/// snapshot tensors drop on this thread alongside the model's.
kv_snapshots: HashMap<(ArchHandle, u64), KvCacheSnapshot>,
/// Counter for minting fresh `KvSnapshotId`s.
next_kv_snapshot_id: u64,
/// Leader's NCCL state. Populated by `Job::NcclInit`; the
/// underlying `Comm`'s libnccl handle lives bound to this thread
/// for its entire lifetime. Subprocess workers maintain their own
@@ -59,6 +69,12 @@ struct DeviceWorkerState {
/// Counter for minting fresh `TpHandle`s.
#[cfg(feature = "cuda")]
next_tp_handle: u64,
/// Leader-side TP prefix snapshots (#11), keyed by the owning TP
/// handle plus the **pool-minted** snapshot id (no local counter —
/// the id must match what the subprocess ranks stored). `DropTp`
/// prunes this map (via `retain`) along with the model.
#[cfg(feature = "cuda")]
tp_kv_snapshots: HashMap<(TpHandle, u64), KvCacheSnapshot>,
#[cfg(feature = "cuda")]
#[allow(dead_code)]
/// `None` only if `CudaContext::new()` failed — in that case the
@@ -123,6 +139,10 @@ pub(crate) fn run(device_index: u32, rx: Receiver<Job>, poisoned: Arc<AtomicBool
Job::DropArch { handle, reply } => {
let removed = state.models.remove(&handle);
let was_present = removed.is_some();
// Prefix snapshots are scoped to the model: drop them
// here (on this thread) so a stale async-side id can
// never resurrect tensors from an unloaded model.
state.kv_snapshots.retain(|(h, _), _| *h != handle);
// Explicit drop on this thread — runs the Box<ModelArch>
// Drop with the CUDA context bound here, which frees
// all device tensors on the right context. The Drop is
@@ -149,6 +169,76 @@ pub(crate) fn run(device_index: u32, rx: Receiver<Job>, poisoned: Arc<AtomicBool
}
let _ = reply.send(result);
}
Job::SnapshotKv { handle, reply } => {
let result = match state.models.get(&handle) {
Some(arch) => arch.snapshot_kv_cache().map(|snap| {
let id = KvSnapshotId(state.next_kv_snapshot_id);
state.next_kv_snapshot_id = state.next_kv_snapshot_id.wrapping_add(1);
let bytes = snap.size_bytes();
state.kv_snapshots.insert((handle, id.0), snap);
tracing::debug!(
device_index,
handle = handle.0,
snapshot = id.0,
bytes,
stored = state.kv_snapshots.len(),
"device worker: kv snapshot captured"
);
(id, bytes)
}),
None => Err(anyhow::anyhow!(
"SnapshotKv: no model for handle {}",
handle.0
)),
};
let _ = reply.send(result);
}
Job::RestoreKv {
handle,
snapshot,
reply,
} => {
let result = match (
state.models.get_mut(&handle),
state.kv_snapshots.get(&(handle, snapshot.0)),
) {
(Some(arch), Some(snap)) => arch.restore_kv_cache(snap),
(None, _) => Err(anyhow::anyhow!(
"RestoreKv: no model for handle {}",
handle.0
)),
(_, None) => Err(anyhow::anyhow!(
"RestoreKv: no snapshot {} for handle {}",
snapshot.0,
handle.0
)),
};
// The replaced live cache state just freed its
// tensors — same release-to-driver point as ClearKv.
if result.is_ok() {
trim_device_pool(&state);
}
let _ = reply.send(result);
}
Job::DropKvSnapshot {
handle,
snapshot,
reply,
} => {
let was_present = state.kv_snapshots.remove(&(handle, snapshot.0)).is_some();
if was_present {
trim_device_pool(&state);
}
tracing::debug!(
device_index,
handle = handle.0,
snapshot = snapshot.0,
was_present,
stored = state.kv_snapshots.len(),
"device worker: kv snapshot dropped"
);
let _ = reply.send(());
}
Job::ForwardLogits {
handle,
tokens,
@@ -158,6 +248,35 @@ pub(crate) fn run(device_index: u32, rx: Receiver<Job>, poisoned: Arc<AtomicBool
let result = forward_logits(&mut state, handle, &tokens, offset);
let _ = reply.send(result);
}
Job::EncodeImage {
handle,
pixels,
c,
h,
w,
reply,
} => {
let result = encode_image(&mut state, handle, pixels, c, h, w);
let _ = reply.send(result);
}
Job::ForwardLogitsWithImages {
handle,
tokens,
offset,
images,
image_token_id,
reply,
} => {
let result = forward_logits_with_images(
&mut state,
handle,
&tokens,
offset,
images,
image_token_id,
);
let _ = reply.send(result);
}
Job::NcclInit {
cfg,
comm_id_hex,
@@ -171,6 +290,16 @@ pub(crate) fn run(device_index: u32, rx: Receiver<Job>, poisoned: Arc<AtomicBool
let _ = reply.send(resp);
}
#[cfg(feature = "cuda")]
Job::GetLeaderComm { reply } => {
// Clone the leader's Arc<Comm> out for the async-side
// watchdog. `None` before NcclInit. (#17 Stage 2)
let comm = state
.nccl
.comm()
.map(crate::harness::tp::nccl_state::SendComm);
let _ = reply.send(comm);
}
#[cfg(feature = "cuda")]
Job::TpLoadShard {
model_id,
config_json,
@@ -196,6 +325,7 @@ pub(crate) fn run(device_index: u32, rx: Receiver<Job>, poisoned: Arc<AtomicBool
let removed = state.tp_models.remove(&handle);
let was_present = removed.is_some();
drop(removed);
state.tp_kv_snapshots.retain(|(h, _), _| *h != handle);
tracing::debug!(
device_index,
tp_handle = handle.0,
@@ -223,6 +353,89 @@ pub(crate) fn run(device_index: u32, rx: Receiver<Job>, poisoned: Arc<AtomicBool
let _ = reply.send(result);
}
#[cfg(feature = "cuda")]
Job::TpSnapshotKv {
handle,
snapshot_id,
reply,
} => {
let result = match state.tp_models.get(&handle) {
Some(model) => {
model
.snapshot_kv_cache()
.map_err(anyhow::Error::from)
.map(|snap| {
let bytes = snap.size_bytes();
state.tp_kv_snapshots.insert((handle, snapshot_id), snap);
tracing::debug!(
device_index,
tp_handle = handle.0,
snapshot_id,
bytes,
stored = state.tp_kv_snapshots.len(),
"device worker: TP kv snapshot captured"
);
bytes
})
}
None => Err(anyhow::anyhow!(
"TpSnapshotKv: no TP model for handle {}",
handle.0
)),
};
let _ = reply.send(result);
}
#[cfg(feature = "cuda")]
Job::TpRestoreKv {
handle,
snapshot_id,
reply,
} => {
let result = match (
state.tp_models.get_mut(&handle),
state.tp_kv_snapshots.get(&(handle, snapshot_id)),
) {
(Some(model), Some(snap)) => {
model.restore_kv_cache(snap).map_err(anyhow::Error::from)
}
(None, _) => Err(anyhow::anyhow!(
"TpRestoreKv: no TP model for handle {}",
handle.0
)),
(_, None) => Err(anyhow::anyhow!(
"TpRestoreKv: no snapshot {} for handle {}",
snapshot_id,
handle.0
)),
};
if result.is_ok() {
trim_device_pool(&state);
}
let _ = reply.send(result);
}
#[cfg(feature = "cuda")]
Job::TpDropKvSnapshot {
handle,
snapshot_id,
reply,
} => {
let was_present = state
.tp_kv_snapshots
.remove(&(handle, snapshot_id))
.is_some();
if was_present {
trim_device_pool(&state);
}
tracing::debug!(
device_index,
tp_handle = handle.0,
snapshot_id,
was_present,
stored = state.tp_kv_snapshots.len(),
"device worker: TP kv snapshot dropped"
);
let _ = reply.send(());
}
#[cfg(feature = "cuda")]
Job::TpForwardLogits {
handle,
tokens,
@@ -232,6 +445,27 @@ pub(crate) fn run(device_index: u32, rx: Receiver<Job>, poisoned: Arc<AtomicBool
let result = tp_forward_logits(&mut state, handle, &tokens, offset);
let _ = reply.send(result);
}
#[cfg(feature = "cuda")]
Job::TpForwardLogitsWithImages {
handle,
tokens,
offset,
image_token_id,
image_data_uris,
chunk_size,
reply,
} => {
let result = tp_forward_logits_with_images(
&mut state,
handle,
&tokens,
offset,
image_token_id,
&image_data_uris,
chunk_size,
);
let _ = reply.send(result);
}
// Handled by the matches!() check above; reaching here
// means a Shutdown slipped past which is a bug.
Job::Shutdown => unreachable!("Shutdown should break above"),
@@ -302,9 +536,12 @@ fn init_state(device_index: u32) -> DeviceWorkerState {
device,
models: HashMap::new(),
next_handle: 1,
kv_snapshots: HashMap::new(),
next_kv_snapshot_id: 1,
nccl: NcclState::new(),
tp_models: HashMap::new(),
next_tp_handle: 1,
tp_kv_snapshots: HashMap::new(),
ctx,
}
}
@@ -315,6 +552,8 @@ fn init_state(device_index: u32) -> DeviceWorkerState {
device: candle_core::Device::Cpu,
models: HashMap::new(),
next_handle: 1,
kv_snapshots: HashMap::new(),
next_kv_snapshot_id: 1,
nccl: NcclState::new(),
}
}
@@ -704,6 +943,61 @@ fn tp_forward_logits(
Ok(values)
}
/// Image-bearing leader forward (rank 0). Preprocesses each source
/// `image_data_uris` entry through the same deterministic
/// `preprocess_data_uri` every rank runs, uploads to the leader's
/// device, encodes + splices + forwards via
/// `TpLeaderModel::forward_with_images`, and copies the `[vocab]`
/// logits to CPU. Mirrors the single-GPU `forward_logits_with_images`
/// but on the TP leader's replicated tower.
#[cfg(feature = "cuda")]
fn tp_forward_logits_with_images(
state: &mut DeviceWorkerState,
handle: TpHandle,
tokens: &[u32],
offset: usize,
image_token_id: u32,
image_data_uris: &[String],
chunk_size: usize,
) -> anyhow::Result<Vec<f32>> {
use crate::harness::preprocess::{PreprocessProfile, preprocess_data_uri};
use candle_core::{DType, Tensor};
if image_data_uris.is_empty() {
anyhow::bail!("TpForwardLogitsWithImages dispatched with zero images");
}
// Preprocess every image into a device-resident (C, H, W) tensor at
// its native-aspect resized dims (#14). Same `smart_resize` + decode
// path the subprocess workers run, so the encoded embeddings — and
// the per-image grids derived from these dims — match across ranks
// bit-for-bit.
let profile = PreprocessProfile::qwen3_6();
let mut pixels: Vec<Tensor> = Vec::with_capacity(image_data_uris.len());
for (idx, uri) in image_data_uris.iter().enumerate() {
let (px, h, w) = preprocess_data_uri(uri, &profile)
.with_context(|| format!("preprocess image[{idx}] (TP leader)"))?;
let t = Tensor::from_vec(px, (3, h as usize, w as usize), &state.device)?;
pixels.push(t);
}
let model = state.tp_models.get_mut(&handle).ok_or_else(|| {
anyhow::anyhow!(
"TpForwardLogitsWithImages: no model for handle {}",
handle.0
)
})?;
// Chunked prefill (encode once, splice per chunk) — bounded
// activation, in lockstep with the subprocess ranks.
let logits =
model.prefill_with_images_chunked(tokens, offset, &pixels, image_token_id, chunk_size)?;
let logits = logits.squeeze(0)?.squeeze(0)?;
let logits = logits.to_dtype(DType::F32)?.flatten_all()?;
let values = logits.to_vec1::<f32>()?;
Ok(values)
}
/// Forward step + copy the `[vocab]` logits to a CPU `Vec<f32>` ready
/// for sampling on the async caller. The model's `device()` (CUDA or
/// CPU) determines where the kernel runs; this fn doesn't care.
@@ -740,6 +1034,114 @@ fn forward_logits(
Ok(values)
}
/// Run the LM forward with vision-tower image splicing. Stage B3.
///
/// Encodes each image through the vision tower (`VisionTower::forward`,
/// dispatched via `ModelArch::encode_image`), concatenates the
/// resulting embeddings into a single `(N_total, hidden)` tensor, and
/// passes it to `ModelArch::forward_with_vision` along with the
/// prompt-expanded `tokens`. Image embeddings never leave the device.
///
/// Returns CPU `[vocab]` logits — same shape contract as
/// `ForwardLogits` so the async sampler doesn't have to branch on the
/// presence of images.
fn forward_logits_with_images(
state: &mut DeviceWorkerState,
handle: ArchHandle,
tokens: &[u32],
offset: usize,
images: Vec<ImageInput>,
image_token_id: u32,
) -> anyhow::Result<Vec<f32>> {
use candle_core::{DType, Tensor};
if images.is_empty() {
anyhow::bail!("ForwardLogitsWithImages dispatched with zero images");
}
// Reconstruct the preprocessed pixels into device-resident
// `(C, H, W)` tensors first (immutable `state.device` borrow), then
// take the `&mut` model borrow for the chunked prefill below.
let mut image_pixels: Vec<Tensor> = Vec::with_capacity(images.len());
for (idx, img) in images.into_iter().enumerate() {
anyhow::ensure!(
img.pixels.len() == img.c * img.h * img.w,
"ForwardLogitsWithImages: image[{idx}] pixels length {} does not match shape ({}, {}, {})",
img.pixels.len(),
img.c,
img.h,
img.w,
);
image_pixels.push(Tensor::from_vec(
img.pixels,
(img.c, img.h, img.w),
&state.device,
)?);
}
let chunk_size = crate::harness::candle::prefill_chunk_tokens();
let arch = state.models.get_mut(&handle).ok_or_else(|| {
anyhow::anyhow!("ForwardLogitsWithImages: no model for handle {}", handle.0)
})?;
// Chunked image prefill (#18): encode once, walk the prompt in
// `chunk_size` windows splicing per-chunk image-pad rows — parity
// with the TP path so a long single-GPU vision context serves
// instead of single-shot OOMing. Returns the final chunk's
// `[vocab]` logits.
let logits = arch
.prefill_with_images_chunked(tokens, offset, &image_pixels, image_token_id, chunk_size)
.context("chunked vision prefill")?;
let values = logits
.to_dtype(DType::F32)?
.flatten_all()?
.to_vec1::<f32>()?;
Ok(values)
}
/// Run the vision tower on a single preprocessed image. Stage A5.
///
/// `pixels` is a row-major `(c, h, w)` f32 image that the async-side
/// `harness::preprocess` produced. We reconstruct the tensor on the
/// worker's device (the same device the model was loaded against),
/// call `arch.encode_image`, and copy the resulting
/// `(N_lm_tokens, hidden_size)` embedding back to CPU f32.
///
/// Returns the flattened embedding as a `Vec<f32>` — the caller knows
/// the LM-side token count from `VisionTower::lm_tokens_for(h, w)`
/// and reshapes accordingly. Stage B introduces a device-resident
/// embedding-slab variant that avoids this round-trip when the next
/// forward call needs the result.
fn encode_image(
state: &mut DeviceWorkerState,
handle: ArchHandle,
pixels: Vec<f32>,
c: usize,
h: usize,
w: usize,
) -> anyhow::Result<Vec<f32>> {
use candle_core::{DType, Tensor};
anyhow::ensure!(
pixels.len() == c * h * w,
"EncodeImage: pixels length {} does not match shape ({c}, {h}, {w})",
pixels.len()
);
let image = Tensor::from_vec(pixels, (c, h, w), &state.device)?;
let arch = state
.models
.get(&handle)
.ok_or_else(|| anyhow::anyhow!("EncodeImage: no model for handle {}", handle.0))?;
let embed = arch.encode_image(&image)?;
let values = embed
.to_dtype(DType::F32)?
.flatten_all()?
.to_vec1::<f32>()?;
Ok(values)
}
/// Reply to a job with the poisoned-worker error. Used when the worker
/// has flipped into drain-only mode after a CUDA driver error.
///
@@ -770,15 +1172,37 @@ fn drain_poisoned(job: Job, device_index: u32) {
Job::ClearKv { reply, .. } => {
let _ = reply.send(Err(err()));
}
Job::SnapshotKv { reply, .. } => {
let _ = reply.send(Err(err()));
}
Job::RestoreKv { reply, .. } => {
let _ = reply.send(Err(err()));
}
Job::DropKvSnapshot { reply, .. } => {
// Same shape as DropArch: unit reply so the caller's await
// resolves; the snapshot leaks with the rest of the slab
// per the poisoned-thread design.
let _ = reply.send(());
}
Job::ForwardLogits { reply, .. } => {
let _ = reply.send(Err(err()));
}
Job::EncodeImage { reply, .. } => {
let _ = reply.send(Err(err()));
}
Job::ForwardLogitsWithImages { reply, .. } => {
let _ = reply.send(Err(err()));
}
Job::NcclInit { reply, .. } => {
let _ = reply.send(crate::harness::tp::rpc::WorkerResponse::Error {
kind: "device_worker_poisoned".into(),
message: format!("device worker {device_index} poisoned"),
});
}
#[cfg(feature = "cuda")]
Job::GetLeaderComm { reply } => {
let _ = reply.send(None);
}
Job::NcclSanity { reply } => {
let _ = reply.send(crate::harness::tp::rpc::WorkerResponse::Error {
kind: "device_worker_poisoned".into(),
@@ -798,9 +1222,27 @@ fn drain_poisoned(job: Job, device_index: u32) {
let _ = reply.send(Err(err()));
}
#[cfg(feature = "cuda")]
Job::TpSnapshotKv { reply, .. } => {
let _ = reply.send(Err(err()));
}
#[cfg(feature = "cuda")]
Job::TpRestoreKv { reply, .. } => {
let _ = reply.send(Err(err()));
}
#[cfg(feature = "cuda")]
Job::TpDropKvSnapshot { reply, .. } => {
// Bookkeeping-only — unit reply so eviction never wedges
// on a poisoned worker (same shape as DropKvSnapshot).
let _ = reply.send(());
}
#[cfg(feature = "cuda")]
Job::TpForwardLogits { reply, .. } => {
let _ = reply.send(Err(err()));
}
#[cfg(feature = "cuda")]
Job::TpForwardLogitsWithImages { reply, .. } => {
let _ = reply.send(Err(err()));
}
Job::Shutdown => {
// Filtered by the matches!() guard in run(); reaching
// here would be a logic error.

Some files were not shown because too many files have changed in this diff
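The snapshot bookkeeping in the device-worker hunks above (capture under a fresh id, restore by `(handle, id)`, drop one snapshot, and purge all of a model's snapshots when the model is unloaded) can be sketched in isolation. This is a minimal illustration and not the code from this diff: `Handle`, `Snapshot`, and `SnapshotStore` are hypothetical stand-ins for `ArchHandle`, `KvCacheSnapshot`, and the relevant `DeviceWorkerState` fields, and no device tensors are involved.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the diff's `ArchHandle`.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Handle(u64);

/// Hypothetical stand-in for `KvCacheSnapshot`; only the size is tracked.
struct Snapshot {
    bytes: usize,
}

/// Snapshots are keyed by (model handle, snapshot id), as in the diff.
struct SnapshotStore {
    snapshots: HashMap<(Handle, u64), Snapshot>,
    next_id: u64,
}

impl SnapshotStore {
    fn new() -> Self {
        // Ids start at 1, matching `next_kv_snapshot_id: 1` in `init_state`.
        Self { snapshots: HashMap::new(), next_id: 1 }
    }

    /// Store a snapshot under a fresh id and return that id.
    fn capture(&mut self, handle: Handle, snap: Snapshot) -> u64 {
        let id = self.next_id;
        self.next_id = self.next_id.wrapping_add(1);
        self.snapshots.insert((handle, id), snap);
        id
    }

    /// Look up a snapshot; an unknown (handle, id) pair is an error.
    fn restore(&self, handle: Handle, id: u64) -> Result<&Snapshot, String> {
        self.snapshots
            .get(&(handle, id))
            .ok_or_else(|| format!("no snapshot {id} for handle {}", handle.0))
    }

    /// Remove one snapshot; returns whether it was present.
    fn drop_snapshot(&mut self, handle: Handle, id: u64) -> bool {
        self.snapshots.remove(&(handle, id)).is_some()
    }

    /// Purge every snapshot belonging to an unloaded model, so a stale id
    /// held elsewhere can no longer resolve to that model's data.
    fn drop_model(&mut self, handle: Handle) {
        self.snapshots.retain(|(h, _), _| *h != handle);
    }
}
```

Keying on the pair rather than on the id alone means a restore request carrying a valid id but the wrong handle fails the lookup instead of applying another model's cache.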