fdc0adb738725a2e2291fb84924aa067f521d59a
11 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
957f704efa
|
feat(neuron): OpenAI Responses API + ci cuda-check runner label
Some checks failed
build-prerelease / Package cortex RPM (push) Blocked by required conditions
CI / CUDA type-check (push) Failing after 11s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 32s
CI / Clippy (push) Successful in 2m31s
build-prerelease / Build cortex binary (push) Successful in 4m32s
CI / Test (push) Successful in 5m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
Step 2 of the Responses rollout: native `/v1/responses` endpoint on
neuron that consumes the same InferenceEvent stream as
`/v1/chat/completions` but emits it as the Responses API's named
SSE event family. No gateway-side translation.
## Surface
- `cortex-core::responses` envelope types: `ResponsesRequest`,
`ResponsesInput` (text | items), `ResponsesInputItem` (message |
function_call | function_call_output | reasoning),
`ResponsesContentPart` (input_text | input_image | output_text),
`ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. Plus
a `events::*` constant module so the projector and the wire shape
stay in sync without string-typos.
- `neuron::wire::openai_responses`:
- `request_to_chat(req)` flattens Responses input + instructions
into a `ChatCompletionRequest` the candle harness already
understands. Text-only Parts collapse to a string; mixed
text+image Parts go to chat's content-array shape; reasoning
items drop; function_call / function_call_output round-trip
via tool_calls / tool_call_id metadata so the surface is
consistent for the day the harness emits tool calls.
- `project_responses_stream(rx, meta)` reads InferenceEvents
and emits the eight named events that compose a Responses
stream: response.created → output_item.added → content_part.added
→ output_text.delta×N → output_text.done → content_part.done
→ output_item.done → response.completed. Synthesises start
frames if the producer skips Start (poisoned model, early
disconnect) so the stream stays coherent.
- `build_response(meta, text, reason, usage)` for the
non-streaming path.
- `CandleHarness::inference_stream(req)` extracted from
`chat_completion_stream`, returning a typed `InferenceStream`
(event receiver + id/created/model_id metadata). Both
`chat_completion_stream` and the new `responses_stream` are now
thin wrappers that pick their wire projection. TP path got the
same treatment (`chat_completion_tp_stream` → `inference_tp_stream`).
- `POST /v1/responses` route on neuron. Non-streaming returns one
buffered `ResponsesResponse`; streaming returns axum SSE with
both event names and JSON data per frame (Responses, unlike
chat completions, uses named `event:` lines). Reused
`inference_error_response` helper hoisted out so the chat and
responses handlers share the InferenceError → HTTP mapping.
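A minimal sketch of the projection order described for `project_responses_stream`, using only the standard library. The `InferenceEvent` variants, the `project_events` name, and the string payloads are assumptions made for illustration; the real projector is async and serialises JSON. Event names are written with the full `response.` prefix that the Responses API uses on the wire, which the list above abbreviates.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical stand-in for the harness event type; the real
/// `InferenceEvent` is not shown in this commit.
#[derive(Debug, Clone, PartialEq)]
pub enum InferenceEvent {
    Start,
    Delta(String),
    Finish,
}

// Named SSE events, in the order a Responses stream emits them.
pub const CREATED: &str = "response.created";
pub const OUTPUT_ITEM_ADDED: &str = "response.output_item.added";
pub const CONTENT_PART_ADDED: &str = "response.content_part.added";
pub const OUTPUT_TEXT_DELTA: &str = "response.output_text.delta";
pub const OUTPUT_TEXT_DONE: &str = "response.output_text.done";
pub const CONTENT_PART_DONE: &str = "response.content_part.done";
pub const OUTPUT_ITEM_DONE: &str = "response.output_item.done";
pub const COMPLETED: &str = "response.completed";

/// One projected frame: (event name, payload). Payloads are plain
/// strings here; a real projector would carry JSON.
pub type Frame = (&'static str, String);

fn push_start_frames(frames: &mut Vec<Frame>) {
    frames.push((CREATED, String::new()));
    frames.push((OUTPUT_ITEM_ADDED, String::new()));
    frames.push((CONTENT_PART_ADDED, String::new()));
}

/// Drain `rx` until Finish or channel close and return the frames.
pub fn project_events(rx: Receiver<InferenceEvent>) -> Vec<Frame> {
    let mut frames = Vec::new();
    let mut started = false;
    let mut text = String::new();

    for event in rx {
        if !started {
            // Emitted on a normal Start, and synthesised when the
            // producer skipped Start and went straight to a delta.
            push_start_frames(&mut frames);
            started = true;
        }
        match event {
            InferenceEvent::Start => {}
            InferenceEvent::Delta(chunk) => {
                text.push_str(&chunk);
                frames.push((OUTPUT_TEXT_DELTA, chunk));
            }
            InferenceEvent::Finish => break,
        }
    }

    if !started {
        // Producer went away before sending anything at all.
        push_start_frames(&mut frames);
    }
    frames.push((OUTPUT_TEXT_DONE, text));
    frames.push((CONTENT_PART_DONE, String::new()));
    frames.push((OUTPUT_ITEM_DONE, String::new()));
    frames.push((COMPLETED, String::new()));
    frames
}
```

Keeping the names in constants mirrors the stated purpose of the `events::*` module: the projector and the wire types refer to the same strings, so a typo becomes a compile error rather than a silent mismatch.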
## CI
Also bundles the `cuda-check` runner-label fix from feedback on
commit
|
|||
|
24e20dcb5c
|
feat(catalogue,gateway): model aliases (helexa/small, helexa/balanced, helexa/large)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 40s
CI / Clippy (push) Successful in 2m21s
CI / Test (push) Successful in 4m40s
build-prerelease / Build neuron-blackwell (push) Successful in 3m38s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 5m20s
build-prerelease / Build neuron-ada (push) Successful in 4m45s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m10s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 9m40s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Operators can now define tier aliases in models.toml:
[aliases]
"helexa/small" = "Qwen/Qwen3-1.7B"
"helexa/balanced" = "Qwen/Qwen3-8B"
"helexa/large" = "Qwen/Qwen3.6-27B"
A client request for `model: "helexa/small"` is resolved to the concrete
model id at routing time. The gateway also rewrites the proxied body's
`model` field to the concrete id so neuron sees a name that matches its
loaded handle (otherwise the harness rejects the request).
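As a rough illustration of the resolution step, the sketch below resolves an alias with a single map lookup and falls through unchanged for non-alias ids. The struct layout, the explicit lifetimes, the single-hop behaviour (aliases are not chained), and the `tier_catalogue` helper are assumptions; only the `aliases` field and the `resolve_alias(&str) -> &str` shape come from this commit.

```rust
use std::collections::HashMap;

/// Reduced stand-in for the catalogue; only the alias table is modelled.
pub struct ModelCatalogue {
    pub aliases: HashMap<String, String>,
}

impl ModelCatalogue {
    /// Return the concrete model id for `requested`, or `requested`
    /// itself when it is not an alias (no-op fall-through).
    pub fn resolve_alias<'a>(&'a self, requested: &'a str) -> &'a str {
        self.aliases
            .get(requested)
            .map(String::as_str)
            .unwrap_or(requested)
    }
}

/// Hypothetical helper that builds the three tier aliases by hand
/// instead of parsing models.toml.
pub fn tier_catalogue() -> ModelCatalogue {
    let mut aliases = HashMap::new();
    aliases.insert("helexa/small".to_string(), "Qwen/Qwen3-1.7B".to_string());
    aliases.insert("helexa/balanced".to_string(), "Qwen/Qwen3-8B".to_string());
    aliases.insert("helexa/large".to_string(), "Qwen/Qwen3.6-27B".to_string());
    ModelCatalogue { aliases }
}
```

The resolved id is what the gateway would write back into the proxied body's `model` field, so neuron only ever sees names that match its loaded handles.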
Motivated by the finger-in-the-wind benchmark: same "what's the capital
of Georgia" probe runs in 2.5s on the 1.7B vs 6.7s on the 27B with
identical correctness. Aliases let clients pick a latency tier without
hardcoding model ids, and let operators swap targets without changing
client code.
Changes:
* cortex-core: `ModelCatalogue` gains `aliases: HashMap<String, String>`
+ `resolve_alias(&str) -> &str`. Unit tests cover the basic
resolution + TOML round-trip.
* cortex-gateway:
* `RouteDecision` gains `resolved_model_id: String`. `router::resolve`
consumes aliases at entry and threads the concrete id through.
* Handlers (chat_completions, completions, anthropic_messages
streaming + non-streaming) rewrite the body's `model` field with
`rewrite_model_in_body` before proxying, using the resolved id
for metrics labels, LRU touch, and the body itself.
* `/v1/models` (Pass 4) emits each alias as its own entry mirroring
the target's `loaded` flag, feasible_on, and locations — clients
browsing the endpoint see both names and can pick either.
* `models.toml` declares the three tier aliases; `models.example.toml`
documents the section as opt-in.
* Integration tests verify: end-to-end alias→concrete request flow,
alias surfacing in /v1/models, and no-op fall-through for
non-alias model ids.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
b9e7a76a7a
|
feat(gateway): surface mid-prewarm models as Loading on /v1/models
The poller now fetches /health alongside /models on each neuron and
stashes the activation snapshot on NodeState. The /v1/models handler
gains a Pass 3 that synthesises Loading locations from each neuron's
activation.in_progress and activation.pending lists, so a catalogued
model that's mid-prewarm surfaces as `status: "loading"` rather than
appearing absent (loaded=false, locations=[]).
Without this, a client polling /v1/models during a beast restart sees
Qwen3.6-27B disappear for the ~5 minutes the q5k load takes, then
reappear. Now it stays visible the whole time with a clear status.
Adds ModelStatus::Loading to cortex-core. The router's per-node
priority loop gets an explicit (no-op) arm: Loading models aren't
routable yet, and falling through to the catalogue cold-load path is
the existing race, no worse than before, but tagged as a known
follow-up needing neuron-side in-flight tracking on /models/load.
New test_poller_captures_activation_from_health exercises the full
round-trip: mock neuron with empty /models but a pre_warming /health →
poller writes node.activation. Common test helpers gain
spawn_mock_neuron_with_models_and_health and default_health_response.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
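The Pass 3 synthesis described in this commit could look roughly like the sketch below. The `Activation` and `Location` shapes, the `(node name, activation)` pairs, and the `loading_locations` function name are assumptions for illustration; only `ModelStatus::Loading` and the `in_progress` / `pending` fields are named in the message.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ModelStatus {
    Loaded,
    Loading,
}

/// Reduced view of a neuron's activation snapshot taken from /health.
#[derive(Debug, Clone, Default)]
pub struct Activation {
    pub in_progress: Option<String>,
    pub pending: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Location {
    pub node: String,
    pub status: ModelStatus,
}

/// For one catalogued model, synthesise a Loading location for every
/// node that is currently loading it or has it queued for pre-warm.
pub fn loading_locations(model_id: &str, nodes: &[(String, Activation)]) -> Vec<Location> {
    nodes
        .iter()
        .filter(|(_, act)| {
            act.in_progress.as_deref() == Some(model_id)
                || act.pending.iter().any(|id| id == model_id)
        })
        .map(|(name, _)| Location {
            node: name.clone(),
            status: ModelStatus::Loading,
        })
        .collect()
}
```

A non-empty result is what keeps the model visible on /v1/models during the load window instead of showing `locations=[]`.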
|||
|
800498f530
|
feat(neuron): bind listener before pre-warm, surface activation in /health
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 41s
CI / Clippy (push) Successful in 2m26s
build-prerelease / Build neuron-blackwell (push) Successful in 3m34s
CI / Test (push) Successful in 4m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m29s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Two coupled changes addressing the 2026-05-26 validate-neuron failure
where a fresh deploy of beast had /health unreachable for ~5 minutes
while Qwen3.6-27B q5k materialised, even though systemd reported the
unit as active.
1. main.rs no longer awaits load_default_models before binding axum.
The listener binds first; pre-warm runs in a spawned background
task that holds a read lock on the harness registry for the
duration of its sequential load loop. Concurrent on-demand
/models/load and /v1/chat/completions traffic still flow.
2. /health gains an `activation` field carrying:
state        pre_warming | ready
pending      model ids queued but not started
in_progress  model id currently loading (Option)
completed    model ids loaded successfully this activation
failed       [{model_id, error}] for failed entries
The field is `#[serde(default)]` so a pre-change cortex polling a
new neuron — or vice versa — keeps working.
`ActivationTracker` (new module `neuron::activation`) owns the
RwLock-wrapped state; load_default_models takes a tracker reference
and updates it per-model. NeuronState holds an Arc clone for the
/health handler.
Tests updated to construct trackers and assert state transitions
(empty noop, two failures → ready with both in `failed`).
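A sketch of how such a tracker might be shaped, using only `std::sync`. The field types, the `new` / `snapshot` methods, the `pre_warm` driver, and the choice to start an empty queue in the ready state are assumptions; the commit only names `ActivationTracker`, the RwLock-wrapped state, and the five `activation` fields.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone, PartialEq)]
pub enum ActivationState {
    PreWarming,
    Ready,
}

#[derive(Debug, Clone)]
pub struct Activation {
    pub state: ActivationState,
    pub pending: Vec<String>,
    pub in_progress: Option<String>,
    pub completed: Vec<String>,
    pub failed: Vec<(String, String)>, // (model_id, error)
}

/// Cheap to clone: the /health handler and the pre-warm task would
/// each hold a handle to the same state.
#[derive(Clone)]
pub struct ActivationTracker(Arc<RwLock<Activation>>);

impl ActivationTracker {
    pub fn new(queued: Vec<String>) -> Self {
        let state = if queued.is_empty() {
            ActivationState::Ready
        } else {
            ActivationState::PreWarming
        };
        ActivationTracker(Arc::new(RwLock::new(Activation {
            state,
            pending: queued,
            in_progress: None,
            completed: Vec::new(),
            failed: Vec::new(),
        })))
    }

    /// What a /health handler would serialise.
    pub fn snapshot(&self) -> Activation {
        self.0.read().unwrap().clone()
    }
}

/// Sequential load loop. The lock is held only while updating the
/// tracker, never across `load`, so readers are not blocked by a
/// multi-minute model load.
pub fn pre_warm<F>(tracker: &ActivationTracker, mut load: F)
where
    F: FnMut(&str) -> Result<(), String>,
{
    loop {
        let next = {
            let mut a = tracker.0.write().unwrap();
            if a.pending.is_empty() {
                a.state = ActivationState::Ready;
                break;
            }
            let id = a.pending.remove(0);
            a.in_progress = Some(id.clone());
            id
        };
        let outcome = load(&next);
        let mut a = tracker.0.write().unwrap();
        a.in_progress = None;
        match outcome {
            Ok(()) => a.completed.push(next),
            Err(e) => a.failed.push((next, e)),
        }
    }
}
```

In the daemon, `pre_warm` would run on a spawned background task after the listener has bound, with a clone of the tracker handed to the /health handler.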
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
735945ee81
|
feat(cortex): unified /v1/models — catalogue × topology feasibility + cold-load
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Format (push) Successful in 48s
CI / Clippy (push) Successful in 2m12s
CI / Test (push) Successful in 4m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 5m10s
build-prerelease / Build neuron-blackwell (push) Successful in 3m35s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Realises [project-unified-models-endpoint]: cortex now surfaces every
model the operator has provisioned in the catalogue, transparently
cold-loads on the first request, and routes the request once the load
is done — without per-node configuration or client awareness of which
neuron hosts what.
cortex-core changes:
- NodeState gains `discovery: Option<DiscoveryResponse>` — populated
once per neuron on first successful poll, cached forever after
(topology is invariant for a neuron process).
- ModelProfile gains `is_feasible_on(neuron, devices)` with the
pinned_on / min_devices / min_device_vram_mb logic + 5 unit tests.
- CortexModelEntry expanded with OpenAI-compatible (`id`, `object`,
`created`, `owned_by`) plus helexa-specific extension fields
(`loaded`, `feasible_on`, `locations`).
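One plausible reading of the feasibility check, sketched under assumptions: `pinned_on` is a list where empty means "no pin", a device qualifies when its VRAM meets `min_device_vram_mb`, and at least `min_devices` devices must qualify. The `Device` type and the field types are hypothetical; the commit only names the three profile fields and the method.

```rust
/// Hypothetical per-GPU record taken from a neuron's discovery response.
pub struct Device {
    pub vram_mb: u64,
}

pub struct ModelProfile {
    pub pinned_on: Vec<String>, // assumption: empty means unpinned
    pub min_devices: usize,
    pub min_device_vram_mb: u64,
}

impl ModelProfile {
    /// True when `neuron` is allowed by the pin (if any) and has enough
    /// devices that each meet the per-device VRAM floor.
    pub fn is_feasible_on(&self, neuron: &str, devices: &[Device]) -> bool {
        if !self.pinned_on.is_empty() && !self.pinned_on.iter().any(|n| n == neuron) {
            return false;
        }
        let qualifying = devices
            .iter()
            .filter(|d| d.vram_mb >= self.min_device_vram_mb)
            .count();
        // Treat min_devices = 0 as 1 so a GPU-less neuron is never feasible.
        qualifying >= self.min_devices.max(1)
    }
}
```

With a two-device profile pinned on `beast`, a single-GPU neuron fails on both the pin and the device count, which is the behaviour the tensor-parallel example in models.example.toml relies on.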
cortex-gateway changes:
- poller.rs: `maybe_poll_discovery` fetches `GET /discovery` once per
neuron and caches on NodeState.
- handlers.rs::list_models rewritten as union of (catalogue × topology
feasibility) + (currently loaded somewhere). Catalogue-defined models
surface even when not yet loaded.
- router.rs::resolve gains priority 3 (catalogue cold-load):
1. loaded somewhere → route there
2. unloaded somewhere → route + lazy load via neuron
3. in catalogue → pick feasible neuron, POST /models/load, wait,
route. Cache the new entry locally so subsequent requests skip
the poll wait.
4. else 404
- pick_feasible_neuron prefers pinned_on neurons, falls back to any
feasible one (stable by name).
- profile_to_spec translates ModelProfile → ModelSpec, picking devices
by VRAM floor and setting tensor_parallel = min_devices for multi-
device profiles.
- "already loaded" responses from neuron are tolerated (two concurrent
requests racing the same cold-load is a benign outcome).
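The priority order in `router.rs::resolve` can be sketched as a pure function over a snapshot of the fleet. Everything below is illustrative: the `Node`, `Profile`, and `Route` types are hypothetical, the actual load call and wait are left out, and treating a catalogued model with no feasible neuron as not found is an assumption.

```rust
/// Hypothetical snapshot of what the poller last saw on one neuron.
pub struct Node {
    pub name: &'static str,
    pub loaded: Vec<&'static str>,
    pub unloaded: Vec<&'static str>,
}

/// Hypothetical catalogue entry with feasibility already computed.
pub struct Profile {
    pub id: &'static str,
    pub pinned_on: Vec<&'static str>,
    pub feasible_on: Vec<&'static str>,
}

#[derive(Debug, PartialEq)]
pub enum Route {
    Loaded(&'static str),   // 1. route straight to the node
    LazyLoad(&'static str), // 2. route, neuron loads on demand
    ColdLoad(&'static str), // 3. POST /models/load, wait, then route
    NotFound,               // 4. 404
}

/// Prefer a pinned neuron; otherwise the first feasible one by name.
pub fn pick_feasible_neuron(profile: &Profile) -> Option<&'static str> {
    let mut feasible = profile.feasible_on.clone();
    feasible.sort();
    feasible
        .iter()
        .copied()
        .find(|n| profile.pinned_on.iter().any(|p| p == n))
        .or_else(|| feasible.first().copied())
}

pub fn resolve(model: &str, nodes: &[Node], catalogue: &[Profile]) -> Route {
    if let Some(n) = nodes.iter().find(|n| n.loaded.iter().any(|m| *m == model)) {
        return Route::Loaded(n.name);
    }
    if let Some(n) = nodes.iter().find(|n| n.unloaded.iter().any(|m| *m == model)) {
        return Route::LazyLoad(n.name);
    }
    if let Some(profile) = catalogue.iter().find(|p| p.id == model) {
        if let Some(node) = pick_feasible_neuron(profile) {
            return Route::ColdLoad(node);
        }
    }
    Route::NotFound
}
```

Sorting before picking is what makes the fallback "stable by name": two gateways looking at the same fleet choose the same neuron for a cold load.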
models.example.toml rewritten to reflect the canonical helexa fleet
(beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060) with a
working TP example (Qwen3.6-27B pinned on beast) plus single-GPU
profiles for the smaller models.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
3cccc2c56b
|
refactor(neuron): cut mistralrs/llamacpp, scaffold candle harness
Stage 1 of the candle-native pivot. Replaces the external-process
harness model (mistralrs over HTTP, llamacpp placeholder) with an
in-process Harness trait whose sole implementation is candle. The trait
keeps its shape so future engines slot in additively, but start/stop
default to no-ops and HarnessConfig drops endpoint and systemd_unit
since no harness needs external supervision.
Behaviour is unchanged on the wire: load_model returns a "not
implemented yet (Stage 2)" error and list_models is empty. The
gateway-side proxy, poller, and router are untouched.
CLAUDE.md Phase 11 (llama.cpp) and Phase 12 (mistral.rs COPR) are
marked superseded; the staged plan lives in
~/.claude/plans/create-a-more-aggressive-calm-naur.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
3f94c50817
|
chore: move default ports out of common-collision ranges
Previous defaults collided with well-trodden infra services and with
the Linux ephemeral port range:
- cortex API 8000: common dev-server default (Django, minio UI)
- cortex metrics 9100: Prometheus node_exporter default
- neuron API 9090: Cockpit default on Fedora, Prometheus self
Move to helexa-themed palindromic ports, all below Linux's 32768-60999
ephemeral range and not registered to any well-known service:
- cortex API 31313
- cortex metrics 31314
- neuron API 13131
Updated places:
- cortex.example.toml, neuron.example.toml defaults
- default impls in cortex-core and neuron config
- cortex-cli --endpoint default for the status subcommand
- doc comments citing example URLs
- README.md and CLAUDE.md snippets
Consumers already on the old ports need a one-line edit in their
/etc/cortex/cortex.toml or /etc/neuron/neuron.toml to match; firewall
rules and prometheus scrape configs will also need updating.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|||
|
e42e8ee81f
|
refactor: cortex talks to neurons instead of mistral.rs directly
Replace NodeConfig (static vram_mb, pinned) with NeuronEndpoint.
Hardware discovery and model pinning now come from neuron API and
models.toml catalogue respectively.
- config.rs: nodes -> neurons, add models_config path
- catalogue.rs: ModelProfile with pinned_on, ModelCatalogue
- poller.rs: poll neuron GET /models (ModelInfo format)
- router.rs: resolve inference endpoint via neuron GET /models/{id}/endpoint
- evictor.rs: call neuron POST /models/unload
- node.rs: remove vram_mb, pinned fields (come from discovery/catalogue)
- All 22 gateway tests updated to mock neuron API
- Remove MistralModelsResponse, ModelLifecycleRequest (no longer needed)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|||
|
6dc717ebcd
|
feat: add neuron daemon with GPU discovery and health endpoints
Replace cortex-agent stub with neuron (cortex-neuron binary).
cortex-core additions:
- discovery.rs: DeviceInfo, DiscoveryResponse, DeviceHealth, HealthResponse
- harness.rs: Harness async trait, HarnessConfig, ModelSpec, ModelInfo
neuron crate (crates/neuron/):
- discovery.rs: nvidia-smi CSV parsing (pure functions) + system discovery via uname/nvidia-smi/nvcc
- health.rs: cached GPU health polling every 5s
- api.rs: GET /discovery and GET /health axum handlers
- main.rs: CLI entrypoint with --port flag (default 9090)
- harness stubs for mistralrs (Phase 8) and llamacpp (Phase 11)
12 new tests (9 unit + 3 integration), 35 total.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
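To illustrate the "pure functions" approach to nvidia-smi CSV parsing mentioned in this commit, here is a sketch that parses output as produced by `nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader,nounits`. The chosen query fields, the `DeviceInfo` field names, and the `parse_gpu_csv` function are assumptions; the actual discovery.rs may query different columns.

```rust
/// Hypothetical subset of the discovery record for one GPU.
#[derive(Debug, PartialEq)]
pub struct DeviceInfo {
    pub index: u32,
    pub name: String,
    pub vram_mb: u64,
}

/// Parse one line such as `0, NVIDIA GeForce RTX 4090, 24564`.
fn parse_line(line: &str) -> Result<DeviceInfo, String> {
    let fields: Vec<&str> = line.split(',').map(str::trim).collect();
    if fields.len() != 3 {
        return Err(format!("expected 3 fields, got {}: {:?}", fields.len(), line));
    }
    let index = fields[0]
        .parse::<u32>()
        .map_err(|e| format!("bad index {:?}: {}", fields[0], e))?;
    let vram_mb = fields[2]
        .parse::<u64>()
        .map_err(|e| format!("bad memory.total {:?}: {}", fields[2], e))?;
    Ok(DeviceInfo {
        index,
        name: fields[1].to_string(),
        vram_mb,
    })
}

/// Parse the whole command output. Takes a string rather than running
/// the command, so it can be unit-tested on machines without a GPU.
pub fn parse_gpu_csv(output: &str) -> Result<Vec<DeviceInfo>, String> {
    output
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(parse_line)
        .collect()
}
```

Separating parsing from process spawning is what lets the unit tests feed canned nvidia-smi output and assert on the result, independent of the host hardware.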
|||
|
6bb3004cfc
|
ci: add Gitea CI, RPM spec, license, and repo hygiene
- Add .gitea/workflows/ci.yml with fmt/clippy/test on all branches and SRPM build + COPR publish on version tags
- Add cortex.spec for Fedora RPM packaging
- Add GPL-3.0-or-later LICENSE file
- Add cortex.example.toml with generic hostnames; gitignore cortex.toml
- Scrub infrastructure-specific hostnames from README.md, CLAUDE.md, and doc comments
- Fix unused imports and clippy warnings to pass -D warnings
- Fix missing deps (bytes, reqwest, serde_json) exposed during build
- Run cargo fmt across workspace
- Update SPDX license identifier to GPL-3.0-or-later
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|||
|
0da68833af
|
feat: scaffold cortex workspace
Rust reverse-proxy for multi-node mistral.rs inference clusters.
Includes crate structure (cortex-core, cortex-gateway, cortex-agent,
cortex-cli), config loading, OpenAI/Anthropic translation stubs, model
routing, eviction, polling, and streaming proxy scaffolding.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|