Two coupled bugs surfaced after 9b0ed0b:
1. url::Url::parse("http://host:port").to_string() normalises the
empty path to "/", so rewrite_loopback_host was returning
"http://beast:13131/". Downstream callers then did
format!("{endpoint}/v1/chat/completions") and produced a
double-slash path that neuron's axum router 404'd with an empty
body. Strip the trailing slash in the rewriter so the endpoint is
a clean base string for concatenation.
2. The anthropic_messages handler returned the upstream's empty body
to the API caller as `"upstream error: "` with no journal log on
the cortex side. Operators had no way to see what happened. Add
warn-level tracing on both upstream failure paths (network error
and non-2xx) with model, node, target URL, status, and a 512-char
body snippet. The API response now carries just `"upstream
returned <status>"` — the implementation detail lives in the log.
Updates the two existing rewrite tests for the no-trailing-slash
output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Neuron hardcodes its bind_url as `http://localhost:13131` (it can't
reliably know its own externally-resolvable name). When cortex runs
on a different host than the neuron it's routing to, blindly
proxying to that URL hits localhost on the cortex box instead of the
neuron.
Cortex already knows each neuron's reachable host from cortex.toml.
After fetching the inference URL from `/models/{id}/endpoint`, if
the host is a loopback name (localhost / 127.0.0.1 / 0.0.0.0 / ::1),
swap it for the configured neuron host. Preserve the port and path
from neuron's URL so a future harness serving inference on a
different port than the management API still works.
Adds `url` (already a transitive dep via reqwest) as a direct
dep for the URL parsing.
Tests cover: localhost rewrite, distinct inference port preservation,
non-loopback passthrough, malformed input.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Realises [project-unified-models-endpoint]: cortex now surfaces every
model the operator has provisioned in the catalogue, transparently
cold-loads on the first request, and routes the request once the load
is done — without per-node configuration or client awareness of which
neuron hosts what.
cortex-core changes:
- NodeState gains `discovery: Option<DiscoveryResponse>` — populated
once per neuron on first successful poll, cached forever after
(topology is invariant for a neuron process).
- ModelProfile gains `is_feasible_on(neuron, devices)` with the
pinned_on / min_devices / min_device_vram_mb logic + 5 unit tests.
- CortexModelEntry expanded with OpenAI-compatible (`id`, `object`,
`created`, `owned_by`) plus helexa-specific extension fields
(`loaded`, `feasible_on`, `locations`).
cortex-gateway changes:
- poller.rs: `maybe_poll_discovery` fetches `GET /discovery` once per
neuron and caches on NodeState.
- handlers.rs::list_models rewritten as union of (catalogue × topology
feasibility) + (currently loaded somewhere). Catalogue-defined models
surface even when not yet loaded.
- router.rs::resolve gains priority 3 (catalogue cold-load):
1. loaded somewhere → route there
2. unloaded somewhere → route + lazy load via neuron
3. in catalogue → pick feasible neuron, POST /models/load, wait,
route. Cache the new entry locally so subsequent requests skip
the poll wait.
4. else 404
- pick_feasible_neuron prefers pinned_on neurons, falls back to any
feasible one (stable by name).
- profile_to_spec translates ModelProfile → ModelSpec, picking devices
by VRAM floor and setting tensor_parallel = min_devices for multi-
device profiles.
- "already loaded" responses from neuron are tolerated (two concurrent
requests racing the same cold-load is a benign outcome).
models.example.toml rewritten to reflect the canonical helexa fleet
(beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060) with a
working TP example (Qwen3.6-27B pinned on beast) plus single-GPU
profiles for the smaller models.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 3 of the candle-native pivot. neuron now serves
POST /v1/chat/completions backed by candle's quantized_qwen3 forward
pass on a per-model serialised generation loop, returning the standard
OpenAI ChatCompletionResponse envelope.
Pipeline per request:
- Look up the LoadedModel by request.model (404 if absent).
- Apply the Qwen3 chat template across all messages.
- Tokenize, then spawn_blocking onto tokio's blocking pool to acquire
the per-model arch lock and run prefill + greedy/temperature/top-p
sampling via LogitsProcessor.
- Stop on <|im_end|>/<|endoftext|> EOS or max_tokens (finish_reason
"stop" vs "length").
- Decode with skip_special_tokens=true, build OpenAI response with
prompt/completion/total usage counts.
Supporting changes:
- HarnessRegistry now stores Arc<dyn Harness> and caches a typed
Arc<CandleHarness> so inference routes bypass dyn-Trait dispatch.
- LoadedModel.arch becomes Arc<Mutex<ModelArch>> so the lock guard
can be moved into spawn_blocking.
- NeuronState gains an Option<Arc<CandleHarness>> field for the new
inference route.
- Typed InferenceError lets the handler map ModelNotLoaded → 404 and
other failures → 500 without string-matching anyhow messages.
- stream=true returns 501 until Stage 4 wires up SSE.
- Two leftover mistral.rs string references in proxy.rs and cortex-cli
(missed during the Stage 1 sweep) are corrected here.
Three new default-feature tests cover the no-candle 503, model-not-
loaded 404, and stream=true 501 paths. The cuda-integration test from
Stage 2 still covers real load/unload; a streaming-feature gated test
exercising actual generation will arrive with Stage 4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 1 of the candle-native pivot. Replaces the external-process
harness model (mistralrs over HTTP, llamacpp placeholder) with an
in-process Harness trait whose sole implementation is candle. The
trait keeps its shape so future engines slot in additively, but
start/stop default to no-ops and HarnessConfig drops endpoint and
systemd_unit since no harness needs external supervision.
Behaviour is unchanged on the wire: load_model returns a "not
implemented yet (Stage 2)" error and list_models is empty. The
gateway-side proxy, poller, and router are untouched.
CLAUDE.md Phase 11 (llama.cpp) and Phase 12 (mistral.rs COPR) are
marked superseded; the staged plan lives in
~/.claude/plans/create-a-more-aggressive-calm-naur.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace NodeConfig (static vram_mb, pinned) with NeuronEndpoint.
Hardware discovery and model pinning now come from neuron API and
models.toml catalogue respectively.
- config.rs: nodes -> neurons, add models_config path
- catalogue.rs: ModelProfile with pinned_on, ModelCatalogue
- poller.rs: poll neuron GET /models (ModelInfo format)
- router.rs: resolve inference endpoint via neuron GET /models/{id}/endpoint
- evictor.rs: call neuron POST /models/unload
- node.rs: remove vram_mb, pinned fields (come from discovery/catalogue)
- All 22 gateway tests updated to mock neuron API
- Remove MistralModelsResponse, ModelLifecycleRequest (no longer needed)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Emit cortex_requests_total, cortex_request_duration_seconds,
cortex_request_errors_total, and cortex_cold_starts_total with
model and node labels on every proxied request.
Add install_test_recorder() for testing metrics without HTTP listener.
Integration test verifies counters and histograms appear after proxy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire up openai_to_anthropic in the /v1/messages handler: buffer
upstream OpenAI response, parse, translate to Anthropic format
(stop_reason mapping, usage field names, content blocks).
5 integration tests covering round-trip translation, system prompt,
content blocks, and error cases.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract public poll_once() from poll_loop() for testability.
4 tests proving the poller correctly discovers models, updates
gateway state, marks unreachable nodes unhealthy, and prunes
stale models.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Confirms the existing proxy streams SSE chunks incrementally:
- 5-chunk test with 50ms delays verifies time spread between first
and last chunk arrival (not buffered)
- Verifies data: [DONE] terminator is forwarded
No src/ changes needed — Body::from_stream(bytes_stream()) already
handles SSE correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6 tests proving the scaffold works end-to-end:
- chat completion proxied through gateway to mock backend
- /health endpoint with healthy node
- /v1/models returns seeded model list
- 404 for unknown model
- 404 when no healthy nodes available
- 400 when request body missing model field
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add .gitea/workflows/ci.yml with fmt/clippy/test on all branches
and SRPM build + COPR publish on version tags
- Add cortex.spec for Fedora RPM packaging
- Add GPL-3.0-or-later LICENSE file
- Add cortex.example.toml with generic hostnames; gitignore cortex.toml
- Scrub infrastructure-specific hostnames from README.md, CLAUDE.md,
and doc comments
- Fix unused imports and clippy warnings to pass -D warnings
- Fix missing deps (bytes, reqwest, serde_json) exposed during build
- Run cargo fmt across workspace
- Update SPDX license identifier to GPL-3.0-or-later
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>