Step 3 of the Responses rollout: plain proxy route on the gateway,
no translation. Neuron speaks the Responses API natively after Step
2 (commit 957f704), so the gateway just needs the same routing
shape it uses for /v1/chat/completions — extract `model`, resolve
via router::resolve, forward verbatim.
- New `POST /v1/responses` handler in handlers.rs::responses.
- Mock neuron under tests/common/mod.rs gains a `/v1/responses`
  endpoint that mirrors the ResponsesResponse shape neuron emits.
- New integration test file `tests/responses.rs` exercises:
  - Happy path (200, body round-trips, ResponsesUsage shape).
  - Unknown model → 404 (matches chat-completions error shape).
  - Missing `model` field → 400 (same extract_model helper).
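
The three cases the tests exercise can be sketched as a small routing
function. This is a stdlib-only illustration, not the gateway's code:
`RouteError`, `route`, and the `known` map are hypothetical names, and
the real handler works on a JSON body via `extract_model` and
`router::resolve`.

```rust
use std::collections::HashMap;

/// Hypothetical error type for illustration; the gateway's real type may differ.
#[derive(Debug, PartialEq)]
enum RouteError {
    MissingModel,
    UnknownModel(String),
}

impl RouteError {
    /// HTTP status the integration tests expect for each failure.
    fn status(&self) -> u16 {
        match self {
            RouteError::MissingModel => 400,
            RouteError::UnknownModel(_) => 404,
        }
    }
}

/// Resolve the requested model to an upstream base URL.
/// `model` stands in for the `model` field already pulled out of the body;
/// `known` maps model ids to the neuron that serves them (an assumption).
fn route<'a>(
    model: Option<&str>,
    known: &'a HashMap<String, String>,
) -> Result<&'a str, RouteError> {
    let id = model.ok_or(RouteError::MissingModel)?;
    known
        .get(id)
        .map(String::as_str)
        .ok_or_else(|| RouteError::UnknownModel(id.to_string()))
}
```

On success the caller would forward the request body to the returned
upstream unchanged; on failure it would answer with `status()`.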
Streaming proxy works through the same path as chat completions —
the upstream Content-Type (`text/event-stream` for stream:true,
`application/json` otherwise) propagates through proxy_with_metrics
unchanged. Live-stream integration tests against a streaming mock are
deferred until we exercise the path against a real neuron, since
the chat-completions streaming test already covers the proxy's
SSE forwarding mechanics.
Three new tests; clippy + fmt clean across the workspace.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Operators can now define tier aliases in models.toml:

    [aliases]
    "helexa/small" = "Qwen/Qwen3-1.7B"
    "helexa/balanced" = "Qwen/Qwen3-8B"
    "helexa/large" = "Qwen/Qwen3.6-27B"

A client request for `model: "helexa/small"` is resolved to the concrete
model id at routing time. The gateway also rewrites the proxied body's
`model` field to the concrete id so neuron sees a name that matches its
loaded handle (otherwise the harness rejects the request).
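
A minimal sketch of the alias lookup, assuming the catalogue holds
nothing but the alias table. The struct here is a stand-in for the real
`ModelCatalogue`, and the explicit lifetime is one possible reading of
the `resolve_alias(&str) -> &str` signature.

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the catalogue; only the alias table is modelled.
struct ModelCatalogue {
    aliases: HashMap<String, String>,
}

impl ModelCatalogue {
    /// Return the concrete model id for an alias, or the input unchanged
    /// when it is not an alias (the no-op fall-through case).
    fn resolve_alias<'a>(&'a self, model: &'a str) -> &'a str {
        self.aliases.get(model).map(String::as_str).unwrap_or(model)
    }
}
```

Returning the input when no alias matches means callers can resolve
every request unconditionally, without checking first whether the name
is an alias.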
Motivated by the finger-in-the-wind benchmark: the same "what's the
capital of Georgia" probe runs in 2.5s on the 1.7B vs 6.7s on the 27B with
identical correctness. Aliases let clients pick a latency tier without
hardcoding model ids, and let operators swap targets without changing
client code.
Changes:
* cortex-core: `ModelCatalogue` gains `aliases: HashMap<String, String>`
  and `resolve_alias(&str) -> &str`. Unit tests cover the basic
  resolution and the TOML round-trip.
* cortex-gateway:
  * `RouteDecision` gains `resolved_model_id: String`. `router::resolve`
    consumes aliases at entry and threads the concrete id through.
  * Handlers (chat_completions, completions, anthropic_messages
    streaming + non-streaming) rewrite the body's `model` field with
    `rewrite_model_in_body` before proxying, using the resolved id
    for metrics labels, LRU touch, and the body itself.
  * `/v1/models` (Pass 4) emits each alias as its own entry mirroring
    the target's `loaded` flag, feasible_on, and locations, so clients
    browsing the endpoint see both names and can pick either.
* `models.toml` declares the three tier aliases; `models.example.toml`
  documents the section as opt-in.
* Integration tests verify: end-to-end alias→concrete request flow,
  alias surfacing in /v1/models, and no-op fall-through for
  non-alias model ids.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The poller now fetches /health alongside /models on each neuron and
stashes the activation snapshot on NodeState. The /v1/models handler
gains a Pass 3 that synthesises Loading locations from each neuron's
activation.in_progress and activation.pending lists, so a catalogued
model that's mid-prewarm surfaces as `status: "loading"` rather than
appearing absent (loaded=false, locations=[]).
Without this, a client polling /v1/models during a beast restart sees
Qwen3.6-27B disappear for the ~5 minutes the q5k load takes, then
reappear. Now it stays visible the whole time with a clear status.
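
The status decision can be sketched roughly as below. Only
`ModelStatus::Loading` and the `in_progress` / `pending` field names
come from this change; the other variants, the `Activation` struct
layout, and `status_on_node` are assumptions made for illustration.

```rust
/// Illustrative subset of a neuron's /health activation snapshot.
struct Activation {
    in_progress: Vec<String>,
    pending: Vec<String>,
}

/// Variant names other than `Loading` are assumed.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ModelStatus {
    Loaded,
    Loading,
    Unloaded,
}

/// Status of `model` on one node: loaded wins, then loading if the model is
/// listed as in progress or pending, otherwise unloaded.
fn status_on_node(
    model: &str,
    loaded: &[String],
    activation: Option<&Activation>,
) -> ModelStatus {
    if loaded.iter().any(|m| m.as_str() == model) {
        return ModelStatus::Loaded;
    }
    let loading = activation.map_or(false, |a| {
        a.in_progress
            .iter()
            .chain(a.pending.iter())
            .any(|m| m.as_str() == model)
    });
    if loading {
        ModelStatus::Loading
    } else {
        ModelStatus::Unloaded
    }
}
```

A node whose /health has not been fetched yet (`None`) behaves as
before: the model is either loaded or absent.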
Adds ModelStatus::Loading to cortex-core. The router's per-node priority
loop gets an explicit (no-op) arm: Loading models aren't routable yet,
so requests fall through to the catalogue cold-load path. That is the
existing race: no worse than before, but tagged as a known follow-up
that needs neuron-side in-flight tracking on /models/load.
New test_poller_captures_activation_from_health exercises the full
round-trip: mock neuron with empty /models but a pre_warming /health
→ poller writes node.activation. Common test helpers gain
spawn_mock_neuron_with_models_and_health and default_health_response.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 1 of the candle-native pivot. Replaces the external-process
harness model (mistralrs over HTTP, llamacpp placeholder) with an
in-process Harness trait whose sole implementation is candle. The
trait keeps its shape so future engines slot in additively, but
start/stop default to no-ops and HarnessConfig drops endpoint and
systemd_unit since no harness needs external supervision.
Behaviour is unchanged on the wire: load_model returns a "not
implemented yet (Stage 2)" error and list_models is empty. The
gateway-side proxy, poller, and router are untouched.
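
For orientation, the Stage 1 shape might look like the sketch below.
Method names follow the text above, but the signatures, the `String`
error type, and the synchronous style are assumptions; the real trait
may be async and carry more methods.

```rust
/// Sketch of an in-process harness trait; signatures are assumed.
trait Harness {
    /// No external process to supervise, so starting is a no-op by default.
    fn start(&mut self) -> Result<(), String> {
        Ok(())
    }

    /// Likewise, stopping is a no-op by default.
    fn stop(&mut self) -> Result<(), String> {
        Ok(())
    }

    fn load_model(&mut self, model_id: &str) -> Result<(), String>;

    fn list_models(&self) -> Vec<String>;
}

/// The only implementation at this stage.
struct CandleHarness;

impl Harness for CandleHarness {
    fn load_model(&mut self, model_id: &str) -> Result<(), String> {
        // Stage 1 wires the trait only; loading arrives in Stage 2.
        Err(format!("loading {model_id}: not implemented yet (Stage 2)"))
    }

    fn list_models(&self) -> Vec<String> {
        Vec::new()
    }
}
```

Because `start` and `stop` have default bodies, a future engine only
has to implement `load_model` and `list_models` unless it needs
lifecycle hooks of its own.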
CLAUDE.md Phase 11 (llama.cpp) and Phase 12 (mistral.rs COPR) are
marked superseded; the staged plan lives in
~/.claude/plans/create-a-more-aggressive-calm-naur.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace NodeConfig (static vram_mb, pinned) with NeuronEndpoint.
Hardware discovery and model pinning now come from neuron API and
models.toml catalogue respectively.
- config.rs: nodes -> neurons, add models_config path
- catalogue.rs: ModelProfile with pinned_on, ModelCatalogue
- poller.rs: poll neuron GET /models (ModelInfo format)
- router.rs: resolve inference endpoint via neuron GET /models/{id}/endpoint
- evictor.rs: call neuron POST /models/unload
- node.rs: remove vram_mb, pinned fields (come from discovery/catalogue)
- All 22 gateway tests updated to mock neuron API
- Remove MistralModelsResponse, ModelLifecycleRequest (no longer needed)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Emit cortex_requests_total, cortex_request_duration_seconds,
cortex_request_errors_total, and cortex_cold_starts_total with
model and node labels on every proxied request.
Add install_test_recorder() for testing metrics without an HTTP listener.
Integration test verifies counters and histograms appear after proxy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire up openai_to_anthropic in the /v1/messages handler: buffer the
upstream OpenAI response, parse it, and translate it to Anthropic
format (stop_reason mapping, usage field names, content blocks).
5 integration tests covering round-trip translation, system prompt,
content blocks, and error cases.
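
A rough sketch of the two simplest parts of that translation. The
field names are those of the public OpenAI and Anthropic response
formats; the exact stop_reason table, the fallback arm, and the helper
names `map_stop_reason` / `map_usage` are assumptions rather than what
openai_to_anthropic necessarily does.

```rust
/// Map an OpenAI `finish_reason` to an Anthropic `stop_reason`.
/// The table is an assumption; this commit does not spell it out.
fn map_stop_reason(finish_reason: &str) -> &'static str {
    match finish_reason {
        "length" => "max_tokens",
        "tool_calls" => "tool_use",
        // "stop" and anything unrecognised are treated as a normal end of turn.
        _ => "end_turn",
    }
}

/// OpenAI-style usage counters.
struct OpenAiUsage {
    prompt_tokens: u64,
    completion_tokens: u64,
}

/// Anthropic-style usage counters.
#[derive(Debug, PartialEq)]
struct AnthropicUsage {
    input_tokens: u64,
    output_tokens: u64,
}

/// Rename the usage fields; the counts themselves carry over unchanged.
fn map_usage(u: &OpenAiUsage) -> AnthropicUsage {
    AnthropicUsage {
        input_tokens: u.prompt_tokens,
        output_tokens: u.completion_tokens,
    }
}
```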
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract public poll_once() from poll_loop() for testability.
4 tests proving the poller correctly discovers models, updates
gateway state, marks unreachable nodes unhealthy, and prunes
stale models.
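
The state update that a single poll performs can be sketched as below.
`NodeState` here is a two-field stand-in and `apply_poll` is a
hypothetical helper; the real poll_once() also does the HTTP fetch, and
what it does with the model list of an unreachable node is not stated
here, so this sketch simply leaves the last known list in place.

```rust
use std::collections::HashMap;

/// Minimal stand-in for per-node gateway state.
#[derive(Debug, Default)]
struct NodeState {
    healthy: bool,
    models: Vec<String>,
}

/// Apply one poll result to the state. `Ok` carries the model ids the node
/// reported; `Err` means the node could not be reached.
fn apply_poll(
    state: &mut HashMap<String, NodeState>,
    node: &str,
    result: Result<Vec<String>, String>,
) {
    let entry = state.entry(node.to_string()).or_default();
    match result {
        Ok(models) => {
            entry.healthy = true;
            // Replacing the list prunes models the node no longer reports.
            entry.models = models;
        }
        Err(_) => {
            // Assumption: only the health flag flips on failure.
            entry.healthy = false;
        }
    }
}
```

Separating the state update from the polling loop is what lets a test
drive one poll at a time and assert on the result.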
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Confirms the existing proxy streams SSE chunks incrementally:
- 5-chunk test with 50ms delays verifies time spread between first
and last chunk arrival (not buffered)
- Verifies data: [DONE] terminator is forwarded
No src/ changes needed — Body::from_stream(bytes_stream()) already
handles SSE correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6 tests proving the scaffold works end-to-end:
- chat completion proxied through gateway to mock backend
- /health endpoint with healthy node
- /v1/models returns seeded model list
- 404 for unknown model
- 404 when no healthy nodes available
- 400 when request body missing model field
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>