Stage 6b. Third provider impl, completing the wire-format trio
(openai-chat, openai-responses, anthropic-messages). Lets a
helexa-acp endpoint configured with `wire_api = "anthropic-messages"`
drive Claude models — either against Anthropic directly or via
cortex's /v1/messages translation surface.
## Encoder (CompletionRequest → Anthropic body)
- System messages flatten to the top-level `system` field
(concatenated with blank lines when there are multiple).
- User text → `{role:"user", content:"..."}`.
- User MultiPart (text + images) → `content` array with Anthropic's
distinct image shape: `{type:"image", source:{type:"base64",
media_type, data}}` — structurally different from OpenAI's
`image_url` data URI.
- Assistant text → `{role:"assistant", content:"..."}`.
- Assistant tool_calls → `content` array with optional `{type:"text"}`
block plus one `{type:"tool_use", id, name, input:<parsed json>}`
per call. The internal arguments JSON string is parsed back to a
Value before encoding (Anthropic requires the parsed form);
malformed JSON falls back to a String input so the request body
still serialises.
- Tool result → `{role:"user", content:[{type:"tool_result",
tool_use_id, content}]}` per Anthropic's convention (no separate
`tool` role).
- `max_tokens` is required by Anthropic; defaults to 8192 when the
request doesn't specify.
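
Two of the encoder rules above (system-message flattening and the `max_tokens` default) can be sketched in a few lines. This is a minimal illustration only: the `Message` enum and both function names are hypothetical stand-ins, not the provider's actual types.

```rust
/// Hypothetical, simplified message type; the real CompletionRequest is richer.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Message {
    System(String),
    User(String),
    Assistant(String),
}

/// Default applied when the request leaves `max_tokens` unset (value from the text above).
const DEFAULT_MAX_TOKENS: u32 = 8192;

/// Joins every system message with a blank line; returns `None` when there are none.
fn flatten_system(messages: &[Message]) -> Option<String> {
    let parts: Vec<&str> = messages
        .iter()
        .filter_map(|m| match m {
            Message::System(text) => Some(text.as_str()),
            _ => None,
        })
        .collect();
    if parts.is_empty() {
        None
    } else {
        Some(parts.join("\n\n"))
    }
}

/// Anthropic requires `max_tokens`, so fall back to the default when it is absent.
fn effective_max_tokens(requested: Option<u32>) -> u32 {
    requested.unwrap_or(DEFAULT_MAX_TOKENS)
}
```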
## Decoder (Anthropic SSE → CompletionEvent)
Named SSE events:
- `message_start` → captures input_tokens from `usage` for the
eventual UsageStats.
- `content_block_start` (type=text) → TextDelta (initial text, if any).
- `content_block_start` (type=tool_use) → ToolCallStart; if a
pre-buffered `input` is present, also emits a single
ToolCallArgsDelta.
- `content_block_start` (type=thinking, for extended-thinking
models) → ReasoningDelta.
- `content_block_delta` (text_delta) → TextDelta.
- `content_block_delta` (input_json_delta) → ToolCallArgsDelta,
correlated by block index.
- `content_block_delta` (thinking_delta) → ReasoningDelta.
- `message_delta` → Usage (final output_tokens) + Finish with
stop_reason mapped: end_turn/stop_sequence → "stop", max_tokens
→ "length", tool_use → "tool_calls".
- `message_stop` → stream terminates.
- `ping` ignored (Anthropic's keep-alive).
- `error` → yields Err and ends the stream.
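
The event handling above might look roughly like the following sketch. The function and enum names are hypothetical; returning `None` for an unrecognised `stop_reason` and decoding any unlisted event name are assumptions of this sketch, not documented behaviour.

```rust
/// Maps Anthropic `stop_reason` values to the internal finish reason.
/// Returning `None` for unrecognised values is an assumption of this sketch.
fn map_stop_reason(stop_reason: &str) -> Option<&'static str> {
    match stop_reason {
        "end_turn" | "stop_sequence" => Some("stop"),
        "max_tokens" => Some("length"),
        "tool_use" => Some("tool_calls"),
        _ => None,
    }
}

/// Splits one SSE frame (the lines between blank lines) into
/// (event name, data payload). Frames without an `event:` line yield `None`.
fn parse_sse_frame(frame: &str) -> Option<(String, String)> {
    let mut event = None;
    let mut data: Vec<&str> = Vec::new();
    for line in frame.lines() {
        if let Some(rest) = line.strip_prefix("event:") {
            event = Some(rest.trim().to_string());
        } else if let Some(rest) = line.strip_prefix("data:") {
            // SSE strips at most one leading space from the field value.
            data.push(rest.strip_prefix(' ').unwrap_or(rest));
        }
    }
    event.map(|name| (name, data.join("\n")))
}

/// What the decoder does with a named event before looking at its payload.
#[derive(Debug, PartialEq)]
enum Action {
    Skip,      // `ping` keep-alives
    Terminate, // `message_stop`
    Fail,      // `error`: yield Err and end the stream
    Decode,    // everything else: inspect the JSON payload
}

fn classify(event: &str) -> Action {
    match event {
        "ping" => Action::Skip,
        "message_stop" => Action::Terminate,
        "error" => Action::Fail,
        _ => Action::Decode,
    }
}
```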
## Wiring
- Authentication: `x-api-key` + `anthropic-version: 2023-06-01`
headers (not Bearer). Both ship when api_key is configured;
servers that don't care (cortex) ignore them.
- `WireApi::AnthropicMessages` in build_provider now constructs
the provider instead of erroring "reserved for future".
- `provider::mod.rs` registers the new module.
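
As a rough illustration of the authentication rule above, the header set could be built like this. `auth_headers` is a hypothetical helper, and sending no headers at all when no key is configured is this sketch's reading of "both ship when api_key is configured".

```rust
/// API version pinned by the provider (value from the text above).
const ANTHROPIC_VERSION: &str = "2023-06-01";

/// Returns the headers to attach to a request. Both headers are sent
/// together when a key is configured; otherwise none are added.
fn auth_headers(api_key: Option<&str>) -> Vec<(&'static str, String)> {
    match api_key {
        Some(key) => vec![
            ("x-api-key", key.to_string()),
            ("anthropic-version", ANTHROPIC_VERSION.to_string()),
        ],
        None => Vec::new(),
    }
}
```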
18 new unit tests: encoder (system collapse, multi-system concat,
default max_tokens, multipart with image, tool_use blocks, tool
results, malformed JSON arg fallback), decoder (text streaming,
tool_use lifecycle, max_tokens→length mapping, empty deltas, ping
events, error events, cancellation, malformed payload skip,
thinking blocks).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cortex
A Rust reverse-proxy and fleet management layer for multi-node GPU inference
clusters. Cortex sits in front of one or more neuron daemons (each running
candle-based inference on a local GPU host) and presents a unified OpenAI +
Anthropic compatible API surface.
Problem
Running local LLMs across multiple GPU nodes (different VRAM tiers, different model affinities) requires a unified API surface that:
- Presents a single `/v1/models` catalogue merging every model that can be served by any neuron in the fleet.
- Routes requests to the correct node based on where a model is loaded (or can be loaded), handling cold-load and eviction transparently.
- Manages model lifecycle (load on demand, unload cold models, pin critical ones) by calling each neuron's `/models/{load,unload}` API.
- Translates between OpenAI and Anthropic request/response envelopes so every client speaks whichever dialect it prefers.
- Captures per-request metrics (tokens, tok/s, TTFT, latency) and exposes them as Prometheus counters/histograms.
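
The routing requirement in the list above can be illustrated with a small sketch: prefer a node where the model is already loaded, otherwise pick one that could load it. All types and names here (`Node`, `Route`, `route`, `node`) are hypothetical, and the first-match tie-breaking is an assumption; the actual router may weigh VRAM, affinity, or eviction cost.

```rust
use std::collections::HashSet;

/// Hypothetical view of one neuron as the router sees it.
struct Node {
    name: String,
    loaded: HashSet<String>,   // models currently resident on the node
    loadable: HashSet<String>, // models the node could load on demand
}

/// Convenience constructor for the sketch.
fn node(name: &str, loaded: &[&str], loadable: &[&str]) -> Node {
    Node {
        name: name.to_string(),
        loaded: loaded.iter().map(|m| m.to_string()).collect(),
        loadable: loadable.iter().map(|m| m.to_string()).collect(),
    }
}

#[derive(Debug, PartialEq)]
enum Route<'a> {
    Warm(&'a str),     // model already loaded on this node
    ColdLoad(&'a str), // node must load the model first
    Unavailable,       // no node in the fleet can serve it
}

/// Picks the first warm node, then the first node able to cold-load.
fn route<'a>(nodes: &'a [Node], model: &str) -> Route<'a> {
    if let Some(n) = nodes.iter().find(|n| n.loaded.contains(model)) {
        return Route::Warm(&n.name);
    }
    if let Some(n) = nodes.iter().find(|n| n.loadable.contains(model)) {
        return Route::ColdLoad(&n.name);
    }
    Route::Unavailable
}
```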
Architecture
┌──────────────┐  ┌──────────┐  ┌────────────┐  ┌────────────┐
│ Claude Code  │  │ Zed/IDE  │  │ Tidal / mm │  │ curl / etc │
└──────┬───────┘  └─────┬────┘  └──────┬─────┘  └──────┬─────┘
       │                │              │               │
       └────────────────┴──────┬───────┴───────────────┘
                               │
                    ┌──────────▼──────────┐
                    │       cortex        │
                    │  (cortex-gateway)   │
                    │                     │
                    │  Router · Metrics   │
                    │  Evictor · Translate│
                    └──┬──────┬────────┬──┘
                       │      │        │
            ┌──────────▼┐  ┌──▼─────┐ ┌▼──────────┐
            │  neuron   │  │ neuron │ │  neuron   │
            │  :13131   │  │ :13131 │ │  :13131   │
            │  candle   │  │ candle │ │  candle   │
            └───────────┘  └────────┘ └───────────┘
                  private network (.internal)
Crates
| Crate | Purpose |
|---|---|
| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
| `neuron` | Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API |
| `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) |
Node setup
Each GPU node runs neuron (listening on :13131). Neuron uses
huggingface/candle for in-process inference — there is no external
inference subprocess to manage.
Inside the daemon, every CUDA device gets one dedicated OS thread
(named cuda-dev-N) that owns the device's CUDA context for the
daemon's lifetime. Model loads, forward passes, KV-cache resets,
NCCL collectives, VRAM queries, and unloads all route through that
thread via a job channel; tensors never escape it alive. This pins
context binding to a known thread, makes the CUDA Drop contract
structurally safe, and isolates driver-error poisoning to one worker
rather than the whole process. See CLAUDE.md for the design
rationale and crates/neuron/src/harness/device_worker/ for the code.
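
The one-thread-per-device pattern described above can be sketched with the standard library alone. This is not the code in `device_worker/`: `DeviceState`, `DeviceWorker`, and the closure-based job type are hypothetical stand-ins, and no CUDA calls are made. The point is only that state is created, used, and dropped on the named thread, while callers receive plain `Send` results.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for the CUDA context and resident tensors (hypothetical).
struct DeviceState {
    ordinal: usize,
    loaded_models: Vec<String>,
}

/// A job is a closure executed on the worker thread with exclusive state access.
type Job = Box<dyn FnOnce(&mut DeviceState) + Send + 'static>;

struct DeviceWorker {
    jobs: mpsc::Sender<Job>,
    handle: thread::JoinHandle<()>,
}

impl DeviceWorker {
    /// Spawns the dedicated thread, named `cuda-dev-N` as in the text above.
    fn spawn(ordinal: usize) -> std::io::Result<Self> {
        let (jobs, rx) = mpsc::channel::<Job>();
        let handle = thread::Builder::new()
            .name(format!("cuda-dev-{ordinal}"))
            .spawn(move || {
                // The state lives and dies on this thread only.
                let mut state = DeviceState { ordinal, loaded_models: Vec::new() };
                for job in rx {
                    job(&mut state);
                }
            })?;
        Ok(Self { jobs, handle })
    }

    /// Runs `f` on the device thread and hands back only its `Send` result.
    fn run<R, F>(&self, f: F) -> R
    where
        R: Send + 'static,
        F: FnOnce(&mut DeviceState) -> R + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        self.jobs
            .send(Box::new(move |state: &mut DeviceState| {
                let _ = tx.send(f(state));
            }))
            .expect("device worker has exited");
        rx.recv().expect("device worker dropped the job")
    }

    /// Closing the job channel ends the worker loop; then wait for the thread.
    fn shutdown(self) {
        drop(self.jobs);
        self.handle.join().expect("device worker panicked");
    }
}
```

Because `run` requires the result type to be `Send + 'static`, a borrow of the worker's state cannot be returned to the caller, which loosely mirrors the "tensors never escape it alive" rule.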
The neuron RPM (helexa-neuron) ships a systemd unit:
dnf copr enable helexa/helexa
dnf install helexa-neuron
systemctl enable --now neuron
Gateway config
# /etc/cortex/cortex.toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru" # lru | priority
defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
[[neurons]]
name = "benjy"
endpoint = "http://benjy.internal:13131"
Model placement profiles live in models.toml — see models.example.toml.
Building
cargo build --release
CI
Every push triggers format, lint, and test checks. Ensure these pass locally before pushing:
cargo fmt --check --all # must be clean
cargo clippy --workspace -- -D warnings # warnings are errors
cargo test --workspace # all tests must pass
Tagged releases (v*) additionally build SRPMs for both cortex and
helexa-neuron and publish to COPR.
Running
# start the gateway
cortex serve --config /etc/cortex/cortex.toml
# check fleet status
cortex status
# list all models across nodes
curl http://localhost:31313/v1/models
License
GPL-3.0