Two changes addressing operator visibility into TP inference + the
HTTP-cancellation poisoning chain:
1. `chat_completion_tp` now runs its body inside `tokio::spawn`. When
the HTTP client disconnects (curl --max-time, browser nav, etc.)
the future returned from `chat_completion_tp` gets dropped, but
the spawned task keeps running to completion, finishing every
`pool.generate_step` / `pool.clear_kv_cache` call so the worker
pipes are drained. The next inference request then finds a clean pool.
Previously, the dropped future left workers still processing the
in-flight request, so the next call's `ClearKvCache` recv would
read the stale `GenerateStepOk` from the abandoned step ("rank N
expected KvCacheCleared, got GenerateStepOk"). The
drain-on-leader-error fix from d1a4aad covered Rust-side leader
failures but not HTTP-layer cancellation, which is what we actually
hit on the user's Qwen3.6 test.
2. Tracing throughout the TP path so journalctl shows where an
inference spends its time without needing to surface harness
internals via the HTTP error body:
- `chat_completion_tp_inner` (now a free fn so it can run inside
spawn): `info` at request start (prompt_len, max_new, temp,
top_p, eos_id), `info` per major phase (prefill complete with
elapsed_ms, decode complete with elapsed_ms + token count),
`info` at completion (total_ms, finish_reason). `debug` for
pool-lock acquisition + kv-cache clear timing. `trace` per
decode step (next_token, step_ms).
- `WorkerPool::generate_step` (leader side): `debug` at fan-out,
`debug` after leader forward returns with elapsed_ms + ok flag,
`debug` after drain with errors count + total_ms.
- `WorkerPool::clear_kv_cache`: matching `debug` at fan-out + drain.
- `worker::handle_generate_step`: `debug` at forward start + done
with elapsed_ms, `warn` on forward failure with the full error.
The default log filter is already `info,neuron=debug` so the
operator gets every `info` and `debug` line by default; `trace`
needs RUST_LOG=trace for per-step decode timing.
Stage 7c-ii crash-detection is still future work; this is the
minimum that makes the "where did the 120s go" question answerable
from the logs.
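
The poisoning chain from item 1 can be reproduced in miniature. The sketch below uses only the standard library, with `std::sync::mpsc` channels standing in for the worker pipes; `Request`, `Reply`, `spawn_worker`, and the two scenario functions are hypothetical names for illustration, not the neuron types. It models only the pipe ordering: the actual change gets the drain by running the body in `tokio::spawn`, which this sketch does not attempt to show.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

#[derive(Debug)]
enum Request {
    GenerateStep,
    ClearKvCache,
}

#[derive(Debug)]
enum Reply {
    GenerateStepOk,
    KvCacheCleared,
}

/// Spawn a worker that answers every request with its matching reply, in order.
fn spawn_worker() -> (Sender<Request>, Receiver<Reply>) {
    let (req_tx, req_rx) = channel::<Request>();
    let (rep_tx, rep_rx) = channel::<Reply>();
    thread::spawn(move || {
        for req in req_rx {
            let reply = match req {
                Request::GenerateStep => Reply::GenerateStepOk,
                Request::ClearKvCache => Reply::KvCacheCleared,
            };
            if rep_tx.send(reply).is_err() {
                break; // leader went away
            }
        }
    });
    (req_tx, rep_rx)
}

/// Send ClearKvCache and check that the next reply on the pipe matches it.
fn clear_kv_cache(tx: &Sender<Request>, rx: &Receiver<Reply>) -> Result<(), String> {
    tx.send(Request::ClearKvCache).map_err(|e| e.to_string())?;
    match rx.recv().map_err(|e| e.to_string())? {
        Reply::KvCacheCleared => Ok(()),
        other => Err(format!("expected KvCacheCleared, got {other:?}")),
    }
}

/// Abandoned step: the request is sent but its reply is never read
/// (the dropped-future case), so the next call reads a stale reply.
fn abandoned_step_then_clear() -> Result<(), String> {
    let (tx, rx) = spawn_worker();
    tx.send(Request::GenerateStep).map_err(|e| e.to_string())?;
    clear_kv_cache(&tx, &rx)
}

/// Drained step: the reply is consumed before the next request is sent.
fn drained_step_then_clear() -> Result<(), String> {
    let (tx, rx) = spawn_worker();
    tx.send(Request::GenerateStep).map_err(|e| e.to_string())?;
    let _step_reply = rx.recv().map_err(|e| e.to_string())?;
    clear_kv_cache(&tx, &rx)
}
```

Calling `abandoned_step_then_clear()` yields the same shape of error quoted above, while `drained_step_then_clear()` succeeds because the pipe is empty when `ClearKvCache` is sent.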
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cortex
A Rust reverse-proxy and fleet management layer for multi-node GPU inference
clusters. Cortex sits in front of one or more neuron daemons (each running
candle-based inference on a local GPU host) and presents a unified OpenAI +
Anthropic compatible API surface.
Problem
Running local LLMs across multiple GPU nodes (different VRAM tiers, different model affinities) requires a unified API surface that:
- Presents a single /v1/models catalogue merging every model that can be served by any neuron in the fleet.
- Routes requests to the correct node based on where a model is loaded (or can be loaded), handling cold-load and eviction transparently.
- Manages model lifecycle (load on demand, unload cold models, pin critical ones) by calling each neuron's /models/{load,unload} API.
- Translates between OpenAI and Anthropic request/response envelopes so every client speaks whichever dialect it prefers.
- Captures per-request metrics (tokens, tok/s, TTFT, latency) and exposes them as Prometheus counters/histograms.
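
The catalogue and routing requirements above can be sketched as a set union plus a two-pass selection: prefer a neuron that already has the model loaded, otherwise fall back to one that could load it. The `Neuron` struct, its fields, `catalogue`, and `pick_neuron` are hypothetical illustrations rather than the cortex-gateway types, and the fallback policy (first capable node wins) is an assumption.

```rust
use std::collections::BTreeSet;

/// Hypothetical per-node view as a gateway might track it.
struct Neuron {
    name: &'static str,
    loaded: Vec<&'static str>,   // models currently resident on the node
    loadable: Vec<&'static str>, // models the node could cold-load
}

/// Merge every model any neuron can serve into one sorted,
/// de-duplicated catalogue.
fn catalogue(fleet: &[Neuron]) -> Vec<&'static str> {
    let mut all = BTreeSet::new();
    for n in fleet {
        all.extend(n.loaded.iter().copied());
        all.extend(n.loadable.iter().copied());
    }
    all.into_iter().collect()
}

/// Pick a node for `model`: warm nodes first, then nodes needing a cold load.
/// Returns the node name and whether a cold load is required.
fn pick_neuron(fleet: &[Neuron], model: &str) -> Option<(&'static str, bool)> {
    if let Some(n) = fleet.iter().find(|n| n.loaded.iter().any(|m| *m == model)) {
        return Some((n.name, false));
    }
    fleet
        .iter()
        .find(|n| n.loadable.iter().any(|m| *m == model))
        .map(|n| (n.name, true))
}
```

A request for a model that is warm on one node and merely loadable on another goes to the warm node; a model that is loaded nowhere goes to the first node able to load it, with the cold-load flag set.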
Architecture
┌──────────────┐ ┌──────────┐ ┌────────────┐ ┌────────────┐
│ Claude Code │ │ Zed/IDE │ │ Tidal / mm │ │ curl / etc │
└──────┬───────┘ └─────┬────┘ └──────┬─────┘ └──────┬─────┘
│ │ │ │
└────────────────┴──────┬───────┴───────────────┘
│
┌──────────▼──────────┐
│ cortex │
│ (cortex-gateway) │
│ │
│ Router · Metrics │
│ Evictor · Translate│
└──┬──────┬────────┬──┘
│ │ │
┌──────────▼┐ ┌──▼─────┐ ┌▼──────────┐
│ neuron │ │ neuron │ │ neuron │
│ :13131 │ │ :13131 │ │ :13131 │
│ candle │ │ candle │ │ candle │
└───────────┘ └────────┘ └───────────┘
private network (.internal)
Crates
| Crate | Purpose |
|---|---|
| cortex-core | Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
| cortex-gateway | Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
| neuron | Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API |
| cortex-cli | CLI entrypoint (cortex serve, cortex status, etc.) |
Node setup
Each GPU node runs neuron (listening on :13131). Neuron uses
huggingface/candle for in-process inference — there is no external
inference subprocess to manage.
The neuron RPM (helexa-neuron) ships a systemd unit:
dnf copr enable helexa/helexa
dnf install helexa-neuron
systemctl enable --now neuron
Gateway config
# /etc/cortex/cortex.toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru" # lru | priority
defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
[[neurons]]
name = "benjy"
endpoint = "http://benjy.internal:13131"
Model placement profiles live in models.toml — see models.example.toml.
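
As an illustration of what `strategy = "lru"` in the `[eviction]` table implies, the sketch below picks the least recently used unpinned model as the eviction victim. `LoadedModel`, `last_used`, `pinned`, and `pick_victim` are hypothetical names; the actual evictor may weigh other factors such as VRAM footprint.

```rust
/// Hypothetical record of a model resident on a node.
struct LoadedModel {
    id: &'static str,
    last_used: u64, // monotonic tick of the most recent request
    pinned: bool,   // pinned models are never evicted
}

/// Return the id of the least recently used unpinned model, if any.
fn pick_victim(models: &[LoadedModel]) -> Option<&'static str> {
    models
        .iter()
        .filter(|m| !m.pinned)
        .min_by_key(|m| m.last_used)
        .map(|m| m.id)
}
```

If every resident model is pinned, `pick_victim` returns `None` and nothing is evicted.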
Building
cargo build --release
CI
Every push triggers format, lint, and test checks. Ensure these pass locally before pushing:
cargo fmt --check --all # must be clean
cargo clippy --workspace -- -D warnings # warnings are errors
cargo test --workspace # all tests must pass
Tagged releases (v*) additionally build SRPMs for both cortex and
helexa-neuron and publish to COPR.
Running
# start the gateway
cortex serve --config /etc/cortex/cortex.toml
# check fleet status
cortex status
# list all models across nodes
curl http://localhost:31313/v1/models
License
GPL-3.0