Stage B of the vision plan (doc/vision-qwen3_6-spec.md). Wires the vision tower from Stage A through to a complete non-streaming chat completion: extract images from the request, preprocess, encode on the worker thread, splice embeddings into the LM input at `<|image_pad|>` positions, return coherent text response with `prompt_tokens` reflecting patch tokens.

Closes the silent-drop class of failures from issue #3 — vision requests against Qwen3.6 now condition the model on the image instead of producing confident text-only hallucinations.

Streaming for vision is Stage C. Deferred items tracked under #12 (TP-vision), #13 (27B production), #14 (dynamic resolution), #15 (numerical validation).

What landed:

- **B1 — `Qwen3_5Model::forward_with_vision`**: text-only `forward` unchanged; new method takes `(input_ids, offset, image_embeds, image_token_id)`, embeds tokens, locates `image_token_id` positions, splices via the new `splice_runs` helper. MRoPE applies text-positions to image tokens for Stage B (spatial MRoPE is the issue #15 numerical-validation follow-up). 2 unit tests for `splice_runs` covering contiguous + non-contiguous runs.
- **B2 — `ModelArch::forward_with_vision` dispatch**: routes Qwen3_5Dense to the new method; other arches return an error. Defence-in-depth — the HTTP layer (B6) already rejects image content for non-vision models.
- **B3 — `Job::ForwardLogitsWithImages`**: new worker variant carrying tokens + per-image `(pixels, c, h, w)` payloads. The dispatcher encodes each image (device-resident), concatenates the resulting embeddings, calls `arch.forward_with_vision`, and returns CPU logits. Image embeddings never copy back to CPU — the "tensors don't escape the worker" invariant from the per-device worker refactor still holds. Poisoned-worker drain path handles the new variant.
- **B4 — Prompt builder**:
  - `request_has_images` detects image content cheaply.
  - `extract_images_from_request(request, profile)` walks `MessageContent::Parts`, decodes data URIs, runs `harness::preprocess::preprocess` per image, returns `Vec<ImageInput>` in request order.
  - `expand_image_pad_tokens(input_ids, image_token_id, patches_per_image)` walks the tokenized prompt and replaces each `<|image_pad|>` (id 248056 for Qwen3.6) with N copies matching the per-image patch count. 4 unit tests.
  - `VisionMeta::from_config_path` peeks `config.json` at load time for `image_token_id`, vision_config patch/merge sizes, and derives `lm_tokens_per_image` for the Stage B fixed resolution.
- **B5 — `chat_completion` vision routing**: detects image content, validates the loaded model has vision, expands the prompt, and calls a new `run_inference_with_images_via_worker` helper that does single-shot prefill + standard decode loop (KV cache holds the post-splice hidden states from prefill, so decode steps don't re-splice). Stage B skips chunked prefill for vision — at 448×448 fixed resolution the budget stays well under the activation-memory threshold. Long-vision chunking is Stage D follow-up.
- **B6 — `InferenceError::VisionUnsupported`**: structured 400 with `code=vision_unsupported, model_id, suggestion` when an image request hits a non-vision model. Closes the agent0 failure mode where vision requests degraded silently.
- **B7 — `ModelInfo.capabilities`**: per-model array (`["text"]` vs `["text", "vision"]`) in `/v1/models` and forwarded verbatim by cortex-gateway. Lets clients (litellm, agent0) gate image_url submission on the declared capability set. Optional in the wire format; defaults to empty for older clients.

CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (all 28 test groups ok, 124 lib tests). New unit-test counts: +2 splice_runs, +4 expand_image_pad.
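The B4 token expansion can be sketched roughly as below. This is an illustration, not the code that landed: the function bodies, the `u32` token type, and the error-on-mismatch behaviour are assumptions. The patch size of 16 and merge size of 2 used in the usage note are also assumptions, picked only because they reproduce the 196 tokens per 448×448 image quoted in this message; the real values are read from `config.json`.

```rust
/// Number of LM-side tokens one image occupies at a fixed square resolution.
/// Assumption: the vision tower cuts the image into `patch_size` x `patch_size`
/// patches, then merges `merge_size` x `merge_size` neighbouring patches into
/// a single LM token.
fn lm_tokens_per_image(image_size: usize, patch_size: usize, merge_size: usize) -> usize {
    let patches_per_side = image_size / patch_size;
    let merged_per_side = patches_per_side / merge_size;
    merged_per_side * merged_per_side
}

/// Replace the i-th occurrence of `image_token_id` with `patches_per_image[i]`
/// copies of itself, leaving every other token untouched.
/// Assumption: a mismatch between placeholder count and image count is an error.
fn expand_image_pad_tokens(
    input_ids: &[u32],
    image_token_id: u32,
    patches_per_image: &[usize],
) -> Result<Vec<u32>, String> {
    let placeholders = input_ids.iter().filter(|&&id| id == image_token_id).count();
    if placeholders != patches_per_image.len() {
        return Err(format!(
            "prompt has {} image placeholders but {} images were supplied",
            placeholders,
            patches_per_image.len()
        ));
    }

    let expanded: usize = patches_per_image.iter().sum();
    let mut out = Vec::with_capacity(input_ids.len() - placeholders + expanded);
    let mut counts = patches_per_image.iter();
    for &id in input_ids {
        if id == image_token_id {
            // The length check above guarantees one count per placeholder.
            let n = *counts.next().expect("count checked above");
            out.extend(std::iter::repeat(id).take(n));
        } else {
            out.push(id);
        }
    }
    Ok(out)
}
```

Under those assumptions, `lm_tokens_per_image(448, 16, 2)` gives 14 × 14 = 196, and expanding a prompt with one placeholder by that count grows it by 195 tokens, which is why `prompt_tokens` ends up above 196 once the text tokens are added.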
Manual verification (after RPMs deploy on beast):

    curl http://hanzalova.internal:31313/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d "{\"model\":\"Qwen/Qwen3.6-27B\",
           \"messages\":[{\"role\":\"user\",\"content\":[
             {\"type\":\"text\",\"text\":\"What's in this image?\"},
             {\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64,...\"}}
           ]}],
           \"max_tokens\":120}" | jq

Expect prompt_tokens > 196 (text + 196 patch tokens) and a response that references actual image content.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cortex
A Rust reverse-proxy and fleet management layer for multi-node GPU inference
clusters. Cortex sits in front of one or more neuron daemons (each running
candle-based inference on a local GPU host) and presents a unified OpenAI +
Anthropic compatible API surface.
Problem
Running local LLMs across multiple GPU nodes (different VRAM tiers, different model affinities) requires a unified API surface that:
- Presents a single `/v1/models` catalogue merging every model that can be served by any neuron in the fleet.
- Routes requests to the correct node based on where a model is loaded (or can be loaded), handling cold-load and eviction transparently.
- Manages model lifecycle — load on demand, unload cold models, pin critical ones — by calling each neuron's `/models/{load,unload}` API.
- Translates between OpenAI and Anthropic request/response envelopes so every client speaks whichever dialect it prefers.
- Captures per-request metrics (tokens, tok/s, TTFT, latency) and exposes them as Prometheus counters/histograms.
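The first two requirements (a merged catalogue and placement-aware routing) can be illustrated with a small sketch. Everything here is hypothetical: `NodeState`, `Route`, `merged_catalogue`, and `route` are names invented for the example, and the real router in cortex-gateway may weigh VRAM, pinning, or eviction cost in ways this does not.

```rust
use std::collections::BTreeMap;

/// Hypothetical snapshot of one neuron as seen by the gateway.
struct NodeState {
    name: String,
    /// Models currently resident on the node.
    loaded: Vec<String>,
    /// Models the node could load on demand.
    available: Vec<String>,
}

/// Merge per-node model lists into one catalogue: model id -> nodes able to serve it.
fn merged_catalogue(nodes: &[NodeState]) -> BTreeMap<String, Vec<String>> {
    let mut catalogue: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for node in nodes {
        for model in node.loaded.iter().chain(node.available.iter()) {
            let servers = catalogue.entry(model.clone()).or_default();
            // A model listed as both loaded and available is recorded once per node.
            if !servers.contains(&node.name) {
                servers.push(node.name.clone());
            }
        }
    }
    catalogue
}

/// Outcome of a routing decision.
#[derive(Debug, PartialEq)]
enum Route<'a> {
    /// The model is already loaded on this node.
    Warm(&'a str),
    /// The model must be loaded on this node before serving.
    ColdLoad(&'a str),
    /// No node in the fleet can serve the model.
    Unknown,
}

/// Prefer a node where the model is already loaded; otherwise fall back to
/// the first node that could load it.
fn route<'a>(nodes: &'a [NodeState], model: &str) -> Route<'a> {
    if let Some(node) = nodes.iter().find(|n| n.loaded.iter().any(|m| m == model)) {
        return Route::Warm(&node.name);
    }
    if let Some(node) = nodes.iter().find(|n| n.available.iter().any(|m| m == model)) {
        return Route::ColdLoad(&node.name);
    }
    Route::Unknown
}
```

Keeping the warm and cold-load cases as distinct variants lets the caller decide whether to trigger a load (and possibly an eviction) before proxying the request.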
Architecture
┌──────────────┐ ┌──────────┐ ┌────────────┐ ┌────────────┐
│ Claude Code │ │ Zed/IDE │ │ Tidal / mm │ │ curl / etc │
└──────┬───────┘ └─────┬────┘ └──────┬─────┘ └──────┬─────┘
│ │ │ │
└────────────────┴──────┬───────┴───────────────┘
│
┌──────────▼──────────┐
│ cortex │
│ (cortex-gateway) │
│ │
│ Router · Metrics │
│ Evictor · Translate│
└──┬──────┬────────┬──┘
│ │ │
┌──────────▼┐ ┌──▼─────┐ ┌▼──────────┐
│ neuron │ │ neuron │ │ neuron │
│ :13131 │ │ :13131 │ │ :13131 │
│ candle │ │ candle │ │ candle │
└───────────┘ └────────┘ └───────────┘
private network (.internal)
Crates
| Crate | Purpose |
|---|---|
| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
| `neuron` | Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API |
| `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) |
Node setup
Each GPU node runs neuron (listening on :13131). Neuron uses
huggingface/candle for in-process inference — there is no external
inference subprocess to manage.
Inside the daemon, every CUDA device gets one dedicated OS thread
(named cuda-dev-N) that owns the device's CUDA context for the
daemon's lifetime. Model loads, forward passes, KV-cache resets,
NCCL collectives, VRAM queries, and unloads all route through that
thread via a job channel; tensors never escape it alive. This pins
context binding to a known thread, makes the CUDA Drop contract
structurally safe, and isolates driver-error poisoning to one worker
rather than the whole process. See CLAUDE.md for the design
rationale and crates/neuron/src/harness/device_worker/ for the code.
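The thread-per-device pattern described above can be sketched with nothing but the standard library. This is a simplified illustration, not the code in `device_worker/`: `DeviceState`, `Job`, `DeviceWorker`, `spawn_worker`, and `forward` are hypothetical names, and a counter plus fake logits stand in for the CUDA context and real tensors.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Stand-in for device-resident state (CUDA context, weights, KV cache).
/// It is created, used, and dropped on the worker thread only.
struct DeviceState {
    kv_len: usize,
}

/// Work items sent to the device thread. Replies carry plain CPU data.
enum Job {
    Forward { tokens: Vec<u32>, reply: Sender<Vec<f32>> },
    ResetKv { reply: Sender<()> },
    Shutdown,
}

struct DeviceWorker {
    jobs: Sender<Job>,
    handle: JoinHandle<()>,
}

/// Spawn one named OS thread that owns the device state for its lifetime.
fn spawn_worker(device_index: usize) -> std::io::Result<DeviceWorker> {
    let (jobs, inbox) = mpsc::channel::<Job>();
    let handle = thread::Builder::new()
        .name(format!("cuda-dev-{}", device_index))
        .spawn(move || {
            let mut state = DeviceState { kv_len: 0 };
            for job in inbox {
                match job {
                    Job::Forward { tokens, reply } => {
                        state.kv_len += tokens.len();
                        // Fake logits: only CPU-side values leave the thread.
                        let logits = vec![state.kv_len as f32; 4];
                        // The requester may have gone away; ignore send errors.
                        let _ = reply.send(logits);
                    }
                    Job::ResetKv { reply } => {
                        state.kv_len = 0;
                        let _ = reply.send(());
                    }
                    Job::Shutdown => break,
                }
            }
            // `state` is dropped here, on the thread that created it.
        })?;
    Ok(DeviceWorker { jobs, handle })
}

/// Blocking helper: submit a forward pass and wait for the CPU-side result.
fn forward(worker: &DeviceWorker, tokens: Vec<u32>) -> Vec<f32> {
    let (reply, result) = mpsc::channel();
    worker
        .jobs
        .send(Job::Forward { tokens, reply })
        .expect("worker thread is alive");
    result.recv().expect("worker replied")
}
```

Because every job carries its own reply channel, callers on any thread can submit work while device-bound values stay confined to the one thread that owns them.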
The neuron RPM (helexa-neuron) ships a systemd unit:
dnf copr enable helexa/helexa
dnf install helexa-neuron
systemctl enable --now neuron
Gateway config
# /etc/cortex/cortex.toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru" # lru | priority
defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
[[neurons]]
name = "benjy"
endpoint = "http://benjy.internal:13131"
Model placement profiles live in models.toml — see models.example.toml.
Building
cargo build --release
CI
Every push triggers format, lint, and test checks. Ensure these pass locally before pushing:
cargo fmt --check --all # must be clean
cargo clippy --workspace -- -D warnings # warnings are errors
cargo test --workspace # all tests must pass
Tagged releases (v*) additionally build SRPMs for both cortex and
helexa-neuron and publish to COPR.
Running
# start the gateway
cortex serve --config /etc/cortex/cortex.toml
# check fleet status
cortex status
# list all models across nodes
curl http://localhost:31313/v1/models
License
GPL-3.0