rob thijssen 9b0ed0b57f
Some checks failed
CI / Format (push) Successful in 30s
build-prerelease / Resolve version stamps (push) Successful in 41s
build-prerelease / Build neuron-blackwell (push) Successful in 3m34s
CI / Clippy (push) Successful in 7m25s
build-prerelease / Build neuron-ampere (push) Successful in 4m57s
build-prerelease / Build cortex binary (push) Successful in 4m15s
build-prerelease / Build neuron-ada (push) Successful in 5m14s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m6s
CI / Test (push) Failing after 4m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
fix(router): rewrite loopback inference URLs to use neuron's host
Neuron hardcodes its bind_url as `http://localhost:13131` (it can't
reliably know its own externally-resolvable name). When cortex runs
on a different host than the neuron it's routing to, blindly
proxying to that URL hits localhost on the cortex box instead of the
neuron.

Cortex already knows each neuron's reachable host from cortex.toml.
After fetching the inference URL from `/models/{id}/endpoint`, if
the host is a loopback name (localhost / 127.0.0.1 / 0.0.0.0 / ::1),
swap it for the configured neuron host. Preserve the port and path
from neuron's URL so a future harness serving inference on a
different port than the management API still works.

Adds `url` (already a transitive dep via reqwest) as a direct
dep for the URL parsing.

Tests cover: localhost rewrite, distinct inference port preservation,
non-loopback passthrough, malformed input.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:23:47 +03:00

cortex

A Rust reverse-proxy and fleet management layer for multi-node GPU inference clusters. Cortex sits in front of one or more neuron daemons (each running candle-based inference on a local GPU host) and presents a unified OpenAI + Anthropic compatible API surface.

Problem

Running local LLMs across multiple GPU nodes (different VRAM tiers, different model affinities) requires a unified API surface that:

  • Presents a single /v1/models catalogue merging every model that can be served by any neuron in the fleet.
  • Routes requests to the correct node based on where a model is loaded (or can be loaded), handling cold-load and eviction transparently.
  • Manages model lifecycle — load on demand, unload cold models, pin critical ones — by calling each neuron's /models/{load,unload} API.
  • Translates between OpenAI and Anthropic request/response envelopes so every client speaks whichever dialect it prefers.
  • Captures per-request metrics (tokens, tok/s, TTFT, latency) and exposes them as Prometheus counters/histograms.

Architecture

┌──────────────┐  ┌──────────┐  ┌────────────┐  ┌────────────┐
│ Claude Code  │  │ Zed/IDE  │  │ Tidal / mm │  │ curl / etc │
└──────┬───────┘  └─────┬────┘  └──────┬─────┘  └──────┬─────┘
       │                │              │               │
       └────────────────┴──────┬───────┴───────────────┘
                               │
                    ┌──────────▼──────────┐
                    │      cortex         │
                    │  (cortex-gateway)   │
                    │                     │
                    │  Router · Metrics   │
                    │  Evictor · Translate│
                    └──┬──────┬────────┬──┘
                       │      │        │
            ┌──────────▼┐  ┌──▼─────┐  ┌▼──────────┐
            │  neuron   │  │ neuron │  │  neuron   │
            │  :13131   │  │ :13131 │  │  :13131   │
            │  candle   │  │ candle │  │  candle   │
            └───────────┘  └────────┘  └───────────┘
                  private network (.internal)

Crates

Crate Purpose
cortex-core Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types
cortex-gateway Axum HTTP server: proxy, router, evictor, poller, metrics exporter
neuron Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API
cortex-cli CLI entrypoint (cortex serve, cortex status, etc.)

Node setup

Each GPU node runs neuron (listening on :13131). Neuron uses huggingface/candle for in-process inference — there is no external inference subprocess to manage.

The neuron RPM (helexa-neuron) ships a systemd unit:

dnf copr enable helexa/helexa
dnf install helexa-neuron
systemctl enable --now neuron

Gateway config

# /etc/cortex/cortex.toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"

[eviction]
strategy = "lru"        # lru | priority
defrag_after_cycles = 50

[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"

[[neurons]]
name = "benjy"
endpoint = "http://benjy.internal:13131"

Model placement profiles live in models.toml — see models.example.toml.

Building

cargo build --release

CI

Every push triggers format, lint, and test checks. Ensure these pass locally before pushing:

cargo fmt --check --all                    # must be clean
cargo clippy --workspace -- -D warnings   # warnings are errors
cargo test --workspace                     # all tests must pass

Tagged releases (v*) additionally build SRPMs for both cortex and helexa-neuron and publish to COPR.

Running

# start the gateway
cortex serve --config /etc/cortex/cortex.toml

# check fleet status
cortex status

# list all models across nodes
curl http://localhost:31313/v1/models

License

GPL-3.0

Description
No description provided
Readme GPL-3.0 1.9 MiB
Languages
Rust 90.6%
Cuda 4.6%
Shell 3.9%
Python 0.9%