ci: commit generated %changelog entries back to main

Previously the srpm-* jobs generated a fresh %changelog entry and
shipped it to COPR, but the version-stamped spec pushed back to main
by the bump-version job only updated the Version: line — not the
%changelog section. The result: SRPM and in-tree spec diverged and
a fresh clone of the repo showed a perpetually empty changelog.

Run the rpm-changelog action in bump-version too. Now the committed
specs track the SRPMs: each release leaves a dated %changelog entry
in main covering commits since the previous tag, visible in git log
and in the repo's spec browser.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-16 16:37:03 +03:00

.gitea/workflows

ci: commit generated %changelog entries back to main

2026-04-16 16:37:03 +03:00

crates

refactor: rename cortex-neuron binary and crate to neuron

2026-04-15 15:51:15 +03:00

data

fix(neuron): run service as neuron user, not cortex

2026-04-16 13:32:36 +03:00

.gitignore

feat: add neuron daemon with GPU discovery and health endpoints

2026-04-15 14:23:42 +03:00

Cargo.lock

chore: bump version to 0.1.12

2026-04-16 15:47:21 +03:00

Cargo.toml

chore: bump version to 0.1.12

2026-04-16 15:47:21 +03:00

CLAUDE.md

ci: add RPM packaging for cortex and neuron

2026-04-15 16:09:04 +03:00

cortex.example.toml

feat: scaffold cortex workspace

2026-04-14 18:13:30 +03:00

cortex.spec

chore: bump version to 0.1.12

2026-04-16 15:47:21 +03:00

LICENSE

ci: add Gitea CI, RPM spec, license, and repo hygiene

2026-04-14 18:24:04 +03:00

models.example.toml

ci: add RPM packaging for cortex and neuron

2026-04-15 16:09:04 +03:00

neuron.example.toml

fix(neuron): install config at /etc/neuron/, not /etc/cortex/

2026-04-16 13:07:06 +03:00

neuron.spec

chore: bump version to 0.1.12

2026-04-16 15:47:21 +03:00

README.md

docs: add CI expectations to CLAUDE.md and README.md

2026-04-14 18:27:17 +03:00

README.md

cortex

A Rust reverse-proxy and fleet management layer for multi-node mistral.rs inference clusters.

Problem

Running local LLMs across multiple GPU nodes (different VRAM tiers, different model affinities) requires a unified API surface that:

- Presents a single /v1/models catalogue merging every model across every node.
- Routes requests to the correct node based on where a model is loaded (or can be loaded).
- Manages model lifecycle (unload cold models, reload on demand, pin critical ones) using the mistral.rs /v1/models/{unload,reload,status} HTTP API (PR #1828+).
- Translates between OpenAI and Anthropic request/response envelopes so every client in the homelab speaks whichever dialect it prefers.
- Captures per-request metrics (tokens, tok/s, TTFT, latency) and exposes them as Prometheus counters/histograms.
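
The first two requirements (a merged catalogue and routing by where a model is loaded) can be sketched with the standard library alone. This is a minimal illustration, not the gateway's actual code: `NodeView`, `merged_catalogue`, and `route` are hypothetical names, and the tie-breaking rule (first matching node in list order) is an assumption.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical per-node view: which models a node declares and which are loaded.
#[derive(Debug, Clone)]
pub struct NodeView {
    pub name: String,
    pub declared: BTreeSet<String>,
    pub loaded: BTreeSet<String>,
}

impl NodeView {
    pub fn new(name: &str, declared: &[&str], loaded: &[&str]) -> Self {
        let loaded: BTreeSet<String> = loaded.iter().map(|m| m.to_string()).collect();
        let mut declared: BTreeSet<String> = declared.iter().map(|m| m.to_string()).collect();
        // A loaded model is by definition also declared on that node.
        declared.extend(loaded.iter().cloned());
        NodeView { name: name.to_string(), declared, loaded }
    }
}

/// Merge every node's declared models into one catalogue:
/// model id -> names of the nodes that can serve it.
pub fn merged_catalogue(nodes: &[NodeView]) -> BTreeMap<String, Vec<String>> {
    let mut catalogue: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for node in nodes {
        for model in &node.declared {
            catalogue.entry(model.clone()).or_default().push(node.name.clone());
        }
    }
    catalogue
}

/// Prefer a node where the model is already loaded; otherwise fall back to
/// the first node that declares it (and could load it on demand).
pub fn route<'a>(nodes: &'a [NodeView], model: &str) -> Option<&'a NodeView> {
    nodes
        .iter()
        .find(|n| n.loaded.contains(model))
        .or_else(|| nodes.iter().find(|n| n.declared.contains(model)))
}
```

With a node `gpu-large` that declares a model and a node `gpu-small` that has it loaded, `route` returns `gpu-small`. A real router would likely also weigh free VRAM and node health, which this sketch leaves out.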

Architecture

┌──────────────┐  ┌──────────┐  ┌────────────┐  ┌────────────┐
│ Claude Code  │  │ Zed/IDE  │  │ Tidal / mm │  │ curl / etc │
└──────┬───────┘  └─────┬────┘  └──────┬─────┘  └──────┬─────┘
       │                │              │               │
       └────────────────┴──────┬───────┴───────────────┘
                               │
                    ┌──────────▼──────────┐
                    │       cortex        │
                    │  (cortex-gateway)   │
                    │                     │
                    │  Router · Metrics   │
                    │  Evictor · Translate│
                    └──┬──────┬────────┬──┘
                       │      │        │
            ┌──────────▼┐  ┌──▼─────┐  ┌▼──────────┐
            │ gpu-large │  │gpu-med │  │ gpu-small │
            │ mistralrs │  │mistral │  │ mistralrs │
            │ serve     │  │rs serve│  │ serve     │
            │ :8080     │  │ :8080  │  │  :8080    │
            └───────────┘  └────────┘  └───────────┘
                  private network (.internal)

Crates

| Crate | Purpose |
| --- | --- |
| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic request/response envelopes |
| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, metrics exporter |
| `cortex-agent` | Per-node sidecar: polls local mistralrs, reports to gateway, handles restart/defrag |
| `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) |
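
As an illustration of the kind of shared metrics type `cortex-core` might hold, the sketch below derives tok/s from a per-request sample and keeps a Prometheus-style cumulative histogram. All names (`RequestSample`, `Histogram`) and the choice to measure throughput over the decode phase only are assumptions, not the crate's real API.

```rust
use std::time::Duration;

/// Hypothetical per-request sample; field names are illustrative.
#[derive(Debug, Clone, Copy)]
pub struct RequestSample {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    /// Time from request arrival to the first generated token (TTFT).
    pub time_to_first_token: Duration,
    /// Total wall-clock latency of the request.
    pub total_latency: Duration,
}

impl RequestSample {
    /// Decode throughput in tokens per second, measured over the time
    /// after the first token. Returns None when it cannot be computed.
    pub fn tokens_per_second(&self) -> Option<f64> {
        let decode = self.total_latency.checked_sub(self.time_to_first_token)?;
        let secs = decode.as_secs_f64();
        if secs > 0.0 && self.completion_tokens > 0 {
            Some(f64::from(self.completion_tokens) / secs)
        } else {
            None
        }
    }
}

/// Minimal cumulative histogram: each bucket counts observations that are
/// less than or equal to its upper bound, as in the Prometheus data model.
pub struct Histogram {
    bounds: Vec<f64>,
    buckets: Vec<u64>,
    pub count: u64,
    pub sum: f64,
}

impl Histogram {
    /// `bounds` are the bucket upper bounds in ascending order.
    pub fn new(bounds: &[f64]) -> Self {
        Histogram { bounds: bounds.to_vec(), buckets: vec![0; bounds.len()], count: 0, sum: 0.0 }
    }

    pub fn observe(&mut self, value: f64) {
        for (bound, bucket) in self.bounds.iter().zip(self.buckets.iter_mut()) {
            if value <= *bound {
                *bucket += 1;
            }
        }
        // `count` doubles as the implicit +Inf bucket.
        self.count += 1;
        self.sum += value;
    }

    pub fn buckets(&self) -> &[u64] {
        &self.buckets
    }
}
```

For example, a request with 100 completion tokens, a TTFT of 500 ms, and a total latency of 2500 ms has a 2 s decode phase and therefore 50 tok/s under this definition.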

Node setup

Each GPU node runs mistralrs serve with a multi-model config. Models are declared but start unloaded — mistral.rs lazy-loads on first request and the gateway can explicitly unload/reload via the HTTP API.
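
The lifecycle described above can be read as a small state machine. The sketch below is one possible model of it: the state and event names are hypothetical, and the assumption that a reload always ends in a loaded model is mine. Only the three paths come from the mistral.rs API named earlier; request and response bodies are deliberately not shown.

```rust
/// Hypothetical gateway-side view of a model on one node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelState {
    Unloaded,
    Loaded,
}

/// Events that can change a model's state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    /// An inference request arrives for the model.
    Request,
    /// The gateway explicitly unloads the model.
    Unload,
    /// The gateway explicitly reloads the model.
    Reload,
}

/// Transition function: models start unloaded, load lazily on first
/// request, and can be unloaded or reloaded explicitly.
pub fn next_state(state: ModelState, event: Event) -> ModelState {
    match (state, event) {
        // Lazy load on first request; later requests keep it loaded.
        (_, Event::Request) => ModelState::Loaded,
        (_, Event::Unload) => ModelState::Unloaded,
        (_, Event::Reload) => ModelState::Loaded,
    }
}

/// Path of the lifecycle endpoint the gateway would call for an event,
/// if any. A plain request goes to the normal inference route instead.
pub fn lifecycle_path(event: Event) -> Option<&'static str> {
    match event {
        Event::Request => None,
        Event::Unload => Some("/v1/models/unload"),
        Event::Reload => Some("/v1/models/reload"),
    }
}

/// Path used to query a model's current state.
pub const STATUS_PATH: &str = "/v1/models/status";
```

Keeping the transition function pure makes the gateway's bookkeeping easy to test without a running node.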

Example node systemd unit:

# /etc/systemd/system/mistralrs.service
[Unit]
Description=mistral.rs inference server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/mistralrs serve \
    --from-config /etc/mistralrs/config.toml \
    --port 8080
Restart=on-failure
RestartSec=5
Environment=CUDA_VISIBLE_DEVICES=0,1

[Install]
WantedBy=multi-user.target

Gateway config

# cortex.toml
[gateway]
listen = "0.0.0.0:8000"
metrics_listen = "0.0.0.0:9100"

[eviction]
strategy = "lru"        # lru | priority
defrag_after_cycles = 50

[[nodes]]
name = "gpu-large"
endpoint = "http://gpu-large.internal:8080"
vram_mb = 49_152        # e.g. 2x RTX 4090
pinned = ["your-org/large-model"]

[[nodes]]
name = "gpu-medium"
endpoint = "http://gpu-medium.internal:8080"
vram_mb = 24_576        # e.g. RTX 4090
pinned = ["your-org/medium-model"]

[[nodes]]
name = "gpu-small"
endpoint = "http://gpu-small.internal:8080"
vram_mb = 12_288        # e.g. RTX 3060
pinned = ["your-org/embedding-model"]
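
To make the interaction between `strategy = "lru"` and `pinned` concrete, here is a minimal sketch of how an evictor could pick victims. It is not the gateway's implementation: `LoadedModel`, its per-model `vram_mb` estimate, the logical `last_used` tick, and `pick_victims` are all hypothetical, and the semantics of `defrag_after_cycles` are not modelled.

```rust
use std::collections::HashSet;

/// Hypothetical record of a model currently loaded on a node.
#[derive(Debug, Clone)]
pub struct LoadedModel {
    pub id: String,
    /// Logical clock tick of the most recent request (higher = more recent).
    pub last_used: u64,
    /// Assumed VRAM footprint of this model in MiB.
    pub vram_mb: u64,
}

/// Choose models to evict, least recently used first, never touching
/// pinned models, until at least `needed_mb` would be freed.
/// Returns None if evicting every unpinned model is still not enough.
pub fn pick_victims(
    loaded: &[LoadedModel],
    pinned: &HashSet<String>,
    needed_mb: u64,
) -> Option<Vec<String>> {
    let mut candidates: Vec<&LoadedModel> =
        loaded.iter().filter(|m| !pinned.contains(&m.id)).collect();
    // Oldest access first.
    candidates.sort_by_key(|m| m.last_used);

    let mut freed = 0u64;
    let mut victims = Vec::new();
    for model in candidates {
        if freed >= needed_mb {
            break;
        }
        freed += model.vram_mb;
        victims.push(model.id.clone());
    }

    if freed >= needed_mb {
        Some(victims)
    } else {
        None
    }
}
```

With `your-org/large-model` pinned on `gpu-large`, it is never returned as a victim even when it is the least recently used entry. A `priority` strategy would presumably replace the sort key.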

Building

cargo build --release

CI

Every push triggers format, lint, and test checks. Ensure these pass locally before pushing:

cargo fmt --check --all                    # must be clean
cargo clippy --workspace -- -D warnings   # warnings are errors
cargo test --workspace                     # all tests must pass

Tagged releases (tags matching v*) additionally build an SRPM and publish it to COPR.

Running

# start the gateway
cortex serve --config cortex.toml

# check fleet status
cortex status

# list all models across nodes
curl http://localhost:8000/v1/models

License

GPL-3.0