Adds `harness/tp/tp_qwen3_5.rs` — the tensor-parallel variant of the
Qwen3-Next architecture — plus the dispatch wiring needed to route a
load through it on both the leader and the workers.
Architecture pieces (all per-rank, follow `tp_qwen3.rs` patterns for
the full-attention layers + a new pattern for linear-attention):
- TpQwen3_5GatedDeltaNet: V-head-dim sharded. `num_v_heads / world_size`
V-heads per rank, `num_k_heads / world_size` K-heads. `in_proj_z`,
`in_proj_b`, `in_proj_a`, `A_log`, `dt_bias` shard uniformly along
the V-head dim. `out_proj` is row-parallel + AllReduce (the only
collective inside the block). The recurrent state shards 1:1 with
V-heads — no cross-rank sync inside the delta-rule loop.
`in_proj_qkv` and `conv1d.weight` are FUSED tensors with three
regions along dim 0 (`[first key_dim, second key_dim, value_dim]`).
Standard uniform-slicing doesn't align with the head boundaries —
rank 0 would end up with `[first half of K_0, full K_1, first half
of V]`. New `load_fused_qkv_slice_{2d,3d}` helpers load the full
tensor, narrow per-region per-rank, and `Tensor::cat` the three
slices into a per-rank fused weight. Transient peak of one full
tensor per layer during construction; net memory is properly per-
rank after the full drops.
- TpQwen3_5Attention: column-parallel `q_proj` (the widened
`2 * num_heads * head_dim` output, including the gate half — shards
along the head axis so both query AND gate halves stay consistent
per rank), `k_proj`, `v_proj`; row-parallel `o_proj` with AllReduce.
Otherwise mirrors `tp_qwen3.rs`'s attention.
- TpQwen3_5MLP, TpQwen3_5DecoderLayer (dispatches on layer_types),
TpQwen3_5Model (with `model.language_model.*` prefix), and
TpQwen3_5ForCausalLM (with tied or separate `lm_head` at top level).
Dispatch wiring:
- New `tp::TpLeaderModel` enum holds either Qwen3 or Qwen3_5 variant.
`WorkerPool::load_dense_shard` now dispatches on `model_type` from
the config JSON and returns `Arc<Mutex<TpLeaderModel>>`. The two
downstream methods (`generate_step`, `clear_kv_cache`) thread this
enum through — the inner forward+clear_kv_cache dispatch happens
via the enum's pub methods. Adding another TP architecture later is
one more enum variant + match arms.
- Worker side gets a parallel `WorkerModel` enum + dispatch in
`handle_load_dense_shard`, branching on the same `model_type`.
- Harness gate `TP_SUPPORTED_MODEL_TYPES` now `["qwen3", "qwen3_5"]`.
`TpLoadedModel.leader_model` retyped to the enum.
Helpers in `arch/qwen3_5/linear_attn.rs`:
- `softplus` and `repeat_interleave` made `pub(crate)` so the TP
module reuses them rather than duplicating.
Reuses unchanged: `Qwen3_5RmsNorm` (replicated weight), the gated
`Qwen3_5RmsNormGated` tail, `l2norm`, the `RotaryEmbedding` (partial
RoPE with `partial_rotary_factor` already correct).
CPU build + clippy + 32 lib tests pass; `cargo clippy --features cuda`
also clean inside the patched runner container.
Single inflight risk to call out: tensor names. For full-attention
layers the per-layer prefix is `model.language_model.layers.<i>.self_attn.*`
and for linear-attention layers `model.language_model.layers.<i>.linear_attn.*`
— the same as the single-GPU path. lm_head sits at the top level (not
under `language_model`) — consistent with the single-GPU path that
validated against Qwen3.5-0.8B.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cortex
A Rust reverse-proxy and fleet management layer for multi-node GPU inference
clusters. Cortex sits in front of one or more neuron daemons (each running
candle-based inference on a local GPU host) and presents a unified OpenAI +
Anthropic compatible API surface.
Problem
Running local LLMs across multiple GPU nodes (different VRAM tiers, different model affinities) requires a unified API surface that:
- Presents a single
/v1/modelscatalogue merging every model that can be served by any neuron in the fleet. - Routes requests to the correct node based on where a model is loaded (or can be loaded), handling cold-load and eviction transparently.
- Manages model lifecycle — load on demand, unload cold models, pin
critical ones — by calling each neuron's
/models/{load,unload}API. - Translates between OpenAI and Anthropic request/response envelopes so every client speaks whichever dialect it prefers.
- Captures per-request metrics (tokens, tok/s, TTFT, latency) and exposes them as Prometheus counters/histograms.
Architecture
┌──────────────┐ ┌──────────┐ ┌────────────┐ ┌────────────┐
│ Claude Code │ │ Zed/IDE │ │ Tidal / mm │ │ curl / etc │
└──────┬───────┘ └─────┬────┘ └──────┬─────┘ └──────┬─────┘
│ │ │ │
└────────────────┴──────┬───────┴───────────────┘
│
┌──────────▼──────────┐
│ cortex │
│ (cortex-gateway) │
│ │
│ Router · Metrics │
│ Evictor · Translate│
└──┬──────┬────────┬──┘
│ │ │
┌──────────▼┐ ┌──▼─────┐ ┌▼──────────┐
│ neuron │ │ neuron │ │ neuron │
│ :13131 │ │ :13131 │ │ :13131 │
│ candle │ │ candle │ │ candle │
└───────────┘ └────────┘ └───────────┘
private network (.internal)
Crates
| Crate | Purpose |
|---|---|
cortex-core |
Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
cortex-gateway |
Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
neuron |
Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API |
cortex-cli |
CLI entrypoint (cortex serve, cortex status, etc.) |
Node setup
Each GPU node runs neuron (listening on :13131). Neuron uses
huggingface/candle for in-process inference — there is no external
inference subprocess to manage.
The neuron RPM (helexa-neuron) ships a systemd unit:
dnf copr enable helexa/helexa
dnf install helexa-neuron
systemctl enable --now neuron
Gateway config
# /etc/cortex/cortex.toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru" # lru | priority
defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
[[neurons]]
name = "benjy"
endpoint = "http://benjy.internal:13131"
Model placement profiles live in models.toml — see models.example.toml.
Building
cargo build --release
CI
Every push triggers format, lint, and test checks. Ensure these pass locally before pushing:
cargo fmt --check --all # must be clean
cargo clippy --workspace -- -D warnings # warnings are errors
cargo test --workspace # all tests must pass
Tagged releases (v*) additionally build SRPMs for both cortex and
helexa-neuron and publish to COPR.
Running
# start the gateway
cortex serve --config /etc/cortex/cortex.toml
# check fleet status
cortex status
# list all models across nodes
curl http://localhost:31313/v1/models
License
GPL-3.0