From 135601d8aace3f0d4860d1cef851b9c5183ed401 Mon Sep 17 00:00:00 2001 From: rob thijssen Date: Sun, 14 Jun 2026 15:53:54 +0300 Subject: [PATCH] docs(generic): refresh GPU/inference for the helexa neuron+cortex stack MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit §11 still described mistral.rs on :1234 and a "planned" cortex.internal. Update to current reality: candle-native neuron daemons on :13131 per GPU host, fronted by the cortex gateway (live at hanzalova.internal:31313 / :31314, unified OpenAI+Anthropic, routing + eviction across the fleet). Note no-auth-yet (governance epic pending), discover-via-/v1/models, cortex.internal:443 still aspirational, and the helexa-bench telemetry endpoint. Repo link updated to helexa/helexa. Co-Authored-By: Claude Opus 4.8 (1M context) --- generic.md | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/generic.md b/generic.md index a45b7e0..2dad34b 100644 --- a/generic.md +++ b/generic.md @@ -548,18 +548,25 @@ templated `step@` unit. That pattern is documented separately in - Podman quadlets for containerised workloads; bare-metal systemd units for native Rust binaries (preferred where feasible). ### GPU / inference -Three bare-metal GPU hosts run [`mistral.rs`](https://github.com/EricLBuehler/mistral.rs) serving an OpenAI-compatible API on port `1234`: +The office site (`hanzalova`) runs the **helexa** self-hosted inference stack ([`git.lair.cafe/helexa/helexa`](https://git.lair.cafe/helexa/helexa)): a candle-native per-host daemon (**neuron**) on each GPU box, fronted by a control-plane gateway (**cortex**). + +**neuron** — one per GPU host, in-process [candle](https://github.com/huggingface/candle) inference (no external engine), OpenAI-compatible API (`/v1/chat/completions`, `/v1/responses`) on port `13131`: | Host | GPU(s) | | --- | --- | -| `beast.hanzalova.internal:1234` | 2× RTX 5090 | -| `benjy.hanzalova.internal:1234` | 1× RTX 4090 | -| `quadbrat.hanzalova.internal:1234` | 1× RTX 3060 | +| `beast.hanzalova.internal:13131` | 2× RTX 5090 | +| `benjy.hanzalova.internal:13131` | 1× RTX 4090 | +| `quadbrat.hanzalova.internal:13131` | 1× RTX 3060 | -- **No TLS, no auth.** The endpoints accept any bearer token (including a dummy one — most clients still require a non-empty token field). They are reachable only via the WireGuard mesh and protected at the network layer. -- Model availability and capacity differ per host. Each host loads a different set depending on VRAM, and the set changes over time. Consumers must discover what's loaded by querying `/v1/models` on each endpoint rather than hard-coding model names to hosts. -- **Planned: unified proxy at `https://cortex.internal:443`.** [`cortex`](https://git.lair.cafe/helexa/cortex) is an in-progress project that will load, evict, and route models across the three backends and expose a single TLS-terminated endpoint. Until it ships as functional, inference consumers must talk to the three backends directly and handle discovery/routing themselves. -- When `cortex` lands, consumers should point at `https://cortex.internal:443` and drop the direct-backend logic. Until then, a simple strategy is: query `/v1/models` on all three hosts, pick the host that has the requested model loaded (prefer larger GPUs first for throughput), and fall back through the list on errors. +(Historical note: these hosts previously ran `mistral.rs` on `:1234`; helexa replaced that with the candle-native neuron daemon — see the project's `CLAUDE.md`.) + +**cortex** — the unified gateway, live on the office proxy host at **`hanzalova.internal:31313`** (API) / `:31314` (Prometheus metrics). It presents a single OpenAI **and** Anthropic compatible surface, routes by model name to the neuron holding the model, and manages load/unload/eviction across the fleet. + +- **Prefer cortex over hitting neurons directly.** Query `GET /v1/models` on cortex for the live, fleet-wide model set (it aggregates the catalogue × topology), then send requests there; cortex handles routing and cold-load. +- **No auth yet.** Endpoints accept any (or no) bearer token; reachable only over the WireGuard mesh and protected at the network layer. Per-account API keys + token budgeting are planned (the helexa governance epic) but not yet enforced. +- **Model set changes over time** and per host (VRAM-dependent) — discover via `/v1/models`, never hard-code model→host. (Currently ≈ Qwen3.6-27B on beast, Qwen3-8B on benjy, Qwen3-1.7B on quadbrat.) +- Not yet TLS-terminated at a vanity name; `cortex.internal:443` remains aspirational. Until then, talk to `hanzalova.internal:31313` directly on the mesh. +- **Fleet benchmark telemetry:** `helexa-bench` on `bob` continuously benchmarks each neuron build and serves a read API (`bob.hanzalova.internal:13132`), visualised at `bench.helexa.ai` (public) / `bench.internal` (mesh) — see `reverse-proxies.md`. ### Source hosting - **New projects are hosted on the self-hosted Gitea instance** at `git.lair.cafe` (or `git.internal` on the WireGuard mesh — both resolve to the same instance). Agentic contributors will usually have MCP access to this Gitea and should prefer it over any public forge when creating repos, issues, or PRs.