docs(generic): refresh GPU/inference for the helexa neuron+cortex stack

§11 still described mistral.rs on :1234 and a "planned" cortex.internal. Update to current reality: candle-native neuron daemons on :13131 per GPU host, fronted by the cortex gateway (live at hanzalova.internal:31313 / :31314, unified OpenAI+Anthropic, routing + eviction across the fleet). Note no-auth-yet (governance epic pending), discover-via-/v1/models, cortex.internal:443 still aspirational, and the helexa-bench telemetry endpoint. Repo link updated to helexa/helexa. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:53:54 +03:00
parent 88357d9607
commit 135601d8aa
1 changed files with 15 additions and 8 deletions
--- a/generic.md
+++ b/generic.md
@@ -548,18 +548,25 @@ templated `step@<name>` unit. That pattern is documented separately in
 - Podman quadlets for containerised workloads; bare-metal systemd units for native Rust binaries (preferred where feasible).

 ### GPU / inference
-Three bare-metal GPU hosts run [`mistral.rs`](https://github.com/EricLBuehler/mistral.rs) serving an OpenAI-compatible API on port `1234`:
+The office site (`hanzalova`) runs the **helexa** self-hosted inference stack ([`git.lair.cafe/helexa/helexa`](https://git.lair.cafe/helexa/helexa)): a candle-native per-host daemon (**neuron**) on each GPU box, fronted by a control-plane gateway (**cortex**).
+
+**neuron** — one per GPU host, in-process [candle](https://github.com/huggingface/candle) inference (no external engine), OpenAI-compatible API (`/v1/chat/completions`, `/v1/responses`) on port `13131`:

 | Host | GPU(s) |
 | --- | --- |
-| `beast.hanzalova.internal:1234` | 2× RTX 5090 |
-| `benjy.hanzalova.internal:1234` | 1× RTX 4090 |
-| `quadbrat.hanzalova.internal:1234` | 1× RTX 3060 |
+| `beast.hanzalova.internal:13131` | 2× RTX 5090 |
+| `benjy.hanzalova.internal:13131` | 1× RTX 4090 |
+| `quadbrat.hanzalova.internal:13131` | 1× RTX 3060 |

- **No TLS, no auth.** The endpoints accept any bearer token (including a dummy one — most clients still require a non-empty token field). They are reachable only via the WireGuard mesh and protected at the network layer.
- Model availability and capacity differ per host. Each host loads a different set depending on VRAM, and the set changes over time. Consumers must discover what's loaded by querying `/v1/models` on each endpoint rather than hard-coding model names to hosts.
- **Planned: unified proxy at `https://cortex.internal:443`.** [`cortex`](https://git.lair.cafe/helexa/cortex) is an in-progress project that will load, evict, and route models across the three backends and expose a single TLS-terminated endpoint. Until it ships as functional, inference consumers must talk to the three backends directly and handle discovery/routing themselves.
- When `cortex` lands, consumers should point at `https://cortex.internal:443` and drop the direct-backend logic. Until then, a simple strategy is: query `/v1/models` on all three hosts, pick the host that has the requested model loaded (prefer larger GPUs first for throughput), and fall back through the list on errors.
+(Historical note: these hosts previously ran `mistral.rs` on `:1234`; helexa replaced that with the candle-native neuron daemon — see the project's `CLAUDE.md`.)
+
+**cortex** — the unified gateway, live on the office proxy host at **`hanzalova.internal:31313`** (API) / `:31314` (Prometheus metrics). It presents a single OpenAI **and** Anthropic compatible surface, routes by model name to the neuron holding the model, and manages load/unload/eviction across the fleet.
+
+- **Prefer cortex over hitting neurons directly.** Query `GET /v1/models` on cortex for the live, fleet-wide model set (it aggregates the catalogue × topology), then send requests there; cortex handles routing and cold-load.
+- **No auth yet.** Endpoints accept any (or no) bearer token; reachable only over the WireGuard mesh and protected at the network layer. Per-account API keys + token budgeting are planned (the helexa governance epic) but not yet enforced.
+- **Model set changes over time** and per host (VRAM-dependent) — discover via `/v1/models`, never hard-code model→host. (Currently ≈ Qwen3.6-27B on beast, Qwen3-8B on benjy, Qwen3-1.7B on quadbrat.)
+- Not yet TLS-terminated at a vanity name; `cortex.internal:443` remains aspirational. Until then, talk to `hanzalova.internal:31313` directly on the mesh.
+- **Fleet benchmark telemetry:** `helexa-bench` on `bob` continuously benchmarks each neuron build and serves a read API (`bob.hanzalova.internal:13132`), visualised at `bench.helexa.ai` (public) / `bench.internal` (mesh) — see `reverse-proxies.md`.

 ### Source hosting
 - **New projects are hosted on the self-hosted Gitea instance** at `git.lair.cafe` (or `git.internal` on the WireGuard mesh — both resolve to the same instance). Agentic contributors will usually have MCP access to this Gitea and should prefer it over any public forge when creating repos, issues, or PRs.