docs(generic): document GPU inference hosts and planned cortex proxy
Add the three mistral.rs backends (beast, benjy, quadbrat) with their GPU capacity and the port 1234 / no-auth / no-TLS contract. Note that consumers must currently discover model availability per-host via /v1/models, and that cortex (git.lair.cafe/helexa/cortex) will eventually unify them behind https://cortex.internal:443.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14
generic.md
@@ -532,6 +532,20 @@ Ship these `.path` and cert-reload `.service` units from `asset/systemd/` the sa
- SELinux enforcing per §10.
- Podman quadlets for containerised workloads; bare-metal systemd units for native Rust binaries (preferred where feasible).

### GPU / inference
Three bare-metal GPU hosts run [`mistral.rs`](https://github.com/EricLBuehler/mistral.rs) serving an OpenAI-compatible API on port `1234`:

| Host | GPU(s) |
| --- | --- |
| `beast.hanzalova.internal:1234` | 2× RTX 5090 |
| `benjy.hanzalova.internal:1234` | 1× RTX 4090 |
| `quadbrat.hanzalova.internal:1234` | 1× RTX 3060 |

- **No TLS, no auth.** The endpoints accept any bearer token, including a dummy one; most clients still require a non-empty token field. They are reachable only via the WireGuard mesh and are protected at the network layer.
- Model availability and capacity differ per host. Each host loads a different set depending on VRAM, and the set changes over time. Consumers must discover what's loaded by querying `/v1/models` on each endpoint rather than hard-coding model names to hosts.
- **Planned: unified proxy at `https://cortex.internal:443`.** [`cortex`](https://git.lair.cafe/helexa/cortex) is an in-progress project that will load, evict, and route models across the three backends and expose a single TLS-terminated endpoint. Until it is functional, inference consumers must talk to the three backends directly and handle discovery and routing themselves.
- When `cortex` lands, consumers should point at `https://cortex.internal:443` and drop the direct-backend logic. Until then, a simple strategy is: query `/v1/models` on all three hosts, pick the host that has the requested model loaded (prefer larger GPUs first for throughput), and fall back through the list on errors.
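The interim strategy above could be sketched roughly as follows. This is a minimal illustration, not a reference client: the helper names (`fetch_model_ids`, `candidates_for`, `with_fallback`) and the `DUMMY_TOKEN` value are hypothetical, the `http://` scheme is inferred from the no-TLS note, and the `/v1/models` response is assumed to follow the OpenAI list shape (`{"data": [{"id": ...}]}`).

```python
import json
import urllib.request

# Hosts from the table above, ordered largest GPU first.
# Plain http:// is an assumption based on the "no TLS" contract.
BACKENDS = [
    "http://beast.hanzalova.internal:1234",
    "http://benjy.hanzalova.internal:1234",
    "http://quadbrat.hanzalova.internal:1234",
]

# Any non-empty token should be accepted; this value is a placeholder.
DUMMY_TOKEN = "dummy"


def fetch_model_ids(base_url, timeout=5.0):
    """Return the set of model ids a backend reports on /v1/models."""
    req = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {DUMMY_TOKEN}"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumed OpenAI-style list response: {"data": [{"id": "..."}, ...]}.
    return {entry["id"] for entry in payload.get("data", [])}


def candidates_for(model, backends=BACKENDS, fetch=fetch_model_ids):
    """List backends that have `model` loaded, in preference order.

    Hosts that fail discovery are skipped rather than treated as fatal.
    """
    hosts = []
    for base_url in backends:
        try:
            if model in fetch(base_url):
                hosts.append(base_url)
        except (OSError, ValueError, KeyError):
            continue  # unreachable host or unexpected response body
    return hosts


def with_fallback(model, request, **kwargs):
    """Call request(base_url) on each candidate until one succeeds."""
    last_error = None
    for base_url in candidates_for(model, **kwargs):
        try:
            return request(base_url)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next host
    raise RuntimeError(f"no backend could serve {model!r}") from last_error
```

A caller would pass `with_fallback` a function that takes a base URL and performs the actual request (for example a chat completion). Once `cortex` is available, all of this should reduce to a single base URL.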

### Source hosting
- **New projects are hosted on the self-hosted Gitea instance** at `git.lair.cafe` (or `git.internal` on the WireGuard mesh — both resolve to the same instance). Agentic contributors will usually have MCP access to this Gitea and should prefer it over any public forge when creating repos, issues, or PRs.
- **Legacy projects** live under various GitHub / GitLab orgs tied to my public username (`grenade`). These will continue to exist but are being migrated to Gitea over time, especially when they come up for a refactor.