cortex

Author	SHA1	Message	Date
rob thijssen	9b0ed0b57f	fix(router): rewrite loopback inference URLs to use neuron's host Some checks failed CI / Format (push) Successful in 30s Details build-prerelease / Resolve version stamps (push) Successful in 41s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m34s Details CI / Clippy (push) Successful in 7m25s Details build-prerelease / Build neuron-ampere (push) Successful in 4m57s Details build-prerelease / Build cortex binary (push) Successful in 4m15s Details build-prerelease / Build neuron-ada (push) Successful in 5m14s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m6s Details CI / Test (push) Failing after 4m34s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Neuron hardcodes its bind_url as `http://localhost:13131` (it can't reliably know its own externally-resolvable name). When cortex runs on a different host than the neuron it's routing to, blindly proxying to that URL hits localhost on the cortex box instead of the neuron. Cortex already knows each neuron's reachable host from cortex.toml. After fetching the inference URL from `/models/{id}/endpoint`, if the host is a loopback name (localhost / 127.0.0.1 / 0.0.0.0 / ::1), swap it for the configured neuron host. Preserve the port and path from neuron's URL so a future harness serving inference on a different port than the management API still works. Adds `url` (already a transitive dep via reqwest) as a direct dep for the URL parsing. Tests cover: localhost rewrite, distinct inference port preservation, non-loopback passthrough, malformed input. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 06:23:47 +03:00
rob thijssen	8d7b099b36	feat(stage-8d-7): direct safetensors fused-region loader Some checks failed build-prerelease / Package cortex RPM (push) Blocked by required conditions Details CI / Format (push) Successful in 35s Details build-prerelease / Resolve version stamps (push) Successful in 39s Details CI / Clippy (push) Successful in 2m18s Details CI / Test (push) Successful in 4m28s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m51s Details build-prerelease / Build cortex binary (push) Successful in 4m13s Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details Replaces load_fused_qkv_slice_2d/_3d with reads from a separate MmapedSafetensors handle. Each per-rank fused tensor is built by reading the three region byte-slices directly from the mmap, concatenating them host-side, and uploading as one device allocation — no full-fused-tensor device materialisation. The prior approach allocated a ~100 MB transient device tensor per linear-attention layer; on Qwen3.6-27B with 48 linear-attn layers that's ~4.8 GB of allocator churn during load — enough to fragment the cuda caching allocator on a tight-VRAM 32 GB consumer GPU, which is what triggered the layer-22 up_proj OOM seen on beast. Threading: MmapedSafetensors flows worker → ForCausalLM → Model → DecoderLayer → GatedDeltaNet::load. Both leader (mod.rs) and worker (worker.rs) construct their own mmap; Linux's page cache shares the underlying pages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 17:49:35 +03:00
rob thijssen	1ebbe87651	feat(stage-8d-1): import mistralrs GDN CUDA kernels — build infra only Some checks failed build-prerelease / Build cortex binary (push) Blocked by required conditions Details CI / Test (push) Waiting to run Details build-prerelease / Resolve version stamps (push) Successful in 29s Details CI / Format (push) Successful in 40s Details CI / Clippy (push) Successful in 2m23s Details build-prerelease / Build neuron-blackwell (push) Has started running Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package cortex RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details CI / Build cortex SRPM (push) Has been cancelled Details CI / Build neuron SRPM (push) Has been cancelled Details CI / Publish cortex to COPR (push) Has been cancelled Details CI / Publish neuron to COPR (push) Has been cancelled Details CI / Bump version in source (push) Has been cancelled Details Stage 8d (new): port the Gated DeltaNet CUDA kernels from EricLBuehler/mistral.rs to close the ~500x decode performance gap we measured on Qwen3.6-27B TP-2 (~12s/token in our pure-candle path vs ~37 T/s in mistralrs on the same hardware). This commit lays the build infrastructure with zero behavioural change. Subsequent commits (8d-2 .. 8d-5) wire each kernel into the qwen3_5 architecture and TP variant. Added: - `crates/neuron/build.rs` — uses `cudaforge::KernelBuilder` to compile every `src/cuda/.cu` file into `libneuroncuda.a` under the `cuda` feature, then links it + `cudart`. Mirrors mistralrs's `mistralrs-core/build.rs` setup verbatim (same NVCC flag set, same sm_<80 bf16 gate). - `crates/neuron/src/cuda/gdn.cu` — five kernels ported verbatim from upstream: `gated_delta_rule_recurrence` (V-tiled per-token decode) * `chunked_gated_delta_rule_recurrence` (BT=64 chunked prefill) * `causal_conv1d_update` (single-token conv decode) * `causal_conv1d_full` (multi-token conv prefill) * `fused_gdn_gating` (beta = sigmoid(b); g = -exp(A_log) * softplus(a + dt_bias)) - `crates/neuron/src/cuda/gdn.rs` — Rust wrappers around the kernels, cudarc::CudaSlice::device_ptr boilerplate identical to upstream. - `crates/neuron/src/cuda/ffi.rs` — `extern "C"` decls (subset of upstream's ffi.rs covering only the five GDN kernels; MoE / SSM / top-k decls land here when we absorb those too). - `crates/neuron/src/cuda/mod.rs` — re-exports + module docs. Cargo wiring: `cudaforge` added as an optional build-dep, activated by the `cuda` feature. CPU build is unchanged (the `cuda/` module is fully `#[cfg(feature = "cuda")]`). The cuda feature build inside the patched container compiles `gdn.cu` (1 of 1 kernels) and links clean. Licensing: upstream files preserve their MIT origin via per-file comment banners pointing to the mistralrs path. No behaviour-relevant edits to the .cu kernels — local diff against upstream is just the banner. The `.rs` wrappers and `ffi.rs` subset are also from upstream; their structure (module path `crate::cuda::ffi::*`) matches identically so future kernel imports drop in unchanged. CPU clippy + 32 lib tests pass; `cargo clippy --features cuda` clean inside the runner container. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:34:11 +03:00
rob thijssen	96d8755245	fix(tp): add half dep + drop double-wrapped .w() on CudaDevice::alloc All checks were successful build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Successful in 2m17s Details CI / Test (push) Successful in 4m50s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m36s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m32s Details build-prerelease / Package cortex RPM (push) Successful in 1m25s Details build-prerelease / Build neuron-ampere (push) Successful in 5m13s Details build-prerelease / Build neuron-ada (push) Successful in 4m42s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m39s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m12s Details Two follow-up cuda-only fixes surfaced by `cargo build --features cuda` inside the cuda-13.0 runner container: 1. `half::{bf16, f16}` was an undeclared dep. Added `half = "2.5"` (matching candle-core's pinned major) under the cuda feature flag. 2. `dev.alloc::<T>(n)` already returns `candle_core::Result` (it calls `.w()` internally on the cudarc error). Calling `.w()?` on top of that needs `From<candle_core::Error> for CudaError`, which doesn't exist — collapse to `?`. Removed the now-unused `cuda_backend::WrapErr` import. Verified by `cargo build -p neuron --features cuda` and `cargo clippy -p neuron --all-targets --features cuda -- -D warnings` inside `git.lair.cafe/gongfoo/runner-cuda-13.0` with the local glibc/CUDA-13.0 math_functions.h noexcept patch. CPU clippy/tests stay green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 19:11:59 +03:00
rob thijssen	da068ded6d	Stage 7a-ii: real NCCL handshake behind the worker pool Some checks failed CI / Format (push) Failing after 38s Details build-prerelease / Resolve version stamps (push) Successful in 42s Details CI / Clippy (push) Successful in 2m18s Details build-prerelease / Build neuron-blackwell (push) Failing after 3m33s Details CI / Test (push) Successful in 4m27s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m31s Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Build neuron-ampere (push) Failing after 4m19s Details build-prerelease / Build neuron-ada (push) Failing after 4m56s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details Wires cudarc::nccl into the TP worker lifecycle introduced in 7a-i. With --features cuda the leader and its workers now establish a live NCCL communicator end-to-end; without the feature the same code paths return Error{kind="cuda_feature_not_enabled"} so a misconfigured build is obvious instead of silently no-op. NCCL state machine (harness/tp/nccl_state.rs) is shared between the worker process and the leader's pool: - generate_comm_id_hex() mints an Id::new() on the leader. - NcclState::init parses 256 hex chars → [c_char; 128] → Id::uninit, opens a CudaContext on the configured device, calls Comm::from_rank with the supplied (rank, world_size, id). NCCL blocks until every rank has joined. - NcclState::sanity_check runs one all_reduce(1u32, Sum); the leader asserts every rank reports observed_sum == world_size. - NCCL handles serialised under Mutex; unsafe impl Send/Sync gates the Comm across spawn_blocking boundaries (NCCL is move-safe; only concurrent op issuance is unsafe). WorkerPool::init_nccl orchestrates the rendezvous: 1. Write Init { comm_id } to every worker's stdin (no await yet). 2. Leader rank 0 calls its own Comm::from_rank in spawn_blocking, concurrently with workers. 3. NCCL handshake completes for all ranks simultaneously. 4. Leader collects InitOk responses. WorkerPool::nccl_sanity_check follows the same pattern over all_reduce, validating world_size == observed_sum on every rank. Worker.send_only / Worker.recv_only split out from the previous monolithic Worker.request so the leader can interleave its own NCCL work with the worker calls — required because NCCL blocks during init. Tests: - 4 hex roundtrip unit tests for the wire encoding. - The 7a-i "not implemented" expectation now reads "cuda_feature_not_enabled" on the local dev box (no CUDA), or accepts InitOk on a cuda-built test binary. - New cuda-integration test in tp_worker_lifecycle_cuda.rs covers the real init + sanity round-trip; gated on the cuda-integration feature so default CI doesn't try to NCCL. Verifiable on beast (2× RTX 5090): cargo test -p neuron --features cuda-integration \ --test tp_worker_lifecycle_cuda Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 16:40:01 +03:00
rob thijssen	84f5662df1	feat(neuron): OpenAI-compatible SSE streaming chat completions Stage 4 of the candle-native pivot. /v1/chat/completions now switches to text/event-stream when the request sets stream: true, emitting one chat.completion.chunk per generated token followed by the OpenAI [DONE] terminator. Pipeline: - chat_completion_stream creates a bounded mpsc::channel<ChatCompletionChunk>(32), sends the leading role chunk, then spawns a blocking task that acquires the per-model arch lock and runs the streaming generation loop. - run_inference_streaming tracks a cumulative decoded prefix so each chunk's delta.content is the substring added since the last chunk — safe across BPE byte-fallback boundaries that would otherwise split multi-byte UTF-8 chars. - The blocking task aborts cleanly if blocking_send fails (client disconnected), so generation stops when the SSE consumer hangs up. - Final chunk carries finish_reason ("stop" on EOS, "length" on max_tokens). The handler appends data: [DONE] after the channel closes. The Stage 3 streaming 501 placeholder test is repurposed: with the streaming path live, an unloaded model now hits the same 404 surface as the non-streaming path (the model lookup happens first). cortex-gateway's existing proxy is unchanged — it already forwards SSE bytes verbatim from Phase 2 work, so the candle SSE format passes through unmodified. Neuron Cargo.toml gains futures + tokio-stream (both already in workspace deps) for ReceiverStream and stream combinators. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:53:14 +03:00
rob thijssen	5c957d08ec	ci: add build-prerelease workflow for CUDA RPMs on rpm.lair.cafe Some checks failed CI / Format (push) Successful in 36s Details CI / Test (push) Failing after 53s Details CI / Clippy (push) Successful in 2m35s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Adds a manually-triggered workflow that builds CUDA-flavoured neuron binaries and a CPU cortex binary, packages them as Fedora RPMs, signs them, and rsyncs to the unstable channel at https://rpm.lair.cafe/fedora/43/x86_64/unstable/. Mirrors the build pipeline used by grenade/mistralrs-package. Pipeline: - prepare: derive {version,short_sha,commit_date} from the checkout; the prerelease Release stamp "0.1.YYYYMMDDgitSHORTSHA" sorts below the eventual "1" stable release. - build-cortex: cargo build --release -p cortex-cli on a rust runner. - build-neuron: matrix over ada (sm_89) and blackwell (sm_120) on cuda-13.0 runners; cargo build with features "cuda cudnn flash-attn" and CUDA_COMPUTE_CAP set per flavour. - package-{cortex,neuron}: rpmbuild on the rpm runner against the new prebuilt-binary specs in rpm/. - publish: import signing key, sign RPMs, rsync to oolon, createrepo_c --update, then regenerate packages.json for the UI. New specs are prebuilt-binary variants — they consume the artifact from the build job rather than running cargo at rpmbuild time. Each helexa-neuron-{flavour} package Conflicts with the other flavours and with helexa-neuron (the future source-build stable package) so one flavour is installed at a time on a given host. neuron crate gains cudnn and flash-attn feature flags forwarding to the corresponding candle features, so the CI build command compiles those kernels into the binary. sccache is intentionally NOT used in the prerelease jobs — CUDA compute cap isn't in its cache key, so flavours would mis-hit each other. Each prerelease build is a clean cargo build. Required Gitea secrets (already in place for cortex.spec / COPR workflow): - RPM_SIGNING_KEY, RPM_SIGNING_KEY_ID - RSYNC_SSH_KEY Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:01:35 +03:00
rob thijssen	729317d1ef	feat(neuron): OpenAI-compatible non-streaming chat completion Stage 3 of the candle-native pivot. neuron now serves POST /v1/chat/completions backed by candle's quantized_qwen3 forward pass on a per-model serialised generation loop, returning the standard OpenAI ChatCompletionResponse envelope. Pipeline per request: - Look up the LoadedModel by request.model (404 if absent). - Apply the Qwen3 chat template across all messages. - Tokenize, then spawn_blocking onto tokio's blocking pool to acquire the per-model arch lock and run prefill + greedy/temperature/top-p sampling via LogitsProcessor. - Stop on <\|im_end\|>/<\|endoftext\|> EOS or max_tokens (finish_reason "stop" vs "length"). - Decode with skip_special_tokens=true, build OpenAI response with prompt/completion/total usage counts. Supporting changes: - HarnessRegistry now stores Arc<dyn Harness> and caches a typed Arc<CandleHarness> so inference routes bypass dyn-Trait dispatch. - LoadedModel.arch becomes Arc<Mutex<ModelArch>> so the lock guard can be moved into spawn_blocking. - NeuronState gains an Option<Arc<CandleHarness>> field for the new inference route. - Typed InferenceError lets the handler map ModelNotLoaded → 404 and other failures → 500 without string-matching anyhow messages. - stream=true returns 501 until Stage 4 wires up SSE. - Two leftover mistral.rs string references in proxy.rs and cortex-cli (missed during the Stage 1 sweep) are corrected here. Three new default-feature tests cover the no-candle 503, model-not- loaded 404, and stream=true 501 paths. The cuda-integration test from Stage 2 still covers real load/unload; a streaming-feature gated test exercising actual generation will arrive with Stage 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:47:58 +03:00
rob thijssen	5c2bd1a1da	feat(neuron): wire candle harness load/unload via GGUF Stage 2 of the candle-native pivot. Fleshes out CandleHarness with a LoadedModel registry keyed by model_id, hf-hub-backed GGUF download, and Qwen3 quantized weight construction via candle-transformers' quantized_qwen3 module. unload_model drops the entry; Drop on the candle ModelWeights frees device memory. Device selection prefers CUDA (gated behind the new `cuda` feature), falling back to CPU when CUDA is unavailable so default builds work on non-GPU hosts. The candle CUDA toolchain isn't pulled in unless `--features cuda` is passed, keeping CI green on CPU runners. Config gains a [harness.candle] block with an optional hf_cache path. HarnessRegistry::from_configs now takes HarnessSettings so per-harness config flows through. A gated tests/candle_lifecycle.rs exercises real load → list → unload → list-empty when run with `--features cuda-integration` against a host with HF network access. The default-feature test in tests/api.rs covers the wrong-harness rejection path without needing the network. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:02:49 +03:00
Gitea Actions	b9d8e30058	chore: bump version to 0.1.16	2026-04-16 15:04:21 +00:00
Gitea Actions	9bf987888c	chore: bump version to 0.1.14	2026-04-16 16:57:24 +03:00
Gitea Actions	357f858a29	chore: bump version to 0.1.12	2026-04-16 15:47:21 +03:00
Gitea Actions	7ece281617	chore: bump version to 0.1.10	2026-04-16 15:06:18 +03:00
Gitea Actions	9fa51ad874	chore: bump version to 0.1.8	2026-04-16 10:56:07 +00:00
Gitea Actions	2ce1060cb8	chore: bump version to 0.1.7	2026-04-16 13:25:34 +03:00
Gitea Actions	52c8b4c983	chore: bump version to 0.1.5	2026-04-16 13:01:42 +03:00
Gitea Actions	f161412f91	chore: bump version to 0.1.3	2026-04-16 11:41:11 +03:00
Gitea Actions	7c60af3464	chore: bump version to 0.1.2	2026-04-16 11:03:29 +03:00
rob thijssen	6c238f4557	refactor: rename cortex-neuron binary and crate to neuron All checks were successful CI / Format, lint, build, test (push) Successful in 2m28s Details CI / Build SRPM (push) Has been skipped Details CI / Publish to COPR (push) Has been skipped Details Package name, lib name, and binary all now just "neuron" without the cortex- prefix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 15:51:15 +03:00
rob thijssen	e42e8ee81f	refactor: cortex talks to neurons instead of mistral.rs directly All checks were successful CI / Format, lint, build, test (push) Successful in 2m46s Details CI / Build SRPM (push) Has been skipped Details CI / Publish to COPR (push) Has been skipped Details Replace NodeConfig (static vram_mb, pinned) with NeuronEndpoint. Hardware discovery and model pinning now come from neuron API and models.toml catalogue respectively. - config.rs: nodes -> neurons, add models_config path - catalogue.rs: ModelProfile with pinned_on, ModelCatalogue - poller.rs: poll neuron GET /models (ModelInfo format) - router.rs: resolve inference endpoint via neuron GET /models/{id}/endpoint - evictor.rs: call neuron POST /models/unload - node.rs: remove vram_mb, pinned fields (come from discovery/catalogue) - All 22 gateway tests updated to mock neuron API - Remove MistralModelsResponse, ModelLifecycleRequest (no longer needed) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:42:52 +03:00
rob thijssen	26e5e7ead8	feat: implement mistral.rs harness and neuron model API All checks were successful CI / Format, lint, build, test (push) Successful in 2m30s Details CI / Build SRPM (push) Has been skipped Details CI / Publish to COPR (push) Has been skipped Details - MistralRsHarness: Harness trait impl wrapping mistral.rs HTTP API (list/load/unload models, health check, start/stop via systemd) - HarnessRegistry: maps harness name -> Box<dyn Harness>, built from neuron.toml config - Neuron API endpoints: GET /models, POST /models/load, POST /models/unload, GET /models/:id/endpoint - NeuronConfig: figment-based config loading from neuron.toml - Integration test: full model lifecycle through mock mistral.rs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:29:42 +03:00
rob thijssen	6dc717ebcd	feat: add neuron daemon with GPU discovery and health endpoints All checks were successful CI / Format, lint, build, test (push) Successful in 2m29s Details CI / Build SRPM (push) Has been skipped Details CI / Publish to COPR (push) Has been skipped Details Replace cortex-agent stub with neuron (cortex-neuron binary). cortex-core additions: - discovery.rs: DeviceInfo, DiscoveryResponse, DeviceHealth, HealthResponse - harness.rs: Harness async trait, HarnessConfig, ModelSpec, ModelInfo neuron crate (crates/neuron/): - discovery.rs: nvidia-smi CSV parsing (pure functions) + system discovery via uname/nvidia-smi/nvcc - health.rs: cached GPU health polling every 5s - api.rs: GET /discovery and GET /health axum handlers - main.rs: CLI entrypoint with --port flag (default 9090) - harness stubs for mistralrs (Phase 8) and llamacpp (Phase 11) 12 new tests (9 unit + 3 integration), 35 total. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:23:42 +03:00
rob thijssen	0da68833af	feat: scaffold cortex workspace Rust reverse-proxy for multi-node mistral.rs inference clusters. Includes crate structure (cortex-core, cortex-gateway, cortex-agent, cortex-cli), config loading, OpenAI/Anthropic translation stubs, model routing, eviction, polling, and streaming proxy scaffolding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:13:30 +03:00

23 Commits