cortex

Author	SHA1	Message	Date
rob thijssen	602e8e1471	fix(neuron/candle): source tokenizer.json from base repo when GGUF Some checks failed build-prerelease / Resolve version stamps (push) Successful in 31s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Failing after 50s Details CI / Test (push) Failing after 49s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m32s Details build-prerelease / Build cortex binary (push) Successful in 4m34s Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Build neuron-ampere (push) Successful in 5m9s Details build-prerelease / Build neuron-ada (push) Successful in 4m52s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m36s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s Details GGUF-only HF repos (unsloth/Qwen3--GGUF, Qwen/Qwen3--GGUF) ship the .gguf file but not tokenizer.json — the tokenizer data is embedded in the GGUF metadata itself, and the standalone tokenizer.json lives in the base non-GGUF repo (unsloth/Qwen3-0.6B, Qwen/Qwen3-0.6B, etc.). Live validation against quadbrat hit: HTTP 400 fetch tokenizer.json from unsloth/Qwen3-0.6B-GGUF: HTTP status client error (404 Not Found) resolve_files now derives the tokenizer repo by stripping a `-GGUF` or `-gguf` suffix from the model_id; non-GGUF ids fall through to fetching from the same repo. The error message includes the attempted tokenizer repo id so the next failure (e.g. base repo doesn't exist) is unambiguous. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:16:39 +03:00
rob thijssen	84f5662df1	feat(neuron): OpenAI-compatible SSE streaming chat completions Stage 4 of the candle-native pivot. /v1/chat/completions now switches to text/event-stream when the request sets stream: true, emitting one chat.completion.chunk per generated token followed by the OpenAI [DONE] terminator. Pipeline: - chat_completion_stream creates a bounded mpsc::channel<ChatCompletionChunk>(32), sends the leading role chunk, then spawns a blocking task that acquires the per-model arch lock and runs the streaming generation loop. - run_inference_streaming tracks a cumulative decoded prefix so each chunk's delta.content is the substring added since the last chunk — safe across BPE byte-fallback boundaries that would otherwise split multi-byte UTF-8 chars. - The blocking task aborts cleanly if blocking_send fails (client disconnected), so generation stops when the SSE consumer hangs up. - Final chunk carries finish_reason ("stop" on EOS, "length" on max_tokens). The handler appends data: [DONE] after the channel closes. The Stage 3 streaming 501 placeholder test is repurposed: with the streaming path live, an unloaded model now hits the same 404 surface as the non-streaming path (the model lookup happens first). cortex-gateway's existing proxy is unchanged — it already forwards SSE bytes verbatim from Phase 2 work, so the candle SSE format passes through unmodified. Neuron Cargo.toml gains futures + tokio-stream (both already in workspace deps) for ReceiverStream and stream combinators. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:53:14 +03:00
rob thijssen	729317d1ef	feat(neuron): OpenAI-compatible non-streaming chat completion Stage 3 of the candle-native pivot. neuron now serves POST /v1/chat/completions backed by candle's quantized_qwen3 forward pass on a per-model serialised generation loop, returning the standard OpenAI ChatCompletionResponse envelope. Pipeline per request: - Look up the LoadedModel by request.model (404 if absent). - Apply the Qwen3 chat template across all messages. - Tokenize, then spawn_blocking onto tokio's blocking pool to acquire the per-model arch lock and run prefill + greedy/temperature/top-p sampling via LogitsProcessor. - Stop on <\|im_end\|>/<\|endoftext\|> EOS or max_tokens (finish_reason "stop" vs "length"). - Decode with skip_special_tokens=true, build OpenAI response with prompt/completion/total usage counts. Supporting changes: - HarnessRegistry now stores Arc<dyn Harness> and caches a typed Arc<CandleHarness> so inference routes bypass dyn-Trait dispatch. - LoadedModel.arch becomes Arc<Mutex<ModelArch>> so the lock guard can be moved into spawn_blocking. - NeuronState gains an Option<Arc<CandleHarness>> field for the new inference route. - Typed InferenceError lets the handler map ModelNotLoaded → 404 and other failures → 500 without string-matching anyhow messages. - stream=true returns 501 until Stage 4 wires up SSE. - Two leftover mistral.rs string references in proxy.rs and cortex-cli (missed during the Stage 1 sweep) are corrected here. Three new default-feature tests cover the no-candle 503, model-not- loaded 404, and stream=true 501 paths. The cuda-integration test from Stage 2 still covers real load/unload; a streaming-feature gated test exercising actual generation will arrive with Stage 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:47:58 +03:00
rob thijssen	5c2bd1a1da	feat(neuron): wire candle harness load/unload via GGUF Stage 2 of the candle-native pivot. Fleshes out CandleHarness with a LoadedModel registry keyed by model_id, hf-hub-backed GGUF download, and Qwen3 quantized weight construction via candle-transformers' quantized_qwen3 module. unload_model drops the entry; Drop on the candle ModelWeights frees device memory. Device selection prefers CUDA (gated behind the new `cuda` feature), falling back to CPU when CUDA is unavailable so default builds work on non-GPU hosts. The candle CUDA toolchain isn't pulled in unless `--features cuda` is passed, keeping CI green on CPU runners. Config gains a [harness.candle] block with an optional hf_cache path. HarnessRegistry::from_configs now takes HarnessSettings so per-harness config flows through. A gated tests/candle_lifecycle.rs exercises real load → list → unload → list-empty when run with `--features cuda-integration` against a host with HF network access. The default-feature test in tests/api.rs covers the wrong-harness rejection path without needing the network. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:02:49 +03:00
rob thijssen	3cccc2c56b	refactor(neuron): cut mistralrs/llamacpp, scaffold candle harness Stage 1 of the candle-native pivot. Replaces the external-process harness model (mistralrs over HTTP, llamacpp placeholder) with an in-process Harness trait whose sole implementation is candle. The trait keeps its shape so future engines slot in additively, but start/stop default to no-ops and HarnessConfig drops endpoint and systemd_unit since no harness needs external supervision. Behaviour is unchanged on the wire: load_model returns a "not implemented yet (Stage 2)" error and list_models is empty. The gateway-side proxy, poller, and router are untouched. CLAUDE.md Phase 11 (llama.cpp) and Phase 12 (mistral.rs COPR) are marked superseded; the staged plan lives in ~/.claude/plans/create-a-more-aggressive-calm-naur.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:53:04 +03:00

5 Commits