cortex

Author	SHA1	Message	Date
rob thijssen	5aac1ffc59	feat(helexa-acp): session resume via session/load
Zed restarts (frequent during helexa-acp dogfooding) used to lose every conversation because we'd ignore the load_session capability and treat every project-reopen as a fresh session/new. Persist sessions to disk and honour session/load so the agent panel comes back where it left off. Storage layout: $XDG_DATA_HOME/helexa-acp/sessions/{session_id}.json Each file holds session_id, cwd, model_id, mode_id, full Message history, plus created/updated timestamps. Atomic save via tempfile+rename so a crash mid-write can't corrupt the store.
Touch points: - src/store.rs (new) — sessions_dir() resolution, save/load via default and explicit-dir entry points (so unit tests don't have to race on XDG_DATA_HOME). 5 unit tests cover round-trip, not-found errors, atomic overwrite, tool-call/result preservation, and the filename sanitiser's path-traversal handling. - src/provider/mod.rs — Serialize/Deserialize on Role, Message, MessageContent, ToolCall. MessageContent::Text turned into a struct variant ({text: ...}) so internally-tagged JSON works. - src/agent.rs — initialize_response advertises load_session: true; handle_load_session reads the file, snapshots in-memory state, returns LoadSessionResponse with the persisted mode preselected; drive_prompt persists at the end of every prompt round under the session lock with the I/O outside the lock. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 13:34:42 +03:00
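The tempfile+rename save and the filename sanitiser described in this commit can be sketched with the standard library alone. This is a minimal illustration, not the code in src/store.rs: the function names (`sanitise_id`, `session_path`, `save_atomic`), the temp-file naming, and the exact character allow-list are all assumptions, and the session body is passed in as an already-serialised string.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical sanitiser: every character outside [A-Za-z0-9_-] becomes '_',
/// so an id such as "../../etc/passwd" cannot leave the sessions directory.
fn sanitise_id(session_id: &str) -> String {
    session_id
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '-' || c == '_' { c } else { '_' })
        .collect()
}

/// Final location of a session file inside an explicitly supplied directory.
fn session_path(dir: &Path, session_id: &str) -> PathBuf {
    dir.join(format!("{}.json", sanitise_id(session_id)))
}

/// Write `body` to a sibling temp file, sync it, then rename it over the
/// final path. Readers see either the old file or the new one, never a
/// half-written file (rename is atomic on the same filesystem on POSIX).
fn save_atomic(dir: &Path, session_id: &str, body: &str) -> io::Result<PathBuf> {
    fs::create_dir_all(dir)?;
    let final_path = session_path(dir, session_id);
    let tmp_path = dir.join(format!(".{}.json.tmp", sanitise_id(session_id)));
    {
        let mut tmp = fs::File::create(&tmp_path)?;
        tmp.write_all(body.as_bytes())?;
        tmp.sync_all()?; // make sure the bytes are on disk before the rename
    }
    fs::rename(&tmp_path, &final_path)?;
    Ok(final_path)
}
```

Taking the directory as a parameter mirrors the explicit-dir entry point mentioned above: a test can point it at a scratch directory without touching XDG_DATA_HOME.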
rob thijssen	ec2b6450b2	feat(helexa-acp): infer tool name from arg shape when model omits it
Qwen3.6-27B occasionally emits a <tool_call> body with the right arguments but no top-level `name` field — observed in the field as mkdir-style bash calls like {"arguments":{"command":"mkdir -p .../doc/plan/{01-discovery,...}"}} with no `name`. The agent had no tool to dispatch and surfaced a Failed card; the model would then hang or retry the same shape. Add a shape-based inference layer: - tools::infer_tool_name(arguments) — given an `arguments` object alone, return Some(name) when the key set uniquely identifies one tool: `{command}` or `{command,cwd}` → bash, `{path,content}` → write_file, `{path,old_text,new_text}` → edit_file.
Ambiguous shapes (`{path}` alone — could be read_file or list_dir) return None so the agent still emits a Failed card rather than guessing. - agent::try_repair_missing_name(raw) — parses a malformed body, applies infer_tool_name, returns (name, args_json) on success. - drive_prompt sweeps malformed_calls through this repair before the Failed-card path. Recovered calls go into tool_buckets at the next free index and dispatch through the normal tool loop. 10 new unit tests in tools::tests cover the inference table plus the verbatim mkdir failure from the field log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 13:14:50 +03:00
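The inference table in this commit maps a set of argument keys to at most one tool. A rough sketch of that lookup follows; it is a simplification that takes the key names as a slice rather than a parsed JSON `arguments` object, so the signature differs from the real tools::infer_tool_name.

```rust
/// Simplified stand-in for the shape-based lookup: given only the argument
/// key names (in any order), return the single tool they identify, or None
/// when the shape is ambiguous or unknown.
fn infer_tool_name(keys: &[&str]) -> Option<&'static str> {
    // Sort and dedup so the match below is insensitive to key order.
    let mut sorted: Vec<&str> = keys.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    match sorted.as_slice() {
        ["command"] | ["command", "cwd"] => Some("bash"),
        ["content", "path"] => Some("write_file"),
        ["new_text", "old_text", "path"] => Some("edit_file"),
        // `{path}` alone could be read_file or list_dir, so do not guess.
        _ => None,
    }
}
```

For example, `infer_tool_name(&["cwd", "command"])` yields `Some("bash")`, while `infer_tool_name(&["path"])` yields `None` and leaves the call on the Failed-card path.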
rob thijssen	a494c8d43c	feat(helexa-acp): repair malformed tool calls and render failures as cards
Two related fixes for cases where Qwen3 sometimes emits slightly-off JSON inside <tool_call> blocks: 1. JSON repair pass in qwen3::parse_tool_call_body — strip up to three trailing extra `}` characters (model overshoots its closing braces), and hoist `name` out of `arguments` when it lands nested instead of as a sibling. Both observed in the field; both trivially repairable; both now dispatch as normal tool calls instead of falling back to the malformed path. 2. New CompletionEvent::MalformedToolCall variant for the cases repair can't fix.
decode_stream now emits it instead of wrapping the raw body in a TextDelta, and agent.rs surfaces each one as a Failed SessionUpdate::ToolCall card (so Zed renders it as a structured failure UI element rather than dumping the body inline) plus a synthetic tool-call/tool-result history pair so the model gets clear feedback for self-correction on the next round. Empty <tool_call></tool_call> blocks are now a no-op too (no Malformed event), matching the existing empty-<think> behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:58:51 +03:00
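The first repair, dropping up to three surplus closing braces, can be illustrated without a JSON library. The sketch below decides what counts as surplus by counting braces outside string literals; that heuristic is an assumption made for the example, and qwen3::parse_tool_call_body may well detect the overshoot differently (for instance by retrying the parse). Both function names are hypothetical.

```rust
/// Net count of `{` minus `}` outside JSON string literals.
/// Negative means the body closes more objects than it opens.
fn brace_balance(s: &str) -> i32 {
    let (mut depth, mut in_string, mut escaped) = (0i32, false, false);
    for c in s.chars() {
        if in_string {
            if escaped {
                escaped = false; // this character was escaped, skip it
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
        } else {
            match c {
                '"' => in_string = true,
                '{' => depth += 1,
                '}' => depth -= 1,
                _ => {}
            }
        }
    }
    depth
}

/// Remove at most three trailing `}` while the body is over-closed.
fn strip_extra_closing_braces(body: &str) -> &str {
    let mut out = body.trim_end();
    for _ in 0..3 {
        if brace_balance(out) < 0 && out.ends_with('}') {
            out = out[..out.len() - 1].trim_end();
        } else {
            break;
        }
    }
    out
}
```

A body that is already balanced is returned unchanged, and a brace inside a string value (such as `"echo }"`) is not counted. A body with four or more extra braces is still over-closed after the pass, which corresponds to the cases that fall through to the MalformedToolCall path.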
rob thijssen	abbedf8d8a	chore(neuron): bump default max_tokens from 512 to 8192
512 is too low for any modern coding model — clients that don't explicitly set max_tokens get clipped responses with no diagnostic. Bump the fallback at all four inference call sites (single-GPU streaming + non-streaming, TP leader + non-leader) to 8192, which fits comfortably within Qwen3-class context windows after a typical agent prompt and lines up with what helexa-acp / a0 / curl clients reasonably expect. Clients that explicitly set max_tokens (now including helexa-acp via HELEXA_ACP_MAX_TOKENS / per-endpoint TOML) override this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:38:28 +03:00
rob thijssen	6cc14e925c	feat(helexa-acp): per-endpoint max_tokens config
The agent was sending max_tokens: None, letting cortex/neuron pick its own default — which trips Zed's "Output Limit Reached" on long turns. Add a per-endpoint max_tokens option in EndpointConfig (TOML key and HELEXA_ACP_MAX_TOKENS env var for the single-endpoint fallback) that the agent threads into every CompletionRequest by endpoint name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:34:23 +03:00
rob thijssen	1c16732668	feat(helexa-acp): route Qwen3 inline <think> blocks to reasoning
Qwen3 emits chain-of-thought as literal <think>...</think> tags inside delta.content rather than via the separate reasoning_content field — so without parsing the markers, the thinking shows up in the message pane as ordinary text. Add a small ThinkParser in qwen3.rs (same chunk-boundary discipline as ToolCallParser) and stage it after the tool-call parser in decode_stream: text events from the tool-call parser are fed in and split into TextDelta / ReasoningDelta. Zed now renders thinking in its dedicated thought UI; visible answer text stays in the message pane. The parking-lot entry from the plan is now closed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:30:25 +03:00
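The "chunk-boundary discipline" mentioned in this commit means a tag such as `<think>` may arrive split across two streamed chunks, so the parser has to hold back any tail that could still turn into a tag. The following is one possible shape for such a state machine; the `Event` enum, the `feed`/`finish` methods, and the internals are assumptions for illustration and are not taken from qwen3.rs.

```rust
/// Output of the sketch parser: visible text or reasoning text.
#[derive(Debug, PartialEq)]
enum Event {
    Text(String),
    Reasoning(String),
}

/// Hypothetical streaming splitter for `<think>...</think>` markers.
#[derive(Default)]
struct ThinkParser {
    in_think: bool,
    pending: String, // unprocessed tail that may end in a partial tag
}

impl ThinkParser {
    /// Feed one streamed chunk and get back whatever can be emitted safely.
    fn feed(&mut self, chunk: &str) -> Vec<Event> {
        self.pending.push_str(chunk);
        let mut events = Vec::new();
        loop {
            let tag = if self.in_think { "</think>" } else { "<think>" };
            if let Some(pos) = self.pending.find(tag) {
                // Emit everything before the tag, drop the tag, flip state.
                let before = self.pending[..pos].to_string();
                self.emit(&mut events, before);
                self.pending.drain(..pos + tag.len());
                self.in_think = !self.in_think;
                continue;
            }
            // No complete tag: keep the longest suffix that is a proper
            // prefix of the tag we are waiting for, emit the rest.
            let keep = (1..tag.len())
                .rev()
                .find(|&n| self.pending.ends_with(&tag[..n]))
                .unwrap_or(0);
            let cut = self.pending.len() - keep;
            let ready: String = self.pending.drain(..cut).collect();
            self.emit(&mut events, ready);
            return events;
        }
    }

    /// Flush any held-back tail at end of stream.
    fn finish(&mut self) -> Vec<Event> {
        let mut events = Vec::new();
        let rest = std::mem::take(&mut self.pending);
        self.emit(&mut events, rest);
        events
    }

    /// Push a non-empty piece as Text or Reasoning depending on state.
    fn emit(&self, events: &mut Vec<Event>, s: String) {
        if s.is_empty() {
            return; // empty pieces (including an empty think block) are a no-op
        }
        events.push(if self.in_think { Event::Reasoning(s) } else { Event::Text(s) });
    }
}
```

Feeding the chunks `"Hi <th"`, `"ink>plan"`, `" it</thi"`, `"nk>done"` produces `Text("Hi ")`, `Reasoning("plan")`, `Reasoning(" it")`, `Text("done")`: the partial tags are never leaked into either stream.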
rob thijssen	5a0861d639	fix(helexa-acp): forward Dispatch::Response to its awaiting router
The catch-all on_receive_dispatch handler was applying respond_with_error to every Dispatch variant, including Response. For Response variants, that call routes the error to the ResponseRouter for the outgoing request — silently overwriting the real reply from Zed with "Internal error: not implemented yet". Every ACP roundtrip we issue (fs/read_text_file, fs/write_text_file, session/request_permission, terminal/*) was therefore returning an error to the tool runner regardless of what Zed actually responded. The model saw uniformly-failing tools, gave up, and confabulated plausible explanations. Fix: pattern-match the Dispatch.
Response → forward to its router via respond_with_result. Request / Notification → keep the "not implemented yet" error response as before. Found via debug logs showing WARN helexa_acp::agent: unhandled ACP message method="fs/read_text_file" right before every tool failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:16:21 +03:00
rob thijssen	33652ac651	feat(helexa-acp): HELEXA_ACP_LOG_FILE env for editor-host logging
Editors that launch ACP agents (Zed today) don't reliably surface the child's stderr — and `args` in an `agent_servers` config is exec-args, not shell, so the usual `&>>` redirect trick doesn't work. Add a HELEXA_ACP_LOG_FILE env var that, when set to an absolute path, routes the tracing subscriber to append-write that file (ANSI off) instead of stderr. RUST_LOG still controls levels. Unopenable paths fall back to stderr with a warning so a typo doesn't silence the agent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 11:47:28 +03:00
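The fallback behaviour described for HELEXA_ACP_LOG_FILE (append to the file when it can be opened, otherwise warn and stay on stderr) can be sketched with plain std I/O. This leaves out the tracing subscriber entirely; the `LogSink` type and `open_log_sink` function are hypothetical, and treating a non-absolute path as "fall back to stderr" is an assumption, since the commit only states what happens for absolute paths.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Where log lines end up in this sketch.
enum LogSink {
    File(std::fs::File),
    Stderr,
}

/// Pick a sink from the (optional) value of the log-file variable.
/// The caller would pass something like
/// `std::env::var("HELEXA_ACP_LOG_FILE").ok().as_deref()`.
fn open_log_sink(log_file: Option<&str>) -> LogSink {
    let path = match log_file {
        Some(raw) if Path::new(raw).is_absolute() => Path::new(raw),
        Some(raw) => {
            // Assumption: relative paths are rejected rather than resolved.
            eprintln!("warning: ignoring non-absolute log path {:?}", raw);
            return LogSink::Stderr;
        }
        None => return LogSink::Stderr,
    };
    // Append mode, creating the file if needed; never truncate old logs.
    match OpenOptions::new().create(true).append(true).open(path) {
        Ok(file) => LogSink::File(file),
        Err(err) => {
            // A typo in the path must not silence the agent.
            eprintln!("warning: cannot open {}: {}; logging to stderr", path.display(), err);
            LogSink::Stderr
        }
    }
}

impl LogSink {
    /// Write one line to whichever sink was selected.
    fn write_line(&mut self, line: &str) -> io::Result<()> {
        match self {
            LogSink::File(f) => writeln!(f, "{}", line),
            LogSink::Stderr => writeln!(io::stderr(), "{}", line),
        }
    }
}
```

The important property is that every failure path returns a usable sink, so a misconfigured path degrades to the previous stderr behaviour instead of dropping logs.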
rob thijssen	c297a54074	chore(helexa-acp): log raw bash output and tool result snippets
Diagnostic for "the tool ran but the model thinks it failed" cases. Logs at debug level: - exec_bash: terminal/create command + cwd, terminal/exit code/signal, terminal/output bytes + truncated flag + 200-char snippet. - dispatch_tool_call: 200-char snippet of every successful result before it's folded back into history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 11:15:26 +03:00
rob thijssen	0121a1930f	feat(helexa-acp): inject and parse Qwen3 Hermes tool format
The OpenAI `tools` API field isn't load-bearing in this stack — neuron's chat template renders only message.content, so tool definitions sent that way never reach the model. Move both sides of the tool conversation into the Qwen3 Hermes wire format the model is actually trained on: - Append a `# Tools` block to the system prompt describing every available function (qwen3::render_tool_block). - Parse `<tool_call>{json}</tool_call>` markers out of the streamed content via a chunk-boundary-safe state machine (qwen3::ToolCallParser), surfacing them as the existing CompletionEvent::ToolCall* events so the agent loop doesn't change.
- Re-serialise assistant turns that called tools with inline `<tool_call>` blocks and tool results as user turns wrapped in `<tool_response>` (qwen3::render_assistant_with_tool_calls, render_tool_response). Verified against cortex+Qwen3.6-27B: the model produces a well-formed `<tool_call>{"name":"list_dir","arguments":{"path":"/tmp"}}</tool_call>` in response to a Hermes-formatted prompt. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 11:06:38 +03:00
rob thijssen	13f4c36aeb	chore(helexa-acp): log outgoing chat-completion body at debug level
Useful for diagnosing "the model isn't using tools" — confirming that helexa-acp is in fact sending the `tools` array (and what messages, system prompt, etc. accompany it) without having to attach a packet capture upstream of cortex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 10:38:10 +03:00
rob thijssen	4a51a54554	fix(helexa-acp): describe Stage 3 tools in the default system prompt
The Stage 2 prompt told the model it had no tools, which models trained for caution then dutifully repeat back ("Stage 2 build: no tools available — I can't read files…"). Stage 3 ships tools in the CompletionRequest.tools array, but the system message was still overriding that. Update the default prompt to list the five tools and instruct the model to use them rather than asking the user to paste contents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 10:33:17 +03:00
rob thijssen	0609f1ac5d	feat(helexa-acp): add tools, session modes, and permission gating
Stage 3 introduces five tools (read_file, write_file, edit_file, list_dir, bash) backed by ACP fs/* and terminal/* calls, a ClientOps trait so the runner is mock-testable, two session modes (default + bypassPermissions) with session/set_mode honouring them, and a tool-call loop in the agent that streams the model, dispatches each call, feeds results back into history, and re-enters until the model finishes or MAX_TOOL_ROUNDS is hit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 10:01:32 +03:00
rob thijssen	96fc379893	feat(helexa-acp): wire ACP agent loop for text-only conversations
Stage 2 lands the agent loop on top of the Stage 1 scaffold: session state with per-session cancellation, a system-prompt builder honouring HELEXA_ACP_SYSTEM_PROMPT_PATH / system_prompt_path TOML, and handlers for initialize / session/new / session/prompt / session/cancel that stream provider output back as session/update notifications. Verified end-to-end against cortex from Zed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 09:46:22 +03:00
rob thijssen	e267f583e1	chore(neuron): rustfmt drift in is_device_fault test
One assert! call grew past the line limit after the previous commits; cargo fmt --all picked it up. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:13:55 +03:00
rob thijssen	e23d5011d0	feat(helexa-acp): scaffold ACP bridge with provider trait + OpenAI chat Adds a new workspace crate `helexa-acp` (binary, Apache-2.0) — the start of "the missing ACP binary" for multi-endpoint LLM setups mixing public APIs, private LAN deployments, and various wire formats. Today it speaks OpenAI /v1/chat/completions; the Provider trait is the seam that lets OpenAI Responses, Anthropic /v1/messages, and other wire formats slot in later without touching the agent loop. The crate is intentionally self-contained — no dependencies on the other workspace crates (cortex-core, cortex-gateway, neuron) — so a future migration to a dedicated GitHub repo is a Cargo.toml-only change. All deps come from crates.io. This commit lands: * `config.rs` — TOML config at $XDG_CONFIG_HOME/helexa-acp/config.toml with multi-endpoint support (each `[[endpoints]]` declares its name, base_url, wire_api, default_model, optional API key / api_key_env). Falls back to env-only single-endpoint config when no TOML exists (HELEXA_ACP_BASE_URL, HELEXA_ACP_MODEL, etc.). The `endpoint:model` selector syntax is validated and tested. * `provider/mod.rs` — `Provider` trait + provider-agnostic types (`CompletionRequest`, `CompletionEvent`, `Message`, `ToolCall`, `ToolSpec`, `Role`, `UsageStats`). Agent loop consumes these without knowing the wire format on the other side. * `provider/openai_chat.rs` — `OpenAIChatProvider` impl. Compatible with cortex, LM Studio, Ollama (compat mode), OpenRouter, OpenAI itself. Streams via reqwest + eventsource-stream + async-stream. Surfaces text deltas, reasoning deltas (for models that emit `reasoning_content`), tool-call lifecycle (start, args-delta, completion), usage, finish reason. Cancellation-token aware. * `main.rs` — tokio + stderr-only tracing-subscriber + Stdio transport. Builds a provider per configured endpoint at startup, surfacing config mistakes before the editor even initializes. 
Currently responds to `initialize`; everything else stubs to `not implemented yet` until the agent loop lands in the next commit. 12 unit tests pass — encoder shape, decoder shape (text-only, tool-call progressive, cancellation, malformed-chunk recovery), config parsing (multi-endpoint TOML, env fallback, validation). The `#![allow(dead_code)]` on `provider/mod.rs` is temporary — the agent loop in the next commit reads every field. It's noted in the module-level docstring so the next reader knows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:13:47 +03:00
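The `endpoint:model` selector syntax mentioned above could be validated along these lines. This is a minimal sketch, not the crate's code: the `Selector` type and `parse_selector` function are hypothetical names, and splitting on the first colon only (so model ids that contain colons survive) is an assumption about the intended behaviour.

```rust
/// Hypothetical parsed form of an `endpoint:model` selector.
#[derive(Debug, PartialEq, Eq)]
pub struct Selector {
    pub endpoint: String,
    pub model: String,
}

/// Split on the first ':' only (an assumption), so a model id that itself
/// contains a colon, such as `qwen3:8b`, stays intact.
pub fn parse_selector(raw: &str) -> Result<Selector, String> {
    let (endpoint, model) = raw
        .split_once(':')
        .ok_or_else(|| format!("selector `{raw}` is missing ':'"))?;
    let (endpoint, model) = (endpoint.trim(), model.trim());
    if endpoint.is_empty() || model.is_empty() {
        return Err(format!("selector `{raw}` has an empty endpoint or model"));
    }
    Ok(Selector {
        endpoint: endpoint.to_string(),
        model: model.to_string(),
    })
}
```

Calling `parse_selector("lan:qwen3:8b")` under these assumptions yields endpoint `lan` and model `qwen3:8b`; a selector with no colon or an empty half is rejected with a message.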
rob thijssen	249b2e5c98	fix(neuron): only poison the model on actual device faults Previously every inference Err — shape mismatch, NaN logits, tokenizer error, missing handle — marked the model poisoned and rejected every subsequent request until an operator unload+reloaded. The benjy incident on 2026-05-27 showed how this misfires: a concurrency bug produced a `broadcast_add: shape mismatch` error that had nothing to do with CUDA, but the model was taken down anyway. Add `is_device_fault(err_chain: &str)` — a conservative classifier that returns false only for errors we know are pre-kernel / CPU-side (shape mismatches, NaN logits, tokenize/detokenize, missing handle, DecodeStream, empty prompt).
Everything else defaults to true so a genuine driver fault still poisons. Applied at all six poisoning sites: - chat_completion CUDA worker path - chat_completion CPU spawn_blocking path - chat_completion_stream CUDA worker path - chat_completion_stream CPU spawn_blocking path - chat_completion_tp non-streaming wrapper - chat_completion_tp_stream spawned task Each site now logs either "model marked poisoned" (device fault) or "model NOT marked poisoned" (non-device) so the journal makes the classification visible. Tests cover the known non-device patterns and a couple of real CUDA driver messages. Pairs with the inference_lock commit (`c59da83`): together they eliminate both the cause of the spurious-poisoning we just observed (the shape mismatch) AND the over-reaction to it (the unconditional poison). Each fix is independently useful but the combination is what makes the system actually robust to concurrent agent workloads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:57:48 +03:00
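A conservative classifier of the kind described in this commit might look like the sketch below. The function name and signature come from the commit message; the substring list is illustrative only and assumed, since the real patterns live in the neuron crate and are not shown here.

```rust
/// Substrings assumed to mark pre-kernel / CPU-side failures.
/// Illustrative only: the actual pattern list is not in the commit message.
const NON_DEVICE_PATTERNS: &[&str] = &[
    "shape mismatch",
    "nan logits",
    "tokeniz", // covers tokenize, detokenize, tokenizer
    "missing handle",
    "decodestream",
    "empty prompt",
];

/// Returns false only for errors known to be CPU-side; anything
/// unrecognised is treated as a device fault so a genuine driver
/// failure still poisons the model.
pub fn is_device_fault(err_chain: &str) -> bool {
    let lower = err_chain.to_lowercase();
    !NON_DEVICE_PATTERNS.iter().any(|p| lower.contains(p))
}
```

The default-to-true shape is the point of the design: a new, unclassified error message errs on the side of poisoning rather than serving from a possibly broken context.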
rob thijssen	c59da83636	fix(neuron): serialise single-GPU inference per loaded model Two concurrent chat_completion requests against the same single-GPU model could interleave their `clear_kv_cache → forward(chunk0) → forward(chunk1) → ...` sequences. The device-worker channel serialises individual jobs but not the sequence boundary, so the cache could end up holding tokens from one request while another's mask was sized for its own prompt — producing a shape mismatch mid-prefill. Observed on benjy 2026-05-27 18:41:05: agent-zero's `memorize memories` and `memorize solutions` extensions fired 4ms apart against Qwen/Qwen3-8B (a0's utility model). Both prefilled into the same KV cache, and request a08b4a's chunk 0 forward produced scores of shape [1, 32, 512, 1024] against a mask of [1, 1, 512, 512] — broadcast_add failed, both requests bubbled the error up, both flipped the model to poisoned. Add `LoadedModel.inference_lock: tokio::sync::Mutex<()>`, mirroring the TpLoadedModel.pool lock that the TP path already held. Acquire it at the start of `chat_completion` and inside the spawned task of `chat_completion_stream` (so the role chunk goes out immediately and only the inference work queues behind the lock). The CPU branch uses `blocking_lock` from inside spawn_blocking; the CUDA branch uses async `.lock().await` inside tokio::spawn. Throughput impact: zero. The GPU was already serialised at the device-worker channel — multiple requests just produced corrupt KV cache state instead of clean serial throughput. The lock makes the existing serialisation honest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:54:04 +03:00
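The difference between serialising individual jobs and serialising a whole request sequence can be sketched with plain threads. This is an illustration under stated assumptions: it uses `std::sync::Mutex` in place of the `tokio::sync::Mutex<()>` the commit adds, and the `Model`, `run_request`, and `run_concurrently` names are hypothetical. The `trace` vector stands in for the shared KV cache, where each push is already serialised but sequences would interleave without the outer lock.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

pub struct Model {
    /// Held for a whole clear + prefill sequence, not per step.
    inference_lock: Mutex<()>,
    /// Stand-in for the shared cache: records (request, chunk) in arrival order.
    trace: Mutex<Vec<(u32, u32)>>,
}

pub fn run_request(model: &Model, request_id: u32, chunks: u32) {
    // Without this guard, two requests could interleave their chunks.
    let _guard = model.inference_lock.lock().unwrap();
    for chunk in 0..chunks {
        model.trace.lock().unwrap().push((request_id, chunk));
        thread::yield_now(); // invite interleaving; the guard prevents it
    }
}

pub fn run_concurrently(requests: u32, chunks: u32) -> Vec<(u32, u32)> {
    let model = Arc::new(Model {
        inference_lock: Mutex::new(()),
        trace: Mutex::new(Vec::new()),
    });
    let workers: Vec<_> = (0..requests)
        .map(|id| {
            let model = Arc::clone(&model);
            thread::spawn(move || run_request(&model, id, chunks))
        })
        .collect();
    for worker in workers {
        worker.join().unwrap();
    }
    let trace = model.trace.lock().unwrap().clone();
    trace
}
```

With the guard in place, every run of `chunks` consecutive trace entries belongs to a single request, in chunk order, regardless of which request won the lock first.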
rob thijssen	f05882369d	fix(neuron): don't poison the model on tokio JoinError panics CUDA driver failures propagate as Err through `?` and become `Ok(Err(InferenceError::Other(_)))` from the spawned task — those are real device faults and still poison the model. Tokio JoinError is different: it fires on Rust-level panic (tokenizer bug, sampler bug, serialisation, the UTF-8 slice that landed in commit `bd04d7f` before the fix) or task cancellation. Those don't touch the device context, so failing the one request without tearing down the model is correct. Two sites changed: - chat_completion's CPU spawn_blocking handler — JoinError no longer sets loaded.poisoned.
- chat_completion_tp's tokio::spawn wrapper — JoinError no longer sets tp_for_marker.poisoned. The inner-Err case still does. Each path logs the cause (panicked / was cancelled / ended abnormally) explicitly so the journal makes the new behaviour obvious — search for "model NOT marked poisoned" to find these events. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:02:52 +03:00
rob thijssen	bd04d7f580	fix(neuron): stream tokens via DecodeStream to avoid UTF-8 panic When BPE byte-fallback splits a multi-byte UTF-8 char (e.g. an emoji) across multiple tokens, the previous "decode the cumulative token list, byte-slice the delta against a stored prefix" pattern would panic with 'start byte index N is not a char boundary; it is inside <emoji>'. The race: at step N the tokenizer renders the partial bytes as U+FFFD (3 bytes); at step N+1 it can decode the complete codepoint (e.g. 4 bytes for 🌫). `decoded_prefix.len()` from step N then lands inside the codepoint in step N+1's `full` string, and `&str[start..]` panics. Replace with tokenizers' `DecodeStream::step(id)` which maintains an internal byte buffer across token boundaries and only emits when a clean codepoint completes. Applied at all three SSE emission sites: - stream_inference_via_worker (single-GPU CUDA stream) - chat_completion_tp_stream's spawned task (TP stream) - run_inference_streaming (CPU stream) The shared emit helper splits into emit_delta (async, mpsc::send) and emit_delta_blocking (sync, mpsc::blocking_send) so each path keeps its existing send semantics. The old emit_chunk helper that did the unsafe full-decode-and-slice is removed entirely. Observed on beast 2026-05-27 17:49:55 — model emitted 🌫 in a tool-call response after a long agent-zero session; the spawned TP stream task panicked at candle.rs:2648. The model itself stayed healthy (no CUDA fault), only the one streaming request died. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:01:24 +03:00
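The idea behind buffering bytes across token boundaries can be shown with the standard library alone. The sketch below is not the tokenizers crate's `DecodeStream`; `Utf8Stream` is a hypothetical stand-in that works on raw bytes rather than token ids, and it only demonstrates the "emit when a clean codepoint completes" behaviour described above.

```rust
/// Accumulates raw bytes and releases only complete UTF-8 text.
/// Hypothetical illustration, not the tokenizers `DecodeStream` API.
#[derive(Default)]
pub struct Utf8Stream {
    pending: Vec<u8>,
}

impl Utf8Stream {
    /// Push the bytes for one token. Returns the text that became
    /// complete, or `None` while still inside a multi-byte codepoint.
    pub fn step(&mut self, bytes: &[u8]) -> Option<String> {
        self.pending.extend_from_slice(bytes);
        let valid = match std::str::from_utf8(&self.pending) {
            Ok(s) => s.len(),
            // Incomplete trailing sequence: only the prefix is emittable.
            Err(e) if e.error_len().is_none() => e.valid_up_to(),
            // Genuinely invalid bytes: substitute U+FFFD and reset.
            Err(_) => {
                let out = String::from_utf8_lossy(&self.pending).into_owned();
                self.pending.clear();
                return Some(out);
            }
        };
        if valid == 0 {
            return None;
        }
        // Keep the unfinished tail buffered for the next step.
        let rest = self.pending.split_off(valid);
        let done = std::mem::replace(&mut self.pending, rest);
        Some(String::from_utf8(done).expect("prefix was validated above"))
    }
}
```

Because nothing is ever sliced out of a previously decoded string, there is no stored byte offset that can land inside a codepoint, which is the failure mode the old cumulative-decode pattern had.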
rob thijssen	1e13889392	feat(neuron): chunked prefill + VRAM/prompt-length pre-flight checks Prevents the OOM-during-prefill → poisoned-context → 5-minute-reload cycle observed on beast under agent-zero workloads. Three changes, all keyed off env-driven knobs so an operator can tune without a rebuild: 1. Chunked prefill (NEURON_PREFILL_CHUNK_TOKENS, default 512). The initial forward is split into N-token windows, each with a monotonically growing offset. KV cache accumulates across chunks exactly as it would under one big prefill; only the final chunk's logits are kept for sampling. Activation memory now scales with chunk size instead of prompt length, so a 13 k-token prompt stops holding tens of GB of intermediate activations live at once.
Wired into all six prefill call sites: - run_inference / run_inference_streaming (CPU path) - run_inference_via_worker / stream_inference_via_worker (CUDA single-GPU through device worker) - chat_completion_tp_inner / chat_completion_tp_stream (TP via WorkerPool) Three helpers — chunked_prefill_local, chunked_prefill_via_worker, chunked_prefill_tp — own the loop shape so the chunking semantics stay identical across paths. Per-chunk debug log shows progress. 2. Max prompt length (NEURON_MAX_PROMPT_TOKENS, default 16384). Requests above the cap return a structured 400 with `code: prompt_too_long` rather than going through the prefill and discovering the limit by OOMing partway through. New InferenceError::PromptTooLong variant. 3. Minimum free VRAM gate (NEURON_MIN_FREE_VRAM_MB, default 1500). If `vram_free_mb` is below the threshold at request start (e.g. another concurrent request is mid-prefill), reject with a clean 503 + `code: insufficient_vram` rather than starting work that will OOM. New InferenceError::InsufficientVram variant. CPU loads (vram=0 sentinel) skip this check. All three gates fire BEFORE any device work, so a rejected request costs ~one tokenisation pass and never touches the worker thread — poison cascades from rejected work are now impossible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 13:46:54 +03:00
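The window arithmetic and the two rejection gates from this commit can be sketched as pure functions. The names `Rejection`, `preflight`, and `prefill_windows` are hypothetical; the defaults (512-token chunks, 16384-token cap, 1500 MB floor) and the `vram = 0` CPU sentinel are taken from the commit message, while the exact comparison operators are an assumption.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Rejection {
    PromptTooLong { tokens: usize, max: usize },
    InsufficientVram { free_mb: u64, min_mb: u64 },
}

/// Gates that run before any device work. A `vram_free_mb` of 0 is
/// treated as the CPU sentinel and skips the VRAM check.
pub fn preflight(
    prompt_tokens: usize,
    max_prompt_tokens: usize,
    vram_free_mb: u64,
    min_free_vram_mb: u64,
) -> Result<(), Rejection> {
    if prompt_tokens > max_prompt_tokens {
        return Err(Rejection::PromptTooLong {
            tokens: prompt_tokens,
            max: max_prompt_tokens,
        });
    }
    if vram_free_mb != 0 && vram_free_mb < min_free_vram_mb {
        return Err(Rejection::InsufficientVram {
            free_mb: vram_free_mb,
            min_mb: min_free_vram_mb,
        });
    }
    Ok(())
}

/// `(offset, len)` windows covering the prompt; offsets grow
/// monotonically and only the last window may be short.
pub fn prefill_windows(prompt_len: usize, chunk: usize) -> Vec<(usize, usize)> {
    assert!(chunk > 0, "chunk size must be positive");
    (0..prompt_len)
        .step_by(chunk)
        .map(|offset| (offset, chunk.min(prompt_len - offset)))
        .collect()
}
```

For a 1300-token prompt with 512-token chunks this produces the windows `(0, 512)`, `(512, 512)`, `(1024, 276)`; a caller would forward each window at its offset and keep only the logits from the last one.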
rob thijssen	6e1c1dd0fc	ci: retry clippy + test up to 3 times on spurious sccache failures sccache occasionally fails mid-compile with race-condition errors that clear on a re-run without any code changes. Rather than tracking that down right now, wrap the two affected steps in a bash loop that retries up to three times with a 5-second pause. Real failures still surface; they just take ~10s longer to fail. fmt is left as a single invocation — it's a one-shot syntactic check, not a build, and isn't subject to the same sccache races. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:55:18 +03:00
rob thijssen	35876954cd	chore(neuron): default tracing filter to info (was info,neuron=debug) Production deployments that want neuron-internal debug detail (e.g. trim_device_pool's per-clear-kv line, slab inserts/drops) override RUST_LOG explicitly via systemd. Defaulting to debug for the whole neuron target produced a lot of journal volume that wasn't useful in the common case. beast already sets RUST_LOG=debug in /etc/systemd/system/neuron.service.d/local.conf, so beast's verbosity is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:47:30 +03:00
rob thijssen	740299bd9d	chore(neuron/beast): switch default-model quant from q5k to q6k q5k produced NaN logits on Qwen/Qwen3.6-27B under candle TP=2 (sampler fell over with "logits unhealthy nan: 248320/248320"). q6k is the quant that worked well in production under mistral.rs on the same hardware, so it's the right baseline for verifying the mempool-trim fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:36:18 +03:00
rob thijssen	cdf0f4e66d	fix(neuron): trim cudarc mempool after clear_kv_cache to release VRAM cudarc's stream-ordered memory pool retains freed blocks (cuMemFreeAsync returns memory to the device's default mempool, not to the OS), so mem_get_info under-reports free VRAM between requests. With Qwen/Qwen3.6-27B TP=2, the second consecutive chat completion saw ~4.5 GB of "missing" free VRAM and either OOMed or tripped cuBLAS into CUBLAS_STATUS_INTERNAL_ERROR depending on quant. Add a cuda-gated trim_device_pool helper that, after each successful clear_kv_cache, synchronizes the context and calls cuMemPoolTrimTo(pool, 0) against the device's default mempool. Failures (no async-alloc support, transient driver errors) are non-fatal and log at debug. The before/after free-VRAM delta is logged so an operator can correlate the trim with the next request's prefill VRAM. ConcatKvCache::reset() in candle-nn 0.10.2 already drops its tensors correctly; the leak was strictly at the cudarc pool layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:36:13 +03:00
rob thijssen	c4954e0eed	docs: per-device worker thread architecture (phase 5 of refactor) Closes the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. CLAUDE.md: - New "Per-device worker thread (neuron)" section under Key design decisions, covering the three load-bearing properties (context locality, drop safety, poisoning blast radius), the CPU-fallback exception, and pointers to the canonical narrative in crates/neuron/src/harness/device_worker/mod.rs's module doc-comment. - New 2026-05-27 addendum dating the migration and naming the four PR commits (Phase 1: `081b532`, Phase 2: `b179204`, Phase 3: `76ab24d`, Phase 4: `b4f3576`).
Same convention as the 2026-04-15 and 2026-05-18 addenda. README.md: - One paragraph in "Node setup" noting the per-device thread pattern with a pointer to CLAUDE.md and the device_worker module. No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 11:15:43 +03:00
rob thijssen	b4f3576d82	refactor(neuron): phase 4 — model loads move onto the device worker Final structural slice of the per-device CUDA context-ownership refactor. The four remaining spawn_blocking sites that did CUDA work on the leader are gone: - Single-GPU GGUF load (`load_arch_gguf` spawn_blocking) → `Job::LoadGguf` dispatched on the worker. - Single-GPU dense load (`load_arch_dense` spawn_blocking) → `Job::LoadDense` on the worker. - TP shard load (`WorkerPool::load_dense_shard` spawn_blocking) → `Job::TpLoadShard`. The dispatch handler reads `state.nccl.comm()` directly — no cross-thread `Arc<Comm>` transfer, no `SendComm` wrapper for this path.
The Phase 2 / Phase 3 bridges that moved freshly-built models across the channel boundary (`Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm`) are removed. Models are now constructed on the worker thread directly; the slab gets populated by `insert_arch` / the inline `tp_models.insert` in dispatch handlers. What this phase preserves: - CPU loads still use `tokio::task::spawn_blocking` against `Arc<Mutex<ModelArch>>`. There's no CUDA context to own on CPU and channel overhead would only add latency. Four `spawn_blocking` references remain in `candle.rs` (load_arch_gguf, load_arch_dense, chat_completion, chat_completion_stream) and all are deliberate CPU-only fallback. - Public API unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical signatures. What this phase removes: - `SendComm` wrapper is no longer used in the load path (the Phase 3 bridge that justified it). It remains in `nccl_state.rs` for the Phase 1–3 era and any future cross-thread Comm move; consider deleting in a follow-up. - `Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm` and their handle convenience methods deleted. - The leader_device parameter on `load_dense_shard` is now `_` — unused since the worker has its own bound device. Removing the arg outright is a public-API change; keeping the underscore prefix preserves the signature and signals deadness without churn. Helper relocation: - `LlamaDense::from_parts` is a new pub(crate) constructor so the worker-thread loader can build a `LlamaDense` without going through the original `load_arch_dense` async function. - `check_dense_config_supported` is bumped to `pub(crate)` for the same reason. Sweep verified: `grep -rn spawn_blocking crates/neuron/src/harness/` returns only CPU-fallback hits in `candle.rs` + doc-comment references to the old design. All four leader-side CUDA `spawn_blocking` sites are gone. fmt + clippy clean; 37 lib tests + all integration tests pass. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:24:38 +03:00
rob thijssen	76ab24d98c	refactor(neuron): phase 3 — TP forward + NCCL state move onto device worker Third slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The leader's `NcclState`, every `Comm::all_reduce` issued by the TP layers, the leader-side KV cache reset, and the TP forward step itself now all run on the per-device worker thread — the same OS thread that bound the leader's `CudaContext` at startup. What this phase changes: - `Job` gains `NcclInit`, `NcclSanity`, `CloneLeaderComm` (Phase 3 bridge — Phase 4 removes), `TransferInTp`, `DropTp`, `TpClearKv`, `TpForwardLogits`. Plus a new `TpHandle(u64)` opaque key.
- `DeviceWorkerState` gains `nccl: NcclState` and `tp_models: HashMap<TpHandle, Box<TpLeaderModel>>` (+ counter). - `WorkerPool` loses its `leader_nccl` field; gains a `leader_worker: Arc<DeviceWorkerHandle>` passed at construction. `init_nccl`, `nccl_sanity_check`, `load_dense_shard`, `generate_step`, `clear_kv_cache` all route their leader-side ops through `Job::Nccl` / `Job::Tp` instead of spawn_blocking against a Mutex-wrapped state. `generate_step` returns `Vec<f32>` instead of a device-resident `Tensor` — the worker copies logits to CPU before reply so the async caller can sample on a CPU candle tensor with zero device-context touch. - `TpLoadedModel.leader_model: Arc<Mutex<TpLeaderModel>>` → opaque `leader_handle: TpHandle`. The boxed `TpLeaderModel` lives in the worker thread's slab; both the model's CUDA tensors and the embedded `Arc<Comm>` clones release on the same thread that allocated them (the Drop semantics constraint cudarc forces). - `Job::CloneLeaderComm` is a Phase 3 bridge: the TP shard load still runs in spawn_blocking and needs the leader's `Arc<Comm>` to build the row-parallel layers' AllReduce ops. The Job clones the Comm out of the worker's NcclState and ships it back as `SendComm`. Phase 4 deletes this bridge when the load itself moves onto the worker. - `Job::NcclInit` and `Job::NcclSanity` are ungated by `cuda` so the no-cuda `NcclState` stubs (which reply with `cuda_feature_not_enabled`) still flow through the same channel uniformly; the cuda-only TP variants (CloneLeaderComm, Transfer/Drop/Clear/Forward Tp) remain gated. What this phase doesn't touch (yet): - TP shard load itself — still spawn_blocking, bridged via `CloneLeaderComm`. Phase 4 moves it to `Job::TpLoadShard` and reads `state.nccl.comm()` directly inside the worker. - Single-GPU model loads — still spawn_blocking, transferred via `Job::TransferIn`. Phase 4 moves them. 
- `device_vram_mb` / `cuda_mem_mb` / `log_construction_complete` helpers — still present, used inside spawn_blocking load closures. Phase 4 cleanup folds them into `dispatch.rs`. `tp/mod.rs::WorkerPool::spawn` gained a required `leader_worker: Arc<DeviceWorkerHandle>` argument. Three external callers were updated: `CandleHarness::load_tp` (passes the cached device worker), `main.rs::tp_smoke` (spawns a fresh worker), and the two `tp_worker_lifecycle*.rs` integration tests. Public API unchanged. fmt + clippy clean; 37 lib tests + all integration tests pass. CUDA-only TP integration smoke deferred to the next deploy on beast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:16:02 +03:00
rob thijssen	b179204fd3	refactor(neuron): phase 2 — single-GPU forward + clear_kv route through device worker Second slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The two spawn_blocking sites in `chat_completion` and `chat_completion_stream` now route through the device worker thread on CUDA loads. CPU loads keep the existing spawn_blocking + `Arc<Mutex<ModelArch>>` path; there's no context to own and the channel hop would only add latency. What this phase changes: - `Job` gains `TransferIn`, `DropArch`, `ClearKv`, `ForwardLogits`.
The worker's dispatch state grows a `HashMap<ArchHandle, Box<ModelArch>>` slab and a `next_handle` counter for minting opaque handles. - `LoadedModel.arch: Arc<Mutex<ModelArch>>` → `Option<Arc<Mutex<>>>`, plus a new `arch_handle: Option<ArchHandle>` field. The two are mutually exclusive: CUDA loads set `arch_handle = Some(_)` after transferring the boxed arch into the worker's slab; CPU loads keep `arch = Some(_)` for the legacy spawn_blocking path. - New `run_inference_via_worker` and `stream_inference_via_worker` drive the prefill + decode loop by sending `Job::ForwardLogits` per step; the worker copies the resulting `[vocab]` logits to a CPU-side `Vec<f32>` before reply, so the async caller never holds a device-resident tensor. `apply_repeat_penalty` and `LogitsProcessor::sample` run on a CPU candle tensor; no context binding side-effects on tokio worker threads. - `logits_health_slice(&[f32])` complements the existing `logits_health(&Tensor)` so the new worker paths can compute health stats directly from the CPU vec. - `unload_model` for the single-GPU CUDA path now sends `Job::DropArch { handle }` to the worker so the `Box<ModelArch>` drops on the thread that allocated its CUDA tensors. The `Drop` runs with the bound context, freeing memory on the right context. What this phase doesn't touch (yet): - TP forward, TP load, NCCL bring-up — still on spawn_blocking. Phase 3. - Single-GPU model load — still spawn_blocking, followed by a `Job::TransferIn` to move the freshly-built `ModelArch` into the worker slab. Phase 4 moves the load itself onto the worker thread and eliminates the bootstrap TransferIn. - The `device_vram_mb` / `cuda_mem_mb` helpers — still present and used by the construction-time logs running inside spawn_blocking loads. Phase 4 cleanup folds them into `dispatch.rs`. Public API unchanged. fmt + clippy clean; 37 lib tests + all integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:55:08 +03:00
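The slab-plus-opaque-handle arrangement described in this phase can be sketched generically. `ArchHandle` and the `next_handle` counter are named in the commit; the `ModelSlab` wrapper, its method names, and the generic parameter `T` (standing in for `ModelArch`) are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Opaque key handed to async callers; the boxed model itself never
/// leaves the thread that owns the slab.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ArchHandle(u64);

/// Hypothetical slab, generic over the model type for illustration.
pub struct ModelSlab<T> {
    models: HashMap<ArchHandle, Box<T>>,
    next_handle: u64,
}

impl<T> ModelSlab<T> {
    pub fn new() -> Self {
        Self { models: HashMap::new(), next_handle: 0 }
    }

    /// Mint a fresh handle and take ownership of the model.
    pub fn insert(&mut self, model: T) -> ArchHandle {
        let handle = ArchHandle(self.next_handle);
        self.next_handle += 1;
        self.models.insert(handle, Box::new(model));
        handle
    }

    pub fn get_mut(&mut self, handle: ArchHandle) -> Option<&mut T> {
        self.models.get_mut(&handle).map(|boxed| &mut **boxed)
    }

    /// Removing the entry runs `T`'s destructor on whichever thread
    /// owns the slab, which is the property the commit relies on.
    pub fn drop_arch(&mut self, handle: ArchHandle) -> bool {
        self.models.remove(&handle).is_some()
    }
}
```

Because callers hold only a `Copy` handle, nothing outside the owner thread can accidentally drop or touch the model; a stale handle simply resolves to `None`.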
rob thijssen	081b532387	refactor(neuron): phase 1 — per-device worker thread, VRAM queries route through it First slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. Adds the infrastructure for a dedicated OS thread per CUDA device that owns the device's `CudaContext` for the daemon's lifetime, and routes the 8 async-context `device_vram_mb()` call sites in candle.rs through it. What this phase changes: - New module `harness/device_worker/` (mod.rs, jobs.rs, dispatch.rs). `DeviceWorkerHandle::spawn(idx)` creates a named OS thread (`cuda-dev-N`), binds `CudaContext::new(idx)` once at startup, and enters a dispatch loop reading `Job`s off a `std::sync::mpsc` channel. 
Replies cross back via `tokio::sync::oneshot::Sender` so async callers await without parking a tokio worker. - Two Job variants: `QueryVram` and `Shutdown`. Phases 2–4 add Forward, ClearKv, NCCL init/sanity, and load variants. - `LoadedModel` and `TpLoadedModel` gain a `worker` field populated at load time by a new `CandleHarness::ensure_device_worker(idx)` method that lazily spawns + caches one worker per device index. - Per-model `query_vram()` convenience method on both struct types so the 8 call sites in chat_completion / chat_completion_stream / chat_completion_tp_inner / chat_completion_tp_stream become `loaded.query_vram().await` (or `tp.query_vram().await`) — same field values logged, just sourced from the owner thread instead of the caller thread. What this phase doesn't touch (yet): - Forward, kv-cache clear, model load, NCCL — still on `spawn_blocking`. Phase 2 moves the single-GPU forward + clear; Phase 3 moves the TP forward + NCCL bring-up; Phase 4 moves the loads and deletes the now- unused `device_vram_mb` / `cuda_mem_mb` helpers. - Public API — unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical shapes. Tests: - 5 new unit tests in `device_worker/mod.rs::tests` cover spawn → query → shutdown round-trip, thread naming, post-shutdown submit returns `Gone`, poisoned flag fast-rejects, and concurrent jobs drain across a Shutdown. CPU build (the only one CI runs) is enough to exercise channel mechanics. - All 37 lib tests + all integration tests pass; fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:40:34 +03:00
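A std-only sketch of the spawn, query, shutdown round-trip covered by the unit tests might look like this. The VRAM numbers are placeholders, the method bodies are assumptions, and the reply channel is `std::sync::mpsc` rather than the `tokio::sync::oneshot` the commit uses; only the names `DeviceWorkerHandle`, `QueryVram`, `Shutdown`, `Gone` and the `cuda-dev-N` thread naming come from the message above.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

pub enum Job {
    /// Reply carries (free_mb, total_mb).
    QueryVram { reply: Sender<(u64, u64)> },
    Shutdown,
}

#[derive(Debug, PartialEq)]
pub enum WorkerError {
    /// The worker thread has exited, so the job can never be served.
    Gone,
}

pub struct DeviceWorkerHandle {
    tx: Sender<Job>,
    join: Option<JoinHandle<String>>,
}

impl DeviceWorkerHandle {
    pub fn spawn(idx: usize) -> std::io::Result<Self> {
        let (tx, rx) = mpsc::channel::<Job>();
        let join = thread::Builder::new()
            .name(format!("cuda-dev-{idx}"))
            .spawn(move || {
                // A real worker would bind the device context here, once.
                for job in rx {
                    match job {
                        // Placeholder numbers; a real worker would ask the driver.
                        Job::QueryVram { reply } => {
                            let _ = reply.send((412, 24_576));
                        }
                        Job::Shutdown => break,
                    }
                }
                // Report the thread's own name so callers can check it.
                thread::current().name().unwrap_or("unnamed").to_string()
            })?;
        Ok(Self { tx, join: Some(join) })
    }

    pub fn query_vram(&self) -> Result<(u64, u64), WorkerError> {
        let (reply, rx) = mpsc::channel();
        self.tx.send(Job::QueryVram { reply }).map_err(|_| WorkerError::Gone)?;
        rx.recv().map_err(|_| WorkerError::Gone)
    }

    /// Stop the worker and return the name the thread ran under.
    pub fn shutdown(&mut self) -> Option<String> {
        let _ = self.tx.send(Job::Shutdown);
        self.join.take().and_then(|j| j.join().ok())
    }
}
```

Once the worker has exited its receiver is dropped, so a later `query_vram` fails at `send` and maps to `Gone` instead of blocking.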
rob thijssen	7c19da9361	feat(neuron): construction-complete vram/config dump + logits health + per-step vram Three additive diagnostics that turn the 2026-05-27 q5k Qwen3.6-27B incident from "guess at KV cache / quant sizes" into "read the journal": 1. Construction-complete summary in TpQwen3_5ForCausalLM::load and TpQwen3ForCausalLM::load. After the last "after layer N" log fires, each rank emits a single info line with: free_mb/total_mb (the number that drops by ~9 GB between per-layer and first-request on beast, with no inference traffic), every resolved config knob (vocab_size, hidden_size, num_layers, head_dim, num_kv_heads, max_position_embeddings), and a per-token KV-cache byte estimate. 
For Qwen3-Next also includes the linear/full-attention layer split so the hybrid architecture's cache cost is unambiguous. 2. Logits health snapshot on sample failure. Today the failure logs "A weight is negative, too large or not a valid number" with no context — was it a NaN cascade, an Inf, a negative weight? `logits_health(&logits)` computes nan/pos_inf/neg_inf/neg counts plus finite_min/max/mean on the failure path (zero cost on the success path) and emits a warn line just before the wrapper's terminal "failed, model marked poisoned" log. Wired into both the prefill and decode sample sites of the non-streaming AND streaming TP chat paths. 3. VRAM snapshot at prefill complete + every decode step. The "prefill complete" info line now carries vram_free_mb so the activations + KV growth from the prefill itself is visible. The per-step trace line gets vram_free_mb too, so an operator running with RUST_LOG=trace can watch headroom shrink token by token. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:04:55 +03:00
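To illustrate the two numeric diagnostics, here is a hedged sketch over a plain `&[f32]`. The struct layout and field names are assumptions, `neg` is assumed to count finite values below zero, and `kv_bytes_per_token` uses the usual estimate for standard attention (a K and a V vector per KV head per layer); whether the commit computes its estimate exactly this way is not stated.

```rust
#[derive(Debug, PartialEq)]
pub struct LogitsHealth {
    pub nan: usize,
    pub pos_inf: usize,
    pub neg_inf: usize,
    /// Finite values below zero (assumed meaning of "neg").
    pub neg: usize,
    pub finite_min: Option<f32>,
    pub finite_max: Option<f32>,
    pub finite_mean: Option<f32>,
}

/// Classify every logit once; meant to run only on the failure path.
pub fn logits_health_slice(logits: &[f32]) -> LogitsHealth {
    let (mut nan, mut pos_inf, mut neg_inf, mut neg) = (0, 0, 0, 0);
    let (mut lo, mut hi) = (f32::INFINITY, f32::NEG_INFINITY);
    let (mut sum, mut finite) = (0.0f64, 0usize);
    for &x in logits {
        if x.is_nan() {
            nan += 1;
        } else if x == f32::INFINITY {
            pos_inf += 1;
        } else if x == f32::NEG_INFINITY {
            neg_inf += 1;
        } else {
            if x < 0.0 {
                neg += 1;
            }
            lo = lo.min(x);
            hi = hi.max(x);
            sum += x as f64;
            finite += 1;
        }
    }
    // Min, max and mean are only meaningful if at least one value was finite.
    let some = |v: f32| if finite > 0 { Some(v) } else { None };
    LogitsHealth {
        nan,
        pos_inf,
        neg_inf,
        neg,
        finite_min: some(lo),
        finite_max: some(hi),
        finite_mean: some((sum / finite.max(1) as f64) as f32),
    }
}

/// Bytes of KV cache added per token: K and V, per attention layer, per KV head.
/// For a hybrid model, pass only the number of full-attention layers.
pub fn kv_bytes_per_token(attn_layers: u64, num_kv_heads: u64, head_dim: u64, dtype_bytes: u64) -> u64 {
    2 * attn_layers * num_kv_heads * head_dim * dtype_bytes
}
```

With illustrative inputs (not a real model config), `kv_bytes_per_token(16, 4, 256, 2)` comes to 65,536 bytes per token.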
rob thijssen	24e20dcb5c	feat(catalogue,gateway): model aliases (helexa/small, helexa/balanced, helexa/large) Operators can now define tier aliases in models.toml: [aliases] "helexa/small" = "Qwen/Qwen3-1.7B" "helexa/balanced" = "Qwen/Qwen3-8B" "helexa/large" = "Qwen/Qwen3.6-27B" A client request for `model: "helexa/small"` is resolved to the concrete model id at routing time. The gateway also rewrites the proxied body's `model` field to the concrete id so neuron sees a name that matches its loaded handle (otherwise the harness rejects the request). Motivated by the finger-in-the-wind benchmark: same "what's the capital of Georgia" probe runs in 2.5s on the 1.7B vs 6.7s on the 27B with identical correctness. 
Aliases let clients pick a latency tier without hardcoding model ids, and let operators swap targets without changing client code. Changes: * cortex-core: `ModelCatalogue` gains `aliases: HashMap<String, String>` + `resolve_alias(&str) -> &str`. Unit tests cover the basic resolution + TOML round-trip. * cortex-gateway: * `RouteDecision` gains `resolved_model_id: String`. `router::resolve` consumes aliases at entry and threads the concrete id through. * Handlers (chat_completions, completions, anthropic_messages streaming + non-streaming) rewrite the body's `model` field with `rewrite_model_in_body` before proxying, using the resolved id for metrics labels, LRU touch, and the body itself. * `/v1/models` (Pass 4) emits each alias as its own entry mirroring the target's `loaded` flag, feasible_on, and locations — clients browsing the endpoint see both names and can pick either. * `models.toml` declares the three tier aliases; `models.example.toml` documents the section as opt-in. * Integration tests verify: end-to-end alias→concrete request flow, alias surfacing in /v1/models, and no-op fall-through for non-alias model ids. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 16:10:41 +03:00
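The alias lookup and its no-op fall-through can be sketched as below. The struct is cut down to the one field the commit names, and single-level resolution (no alias-of-alias chaining) is an assumption. Since the function may return either the map's value or the caller's input, this sketch names one lifetime for both.

```rust
use std::collections::HashMap;

#[derive(Default)]
pub struct ModelCatalogue {
    pub aliases: HashMap<String, String>,
}

impl ModelCatalogue {
    /// Returns the alias target, or the input unchanged when it is not an alias.
    pub fn resolve_alias<'a>(&'a self, model: &'a str) -> &'a str {
        self.aliases.get(model).map(String::as_str).unwrap_or(model)
    }
}
```

A concrete model id passes straight through, so callers can resolve unconditionally without first checking whether the name is an alias.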
rob thijssen	becf61b9c1	feat(script): validate-neuron.sh waits for /health activation=ready Adds wait_for_ready() that polls /health until activation.state flips to "ready" (or the NEURON_LOAD_TIMEOUT deadline). Inserted between probe_health and the is_loaded/trigger_load step. Before this, running validate-neuron.sh right after deploy.sh raced the background pre-warm and failed in ~9 ms with "neuron not reachable" (the pre-2026-05-26 build) or with a partial-load error (the new build, where the listener binds before default_models finishes). The poll prints the in_progress model on each tick so an operator watching the log can see which model is delaying readiness. 
Backs off from 2s to 10s after the first few iterations so a long TP load doesn't spam. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 15:26:21 +03:00
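The script itself is shell, but the poll-with-backoff shape can be sketched in Rust for illustration. The number of fast iterations is not given in the commit, so it is a parameter here; the probe and sleep are injected closures so the loop can be exercised without a network or real waiting.

```rust
use std::time::{Duration, Instant};

/// 2s between polls for the first `fast_iterations` rounds, 10s afterwards.
pub fn poll_delay(iteration: u32, fast_iterations: u32) -> Duration {
    if iteration < fast_iterations {
        Duration::from_secs(2)
    } else {
        Duration::from_secs(10)
    }
}

/// Poll `probe` until it reports ready or `timeout` elapses.
/// Returns true if the probe succeeded before the deadline.
pub fn wait_for_ready<P, S>(mut probe: P, mut sleep: S, timeout: Duration, fast_iterations: u32) -> bool
where
    P: FnMut() -> bool,
    S: FnMut(Duration),
{
    let deadline = Instant::now() + timeout;
    let mut iteration = 0;
    loop {
        if probe() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(poll_delay(iteration, fast_iterations));
        iteration += 1;
    }
}
```

In real use `sleep` would be `std::thread::sleep` and `probe` would fetch `/health` and check `activation.state`.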
rob thijssen	b9e7a76a7a	feat(gateway): surface mid-prewarm models as Loading on /v1/models The poller now fetches /health alongside /models on each neuron and stashes the activation snapshot on NodeState. The /v1/models handler gains a Pass 3 that synthesises Loading locations from each neuron's activation.in_progress and activation.pending lists, so a catalogued model that's mid-prewarm surfaces as `status: "loading"` rather than appearing absent (loaded=false, locations=[]). Without this, a client polling /v1/models during a beast restart sees Qwen3.6-27B disappear for the ~5 minutes the q5k load takes, then reappear. Now it stays visible the whole time with a clear status. Adds ModelStatus::Loading to cortex-core. The router's per-node priority loop gets an explicit (no-op) arm: Loading models aren't routable yet, and falling through to the catalogue cold-load path is the existing race — no worse than before, but tagged as a known follow-up needing neuron-side in-flight tracking on /models/load. New test_poller_captures_activation_from_health exercises the full round-trip: mock neuron with empty /models but a pre_warming /health → poller writes node.activation. Common test helpers gain spawn_mock_neuron_with_models_and_health and default_health_response. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 15:26:12 +03:00
rob thijssen	800498f530	feat(neuron): bind listener before pre-warm, surface activation in /health Two coupled changes addressing the 2026-05-26 validate-neuron failure where a fresh deploy of beast had /health unreachable for ~5 minutes while Qwen3.6-27B q5k materialised, even though systemd reported the unit as active. 1. main.rs no longer awaits load_default_models before binding axum. The listener binds first; pre-warm runs in a spawned background task that holds a read lock on the harness registry for the duration of its sequential load loop. Concurrent on-demand /models/load and /v1/chat/completions traffic still flow. 2. 
/health gains an `activation` field carrying: `state`: `pre_warming | ready`; `pending`: model ids queued but not started; `in_progress`: model id currently loading (Option); `completed`: model ids loaded successfully this activation; `failed`: `[{model_id, error}]` for failed entries. The field is `#[serde(default)]` so a pre-change cortex polling a new neuron — or vice versa — keeps working. `ActivationTracker` (new module `neuron::activation`) owns the RwLock-wrapped state; load_default_models takes a tracker reference and updates it per-model. NeuronState holds an Arc clone for the /health handler. Tests updated to construct trackers and assert state transitions (empty noop, two failures → ready with both in `failed`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 15:18:04 +03:00
rob thijssen	d3f2d50749	feat(deploy): per-host neuron config + pre-warm headline models Adds asset/neuron/{beast,benjy,quadbrat}.toml — per-host neuron.toml files keyed by the first dot-component of the host. deploy.sh now rsyncs the matching file to /etc/neuron/neuron.toml on each neuron and stops+starts the service so default_models is re-read. Headline model per host (drives /v1/models output immediately after a clean deploy): beast: Qwen/Qwen3.6-27B (q5k, tp=2, devices=[0,1]); benjy: Qwen/Qwen3-8B (bf16, devices=[0]); quadbrat: Qwen/Qwen3-1.7B (bf16, devices=[0]). Removes the need to follow deploy.sh with `validate-neuron.sh beast Qwen/Qwen3.6-27B q5k 2` to surface the 27B in the catalogue — the neuron loads it itself on activation. 
The neuron loop now mirrors the cortex flow (stop → install/upgrade → sync config → start) so config-only changes pick up on subsequent deploys; previously a no-package-change deploy would silently leave the host on the old default_models. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 14:05:54 +03:00
rob thijssen	2740e61a23	fix(neuron,candle): name lifetime on acquire_pool_lock Lifetime elision fails when a function has two reference parameters and returns a borrow: rustc can't infer whether the MutexGuard's lifetime ties to `pool` or `model_id`. The non-CUDA build skipped this code path (cfg-gated), so the error only surfaced on the GPU build at https://git.lair.cafe/helexa/cortex/actions/runs/162. The guard borrows the pool, so name the lifetime on `pool` and the return type. `model_id` keeps its independent (elided) lifetime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:37:32 +03:00
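The shape of the fix can be shown with a std-only sketch. The real helper is async and takes the TP pool; here the pool is a `std::sync::Mutex<Vec<u32>>` purely for illustration, and the body is an assumption.

```rust
use std::sync::{Mutex, MutexGuard};

// Without `'a`, rustc rejects this signature (E0106, missing lifetime
// specifier): with two reference parameters, elision cannot choose which
// one the returned guard borrows from.
pub fn acquire_pool_lock<'a>(
    pool: &'a Mutex<Vec<u32>>,
    model_id: &str, // independent, elided lifetime
) -> MutexGuard<'a, Vec<u32>> {
    let _ = model_id; // the real helper would use this for its log lines
    // Recover the guard even if a previous holder panicked.
    pool.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}
```

Because the guard is tied only to `pool`, the caller is free to drop or reuse the `model_id` string while still holding the lock.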
rob thijssen	67f79c868f	fix(neuron,shutdown): time-bound unloads, fast-exit past tokio drain Two failure modes from the 2026-05-26 beast incident: 1. `unload_all_models` looped through models calling `unload_model`, logging individual failures at warn. The cumulative effect was a single warn line for the failed unload then "shutdown complete" — no signal that the model was actually still loaded. Now each unload is bounded by a 20s timeout, failures escalate to error, and a summary "leaving N model(s) loaded" line fires when anything is stuck so the operator knows the OS will reclaim VRAM after exit. 2. 
Returning `Ok(())` from `main` after the unload sweep dropped the tokio runtime, which then waited indefinitely on a CUDA-stuck spawn_blocking thread (the journal's "Stack trace of thread 2951308" — spinning on `cuCtxGetCurrent`). systemd's TimeoutStopSec fired 2 minutes later, SIGABRT, core dump. Replacing the return with `std::process::exit(0)` skips the runtime drain and hands the OS a clean exit code; stuck threads get reaped with the process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:30:06 +03:00
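A rough std-only sketch of the bounded unload sweep follows. The real code is async and uses the harness's own unload calls; the closures, function names and `eprintln!` logging here are stand-ins.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `unload` on its own thread and stop waiting after `limit`.
/// Returns true if it finished in time. A stuck thread is simply left
/// behind; it is reaped when the process exits.
pub fn unload_with_timeout<F>(unload: F, limit: Duration) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        unload();
        let _ = tx.send(());
    });
    rx.recv_timeout(limit).is_ok()
}

/// Unload everything, returning how many models are still loaded afterwards.
pub fn unload_all(unloads: Vec<Box<dyn FnOnce() + Send>>, limit: Duration) -> usize {
    let mut stuck = 0;
    for unload in unloads {
        if !unload_with_timeout(unload, limit) {
            stuck += 1;
            eprintln!("error: unload did not finish within {limit:?}");
        }
    }
    if stuck > 0 {
        eprintln!("leaving {stuck} model(s) loaded");
    }
    stuck
}
```

After such a sweep, calling `std::process::exit(0)` rather than returning from `main` ends the process without waiting for any thread that is still blocked.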
rob thijssen	fc6ef0ee0f	feat(neuron,candle): detect CUDA context poisoning and refuse follow-ups Once a CUDA driver error has hit a forward or kv-cache call, the device's context is unrecoverable in-process — subsequent kernels can hang (the failure mode seen on beast on 2026-05-26), return garbage, or trip another illegal-address. The harness now marks the model poisoned on any forward / spawn_blocking / TP-task failure, refuses further inference against it with a clear "unload and reload" error, and surfaces `status: "poisoned"` on `/models` so an operator running `curl beast:13131/models` (or cortex polling) can see the bad state. Without this, a single OOM on a too-large prefill quietly turned every subsequent request into a stuck wait on the pool lock; with it, the first request fails fast with the driver error in the journal and the client gets a usable 5xx instead of a hung connection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:28:42 +03:00
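The fail-fast behaviour can be sketched with a single atomic flag. The type, method names, the `"loaded"` status string and the exact error text are assumptions; only the `"poisoned"` status and the "unload and reload" wording come from the message above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

pub struct LoadedModel {
    poisoned: AtomicBool,
}

impl LoadedModel {
    pub fn new() -> Self {
        Self { poisoned: AtomicBool::new(false) }
    }

    /// Status string as it might be surfaced on a model listing.
    pub fn status(&self) -> &'static str {
        if self.poisoned.load(Ordering::Acquire) { "poisoned" } else { "loaded" }
    }

    /// Run one forward pass. Any failure marks the model poisoned, and every
    /// later call is refused up front without touching the device again.
    pub fn infer<F>(&self, forward: F) -> Result<Vec<f32>, String>
    where
        F: FnOnce() -> Result<Vec<f32>, String>,
    {
        if self.poisoned.load(Ordering::Acquire) {
            return Err("model is poisoned; unload and reload".to_string());
        }
        forward().map_err(|e| {
            self.poisoned.store(true, Ordering::Release);
            e
        })
    }
}
```

The first failing request still returns the original driver error; only the follow-ups get the generic refusal.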
rob thijssen	1385979e3d	feat(neuron,candle): log per-device VRAM at chat_completion start Every "starting" log line now carries vram_free_mb / vram_total_mb for the request's serving device (the leader device on TP). On the 2026-05-26 incident this would have made the 14k-token prefill OOM diagnosable from the first log line: with ~412 MB free, that prompt was never going to fit, and the operator could have caught the imbalance before the CUDA context got poisoned. `device_vram_mb` mirrors the existing helper in tp_qwen3_5.rs and is kept separate to avoid coupling the inference path to the TP module. TpLoadedModel gains a `leader_device: Device` clone so the request path reads the device without locking the leader model (which would contend with an in-flight forward). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:26:23 +03:00
rob thijssen	0a1cfcd4d0	feat(neuron,candle): req_id spans, terminal failure logs, pool-lock warnings Every chat completion path (single-GPU + TP, streaming + non-streaming) now opens an `info_span!("chat", req_id=…, model=…)`. The fmt subscriber prefixes every event with that span so `grep req_id=…` over journalctl reconstructs one request even when dozens overlap. Every path also emits a terminal log line on both success ("done", with prompt_tokens/completion_tokens/finish_reason/total_ms) and failure ("failed", with full anyhow chain + total_ms). Failures used to vanish silently — a request that hit a CUDA OOM left "starting" in the journal and no further trace. New `acquire_pool_lock` helper replaces the bare `tp.pool.lock().await` in both TP paths. It warns at 2s ("still waiting on pool lock") and re-warns every 2s thereafter, so queued requests stuck behind a deadlocked holder are visible immediately instead of looking like idle silence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:25:11 +03:00
rob thijssen	ea0e0f7911	fix(neuron,tp): log leader forward errors with full context Worker rank failures were already surfaced at WARN, but the leader's own forward Result::Err was silently coerced to a `leader_ok=false` bool. When the leader and a worker both fail together — the typical shape of a CUDA OOM cascading into an illegal-address — the journal showed only the worker side and an operator had to guess what hit rank 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:22:30 +03:00
rob thijssen	aa88d37509	fix(gateway): full observability + stop leaking upstream bodies Comprehensive sweep across cortex-gateway's request handling. Every failure path now emits exactly one structured warn (or error) event on the cortex side with the wire-level detail an operator needs; the API response carries only a generic message plus, where useful, the upstream status code. proxy.rs::forward_request: - warn on network failure (network error, target URL). - warn on upstream non-2xx (status, target URL). Streaming body still passes through to the client; we just can't snippet without breaking the stream. - warn on response-build failure. 
- ProxyError::into_response no longer interpolates the inner error into the API body — generic "upstream request failed" / "failed to build response" instead. handlers.rs::chat_completions, handlers.rs::completions: - warn on missing model field, with handler= label. - warn on route resolve failure with model + error chain. The user-facing 404 keeps the RouteError Display string (which is short, informative, and contains no internal detail beyond the model id and config'd node names). handlers.rs::anthropic_messages: - warn on invalid Anthropic body, on translated-OpenAI serialise failure (which is internal), on route resolve, on upstream network error, on upstream non-2xx (with 512-char body snippet for parse errors), on upstream body read, on response parse. - All warns share consistent field shape: handler, model, node, url, status / error / body as applicable. - API response messages are now uniformly generic. - Adds an info-level "proxying request" log on the non-streaming path so successful proxies are also visible. handlers.rs::proxy_with_metrics: - still calls e.into_response() but proxy::forward_request already warn'd at the wire layer, so no double-log here. Tests: - All 32 existing unit tests + 22 gateway integration tests + 4 new router tests pass. - Tests that asserted on the "no healthy nodes" / "not found" strings still match because RouteError messages are preserved in the 404 user-facing path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 07:17:26 +03:00
rob thijssen	0f00f72b47	fix(router,handlers): strip trailing slash from rewritten URL + log upstream failures Two coupled bugs surfaced after `9b0ed0b`: 1. url::Url::parse("http://host:port").to_string() normalises the empty path to "/", so rewrite_loopback_host was returning "http://beast:13131/". Downstream callers then did format!("{endpoint}/v1/chat/completions") and produced a double-slash path that neuron's axum router 404'd with an empty body. Strip the trailing slash in the rewriter so the endpoint is a clean base string for concatenation. 2. The anthropic_messages handler returned the upstream's empty body to the API caller as `"upstream error: "` with no journal log on the cortex side. Operators had no way to see what happened. 
Add warn-level tracing on both upstream failure paths (network error and non-2xx) with model, node, target URL, status, and a 512-char body snippet. The API response now carries just `"upstream returned <status>"` — the implementation detail lives in the log. Updates the two existing rewrite tests for the no-trailing-slash output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 07:10:39 +03:00
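The double-slash bug and its fix reduce to a few lines of string handling. The helper names below are hypothetical; the sketch only shows why a base URL ending in "/" breaks naive concatenation.

```rust
/// Strip trailing slashes so the endpoint is a clean base for concatenation.
pub fn clean_base(endpoint: &str) -> &str {
    endpoint.trim_end_matches('/')
}

/// Build the chat-completions URL from a base that may or may not end in '/'.
pub fn chat_url(endpoint: &str) -> String {
    format!("{}/v1/chat/completions", clean_base(endpoint))
}

/// The pre-fix behaviour: concatenate without normalising the base first.
pub fn naive_chat_url(endpoint: &str) -> String {
    format!("{endpoint}/v1/chat/completions")
}
```

With the normalised form, `chat_url("http://beast:13131/")` and `chat_url("http://beast:13131")` produce the same path, whereas the naive version yields `//v1/chat/completions` for the first input.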
rob thijssen	9b0ed0b57f	fix(router): rewrite loopback inference URLs to use neuron's host Neuron hardcodes its bind_url as `http://localhost:13131` (it can't reliably know its own externally-resolvable name). When cortex runs on a different host than the neuron it's routing to, blindly proxying to that URL hits localhost on the cortex box instead of the neuron. Cortex already knows each neuron's reachable host from cortex.toml. After fetching the inference URL from `/models/{id}/endpoint`, if the host is a loopback name (localhost / 127.0.0.1 / 0.0.0.0 / ::1), swap it for the configured neuron host. Preserve the port and path from neuron's URL so a future harness serving inference on a different port than the management API still works. 
Adds `url` (already a transitive dep via reqwest) as a direct dep for the URL parsing. Tests cover: localhost rewrite, distinct inference port preservation, non-loopback passthrough, malformed input. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 06:23:47 +03:00
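The rewrite rule described in this commit can be sketched roughly as follows. This is a minimal std-only illustration, not the code from the commit, which per the message parses with the `url` crate. The function names `is_loopback_host` and `rewrite_loopback` and the example host `gpu-node-1` are hypothetical. The hand-rolled parsing only covers simple `scheme://host[:port][/path]` URLs.

```rust
/// True for the host names the commit message treats as loopback.
fn is_loopback_host(host: &str) -> bool {
    matches!(host, "localhost" | "127.0.0.1" | "0.0.0.0" | "::1")
}

/// If `endpoint` points at a loopback host, swap in `neuron_host` while
/// keeping the scheme, port and path. Non-loopback URLs pass through
/// unchanged; input without a scheme or host yields `None`.
fn rewrite_loopback(endpoint: &str, neuron_host: &str) -> Option<String> {
    let (scheme, rest) = endpoint.split_once("://")?;
    if scheme.is_empty() {
        return None;
    }
    // The authority runs up to the first '/', the remainder is the path.
    let (authority, path) = match rest.find('/') {
        Some(i) => rest.split_at(i),
        None => (rest, ""),
    };
    // Split host from port, allowing a bracketed IPv6 literal such as [::1].
    let (host, port) = if let Some(stripped) = authority.strip_prefix('[') {
        let (h, after) = stripped.split_once(']')?;
        (h, after.strip_prefix(':'))
    } else {
        match authority.rsplit_once(':') {
            Some((h, p)) => (h, Some(p)),
            None => (authority, None),
        }
    };
    if host.is_empty() {
        return None;
    }
    if !is_loopback_host(host) {
        return Some(endpoint.to_string());
    }
    let mut out = format!("{scheme}://{neuron_host}");
    if let Some(p) = port {
        out.push(':');
        out.push_str(p);
    }
    out.push_str(path);
    Some(out)
}
```

Called as `rewrite_loopback("http://localhost:13131/v1", "gpu-node-1")`, the sketch keeps port 13131 and the `/v1` path and replaces only the host, which is the property the message calls out for a harness serving inference on a different port than the management API.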
rob thijssen	dc2a803266	fix(rpm): migrate legacy helexa-cortex firewalld service to `cortex`
Adds a %posttrans scriptlet to cortex.spec that:
- Removes the stale /etc/firewalld/services/helexa-cortex.xml left behind by an older packaging stream that named the service `helexa-cortex` and (in some build streams) carried wrong port numbers (9301/9302/9304).
- Walks every active firewalld zone; for any zone where the legacy helexa-cortex service was enabled, swaps it out for the new `cortex` service (which the RPM ships at /usr/lib/firewalld/services/cortex.xml with the right 31313/31314 ports).
- Reloads firewalld so the change takes effect without operator intervention.
Operators on whom this happened were silently dropping inbound connections to cortex on 31313 — the active zone advertised a helexa-cortex service that listed unrelated ports, masking the correctly-defined vendor cortex service. helexa-neuron is unaffected: that spec already ships the vendor service as helexa-neuron.xml (namespaced from day one) and no stale /etc override files exist in the fleet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 06:12:51 +03:00
rob thijssen	e71181499e	feat(stage-8e-3): quantize lm_head in TP Qwen3-Next
TpQwen3_5ForCausalLM::lm_head is now a MaybeQuantLinear. When the load spec has quant set and tie_word_embeddings is false, lm_head's (vocab_size, hidden_size) weight is quantized in-situ at load time along with all the per-layer linears. The non-tied case on Qwen3.6-27B saves ~1.7 GB per rank vs bf16 (248320 x 5120 x 2 bytes = 2.42 GB -> ~700 MB at Q5K) and shaves a small amount of decode latency from the per-token logits matmul. Tied case (tie_word_embeddings=true) keeps the lm_head plain even when quant is set: quantizing the shared tensor would corrupt the embedding lookup, and the tied case already gets the memory win from only holding one copy.
This is the last MaybeQuantLinear hookup in the Qwen3-Next TP path. The dense Qwen3 path (tp_qwen3.rs) is unchanged — defer until it's the bottleneck for a model that actually needs TP at consumer scale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 21:53:14 +03:00
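The memory figures in this commit can be sanity-checked with a short calculation. The sketch below assumes the full, unsharded (248320, 5120) weight and the GGUF Q5_K layout of 176 bytes per 256-weight block (5.5 bits per weight). The function names are hypothetical, and whether the real tensor is split across ranks is not stated in the message.

```rust
/// GGUF Q5_K packs 256 weights into a 176-byte block (5.5 bits per weight).
const Q5K_BLOCK_WEIGHTS: u64 = 256;
const Q5K_BLOCK_BYTES: u64 = 176;

/// Size in bytes of a (rows x cols) matrix stored as bf16 (2 bytes per weight).
fn bf16_bytes(rows: u64, cols: u64) -> u64 {
    rows * cols * 2
}

/// Size in bytes of the same matrix stored as Q5_K blocks.
/// Assumes the element count is a multiple of the block size.
fn q5k_bytes(rows: u64, cols: u64) -> u64 {
    rows * cols / Q5K_BLOCK_WEIGHTS * Q5K_BLOCK_BYTES
}

/// Bytes saved by quantizing the matrix from bf16 to Q5_K.
fn q5k_saving(rows: u64, cols: u64) -> u64 {
    bf16_bytes(rows, cols) - q5k_bytes(rows, cols)
}
```

By this count, `bf16_bytes(248_320, 5_120)` is 2,542,796,800 bytes (about 2.54 GB, or 2.37 GiB) and `q5k_bytes(248_320, 5_120)` is 874,086,400 bytes (about 0.87 GB). The difference is about 1.67 GB, in line with the ~1.7 GB saving quoted in the message. The intermediate figures in the message differ somewhat, which may reflect a different unit or rounding.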
rob thijssen	ee663e5e99	fix(stage-8e-2e): bump quant prefill threshold to M > 64
The M > 8 threshold from 8e-2d activated forward_via_f16 on the test case (M=30) and slightly regressed prefill (143 -> 133 T/s). The dequant cost (~30 MB f16 per linear * ~480 calls per prefill = ~200 ms) eats the cuBLAS GEMM speedup at small M. Move the crossover to M > 64 so short prefills (typical for the validate probe) stay on the GGUF GEMV kernel where per-call cost is comparable but the dequant tax is zero. Long prefills still get the dequant-then-cuBLAS-GEMM path where the GEMM scaling amortises the fixed dequant cost.
Doesn't close the gap to mistralrs's 423 T/s on Q5K prefill — that needs either a dequant cache (gives back the ISQ memory win) or a fused dequant+gemm kernel. Both larger projects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 21:50:45 +03:00
rob thijssen	34f9b77d9d	feat(stage-8e-2d): route quantized matmul by M (prefill vs decode)
MaybeQuantLinear::forward picks between two QMatMul paths:
- M > 8 (prefill): QMatMul::forward_via_f16 dequantises the weight once into f16 and runs a real cuBLAS-backed GEMM. The dequant cost is fixed per call, so it's amortised across the M tokens.
- M <= 8 (decode): QMatMul::forward uses candle's GGUF GEMV kernel on the quantized blocks directly. Requires f32 inputs so we still cast in/out at the boundary in that arm.
Earlier 8e-2c sent everything through the GGUF GEMV kernel, which is excellent at GEMV (decode) but doesn't have a real batched GEMM path; prefill regressed ~4x.
This restores prefill to roughly the bf16 cuBLAS GEMM throughput while keeping the decode gain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 21:15:32 +03:00
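The routing decision in this commit can be sketched as a small selector. This is a minimal illustration under stated assumptions, not the code from the repository: `QuantPath`, `rows` and `select_path` are hypothetical names, and M is assumed to be the product of every activation dimension except the last. The threshold is a parameter so that both values mentioned in the log (8 here, 64 after 8e-2e) can be tried.

```rust
/// Which matmul path a quantized linear layer would take.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QuantPath {
    /// Dequantise the weight to f16 once, then run a dense GEMM (prefill).
    DequantGemm,
    /// Run the GEMV kernel directly on the quantized blocks (decode).
    QuantGemv,
}

/// Number of rows M in the flattened activation: the product of every
/// dimension except the last (hidden) one. An empty shape gives 0.
fn rows(dims: &[usize]) -> usize {
    match dims.split_last() {
        Some((_, leading)) => leading.iter().product(),
        None => 0,
    }
}

/// Pick the dense path only when M is strictly above the crossover, so the
/// fixed dequantisation cost is spread over enough rows to pay for itself.
fn select_path(m: usize, threshold: usize) -> QuantPath {
    if m > threshold {
        QuantPath::DequantGemm
    } else {
        QuantPath::QuantGemv
    }
}
```

With a threshold of 8, `select_path(rows(&[1, 30, 5120]), 8)` takes the dense path for the M=30 test case. With the later threshold of 64 the same input stays on the GEMV path, which is the behaviour change the 8e-2e commit describes.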
rob thijssen	f084aaab8e	fix(stage-8e-2c): cast bf16/f16 activations to f32 around QMatMul
candle's QTensor::cuda_fwd requires f32 inputs; its on-the-fly GGUF dequantize accumulates in f32. The model dtype flowing into MaybeQuantLinear::forward is bf16, so QMatMul::forward errored with "unexpected dtype, expected: F32, got: BF16". Wrap the Quant arm to cast the activation to f32 before the matmul and cast the result back to the input dtype. The cast is a single launch on the activation tensor (small relative to weight traffic); it's the price of in-situ GGUF-style quantization, and what mistralrs does inside its own Linear wrapper. The Plain arm is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 20:05:19 +03:00


217 Commits