cortex

Author	SHA1	Message	Date
rob thijssen	7df84fed8f	feat(neuron): Stage A — vision tower load + preprocessor for Qwen3.6 [All checks were successful] Stage A of the vision implementation plan (doc/vision-qwen3_6-spec.md). Builds the vision tower scaffolding that today's silent-drop failure mode (issue #3) needs — the Qwen3.6 ViT loads from `model.visual.`, runs forward producing post-merger LM-side image embeddings, and routes through the device worker via a new `Job::EncodeImage`. No LM splice yet — that's Stage B. Refs #3 (umbrella). Deferred sub-stages tracked as #12 (TP-vision), #13 (27B production deploy), #14 (dynamic resolution), #15 (numerical validation). 
What landed: - A0 — investigation: pulled config.json, preprocessor_config.json, chat_template.jinja, and safetensors index from beast's local Qwen3.6-27B cache. Documented in doc/vision-qwen3_6-spec.md with exact tensor shapes for every `model.visual.` weight. Confirms 27-block ViT with `hidden_size=1152`, `patch_size=16`, `spatial_merge_size=2`, `out_hidden_size=5120`. Vision tower lives in 2 of the 15 safetensors shards. - A1 — deps + scaffolding: added `image = "0.25"` (default-features off, PNG/JPEG/WebP/BMP/GIF) and `base64 = "0.22"` to crates/neuron/Cargo.toml. Created `harness::preprocess` and `harness::arch::qwen3_5::vision` modules. - A2 — preprocess.rs: `decode_data_uri` strips `data:image/...;base64,...` → image bytes → `image::DynamicImage` (rejecting `http(s)://` URLs to avoid SSRF/recursion); `preprocess` resizes to a fixed `PreprocessProfile::qwen3_6()` (448×448), normalises to `[-1, 1]` per the model's mean/std=0.5, emits row-major `(3, H, W)` f32. 9 unit tests covering data URI parse, decode failure paths, grayscale-to-RGB promotion, and the exact-value normalisation contract. - A3 — vision.rs: `VisionTower` struct with `patch_embed: Conv2d`, learned `pos_embed: Embedding`, 27 `VisionBlock`s (pre-LN + multi-head self-attention with fused QKV + GELU-tanh MLP + residuals), and `VisionMerger` (LayerNorm → 2×2 spatial concat → linear_fc1 → GELU-tanh → linear_fc2 to LM hidden_size). Includes the Conv3d→Conv2d fold trick documented at the top of the file — the published patch_embed.proj.weight is 5D `(1152, 3, 2, 16, 16)` but candle 0.10 has no Conv3d; for static images we sum-collapse the temporal axis. Video would need real Conv3d. 5 unit tests including the exact `gelu_pytorch_tanh` reference values from PyTorch. 
- A4 — wire vision into Qwen3_5ForCausalLM: extended `Config` with optional `vision_config: Option<VisionConfig>` and `image_token_id`; `Qwen3_5ForCausalLM::new` now loads the vision tower when present, exposes `has_vision()` and `vision()` so the HTTP layer can advertise capability and so the encode path can reach it. - A5 — device worker `Job::EncodeImage`: new job variant carrying CPU-side `(C, H, W)` pixels. Dispatch handler reconstructs the tensor on the worker's device, calls `arch.encode_image(image)`, copies the result back to CPU as flat `Vec<f32>`. Keeps the "tensors don't escape the worker" invariant. Poisoned-worker drain path handles the new variant. - A6 — dispatch round-trip test: `encode_image_routes_to_dispatch_and_errors_on_unknown_handle` proves the channel/dispatch wiring works end-to-end via the CPU device worker (errors on unknown ArchHandle, which is the expected behaviour without a loaded model — real-weights validation happens in Stage B when the LM splice path exists). CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (all 28 test groups ok, zero failures). New test counts: +9 in preprocess, +5 in vision, +1 in device_worker. Out of scope (deferred): - LM-side splice of image embeddings at `<|image_pad|>` positions → Stage B. - Streaming SSE for vision-bearing chat completions → Stage C. - Reject `image_url` with HTTP 400 for non-vision models / advertise `capabilities` in /v1/models → Stage C. - TP-vision (#12), 27B production deploy (#13), dynamic resolution (#14), numerical validation (#15). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 11:40:47 +03:00
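The A2 normalisation contract and the A3 Conv3d→Conv2d fold can be sketched in plain Rust. This is a minimal illustration on flat `f32` slices, not the commit's candle-based code: the function names `normalise` and `fold_temporal` and the row-major `(out, in, t, kh, kw)` layout are assumptions made for the example.

```rust
/// Map an 8-bit channel value to [-1, 1] using mean = std = 0.5.
/// (Hypothetical helper; the real `preprocess` works on whole images.)
fn normalise(v: u8) -> f32 {
    (v as f32 / 255.0 - 0.5) / 0.5
}

/// Fold a Conv3d kernel, assumed row-major as (out, in, t, kh, kw), into a
/// Conv2d kernel (out, in, kh, kw) by summing over the temporal axis `t`.
/// For a static image repeated across all `t` frames this gives the same
/// output as the 3D convolution, because convolution is linear in the kernel.
fn fold_temporal(
    w: &[f32],
    out_c: usize,
    in_c: usize,
    t: usize,
    kh: usize,
    kw: usize,
) -> Vec<f32> {
    assert_eq!(w.len(), out_c * in_c * t * kh * kw);
    let plane = kh * kw;
    let mut folded = vec![0.0f32; out_c * in_c * plane];
    // `oi` walks the flattened (out, in) pairs.
    for oi in 0..out_c * in_c {
        for ti in 0..t {
            for p in 0..plane {
                folded[oi * plane + p] += w[(oi * t + ti) * plane + p];
            }
        }
    }
    folded
}
```

With the published weight shape `(1152, 3, 2, 16, 16)`, the same fold would yield a `(1152, 3, 16, 16)` kernel. As the commit notes, this only holds for static images; video frames that differ across the temporal axis need a real Conv3d.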
rob thijssen	d0292ed377	feat(cortex): catalogue source field + scheme-qualified /models/load [Some checks failed: CI / Test] Phase 3 of plan-source-aware-loader-preflight. Adds an optional `source` field to `ModelProfile` and threads it through the router's cold-load path so a profile pointing at the helexa registry forwards `helexa:<id>` to neuron's `/models/load` instead of leaving neuron to substitute its `default_source` (typically `huggingface`). Without this, an operator who declares `source = "helexa"` in models.toml would still see neuron fetch from HuggingFace — the catalogue → ModelSpec translation in `profile_to_spec` was dropping the scheme on the floor. 
What lands: - `cortex-core::catalogue::ModelProfile.source: Option<String>`. None is the default and preserves pre-Phase-3 behaviour. - `cortex-gateway::router::qualified_model_id(profile)` — small pure helper, extracted from `profile_to_spec` so it can be unit-tested. Empty-string `source` is treated as None so operators who blank out a previously-set value don't trip a scheme-with-no-scheme failure mode in neuron. - `models.example.toml` documents the new field with a commented-out helexa-scheme example pointing back at neuron.example.toml's matching sources block. Tests: - 2 new unit tests in `cortex-core::catalogue`: source-absent round-trip and source-present round-trip through TOML. - 3 new unit tests in `cortex-gateway::router`: pass-through when None, prefix when Some, pass-through on empty-string source. - ModelProfile literal in catalogue's existing test updated to carry `source: None`. CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (24 test groups ok, zero failures). Completes Phase 3. With Phases 1+2+3 landed: - neuron parses `scheme:org/name`, routes per-source hf-hub Api with disambiguated cache. - preflight returns structured errors before any device allocation. - cortex catalogue declares per-model source jurisdiction and forwards it to neuron. The registry itself (registry.helexa.ai service, MinIO, nginx, mirror fabric) is the next moving piece — landing under a separate project per the design discussion. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 14:53:58 +03:00
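The three behaviours listed for `qualified_model_id` (pass-through on `None`, prefix on `Some`, pass-through on an empty string) fit in a few lines. The sketch below is an assumption about the helper's shape: the `ModelProfile` here is a trimmed-down stand-in with only the two fields the example needs, and the real signature may differ.

```rust
/// Hypothetical, reduced stand-in for `cortex-core::catalogue::ModelProfile`.
struct ModelProfile {
    id: String,
    source: Option<String>,
}

/// Return `scheme:id` when a non-empty source is set, otherwise the bare id.
/// An empty-string source is treated like `None`, so a blanked-out value
/// never produces a malformed `:id` for neuron to reject.
fn qualified_model_id(profile: &ModelProfile) -> String {
    match profile.source.as_deref() {
        Some(scheme) if !scheme.is_empty() => format!("{}:{}", scheme, profile.id),
        _ => profile.id.clone(),
    }
}
```

Keeping the helper pure (no I/O, no router state) is what makes the three unit tests mentioned in the commit cheap to write.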
rob thijssen	d4e1b05956	feat(neuron,cortex-core): source-aware loader (scheme:org/name) [All checks were successful] Phase 1 of plan-source-aware-loader-preflight. Makes neuron's loader treat `huggingface:org/name` and `helexa:org/name` as first-class distinct sources with per-source endpoint + cache, while staying backwards-compatible with bare `org/name` ids. Zero behavior change for existing operator configs. Motivation: helexa is adding an EU-hosted registry (`registry.helexa.ai`) alongside HF. Both speak HF-compatible wire format, but the bytes, jurisdiction, trust root, and cache namespace are distinct. 
The loader needs to disambiguate which registry serves a given model id, and to keep their caches from colliding on disk when both happen to host the same `org/name`. What lands: - `cortex-core::source` — new module. `ModelSourceId { scheme, org, name }` with `FromStr` accepting both `scheme:org/name` and bare `org/name`. `Display` round-trips. `repo_path()` emits the `org/name` half for the hf-hub `Api::model(...)` call regardless of which scheme/endpoint we're hitting. Rejects malformed input with typed `ParseError` variants (empty scheme, missing slash, scheme with `/`, name with `:`, etc.). - `neuron::config::CandleHarnessConfig` gains `default_source: Option<String>` and `sources: HashMap<String, SourceConfig>`. `SourceConfig` mirrors what `hf_hub::ApiBuilder` consumes: endpoint URL, optional `auth_env` (env var name read at startup so secrets stay out of TOML), and optional cache_dir. Defaults synthesise a `huggingface` entry pointing at `https://huggingface.co` with the legacy `hf_cache` field as its cache_dir — so existing configs that only set `hf_cache` keep working unchanged. - `CandleHarness::new(bind_url, &CandleHarnessConfig)` replaces `CandleHarness::new(bind_url, hf_cache)`. Resolves every configured source's auth env var and cache dir up front so `hf_api_for(scheme)` is a pure HashMap lookup on the hot load path. Only the `huggingface` scheme gets the legacy `HF_HUB_CACHE`/`HF_HOME` env-var fallback chain; other schemes resolve to whatever the operator typed. - `hf_api()` -> `hf_api_for(scheme)`. Builds an `hf_hub::Api` with the source's endpoint, cache_dir, and auth token. Errors with a useful message naming the configured schemes when an unknown scheme is requested. - `CandleHarness::load_model` parses `spec.model_id` into a `ModelSourceId`, substitutes `default_source` for bare ids, and threads the parsed source through `preflight`, `resolve_files`, `resolve_dense_files`, `load_arch_gguf`, `load_arch_dense`, and `load_tp`. 
The hf-hub `Api::model()` call now uses `source_id.repo_path()` so registry calls hit the right URL shape regardless of scheme. - `preflight()` signature gains a `&ModelSourceId` parameter (it's the canonical id for log lines and error display); `RepoFetchFailed.model_id` etc. now carry the scheme-qualified form so operator-visible errors echo exactly what was configured. - `neuron.example.toml` documents the new `[harness.candle.sources.*]` table with commented-out examples for `huggingface` (explicit override) and `helexa`. Tests: - 13 new unit tests in `cortex-core::source` covering parse / display round-trip, default-scheme substitution semantics, and every `ParseError` variant. - 6 new unit tests in `neuron::config` covering the `effective_sources` synth (legacy `hf_cache` carry-through, explicit override preservation, helexa-alongside-huggingface) and `effective_default_source` fallback. - 2 new unit tests in `harness::candle::tests` covering multi-scheme `hf_api_for` routing, including the "unknown scheme" error path naming configured schemes. - Preflight integration tests updated to construct `ModelSourceId` and assert against the scheme-qualified error form. CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (all 24 test groups ok, zero failures). Out of scope (Phase 3): - Cortex catalogue `source` field — independent of Phase 1+2, ships when the registry comes online. - `helexa` source endpoint itself — separate project; this PR adds the client-side rails only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 13:42:11 +03:00
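The `scheme:org/name` parsing described for `cortex-core::source` could look roughly like the sketch below. It is an illustration under stated assumptions: the commit gives the field names `scheme`, `org`, `name` and mentions typed `ParseError` variants, but the variant names, the `Option<String>` type for `scheme`, and the exact validation order here are guesses.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, PartialEq)]
struct ModelSourceId {
    scheme: Option<String>, // None for bare `org/name` ids (assumed representation)
    org: String,
    name: String,
}

/// Hypothetical variant names; the commit only says such variants exist.
#[derive(Debug, PartialEq)]
enum ParseError {
    EmptyScheme,
    SchemeHasSlash,
    MissingSlash,
    EmptyOrg,
    EmptyName,
    NameHasColon,
}

impl FromStr for ModelSourceId {
    type Err = ParseError;

    fn from_str(s: &str) -> Result<Self, ParseError> {
        // Split off an optional `scheme:` prefix at the first colon.
        let (scheme, rest) = match s.split_once(':') {
            Some((sch, _)) if sch.is_empty() => return Err(ParseError::EmptyScheme),
            Some((sch, _)) if sch.contains('/') => return Err(ParseError::SchemeHasSlash),
            Some((sch, rest)) => (Some(sch.to_string()), rest),
            None => (None, s),
        };
        let (org, name) = rest.split_once('/').ok_or(ParseError::MissingSlash)?;
        if org.is_empty() {
            return Err(ParseError::EmptyOrg);
        }
        if name.is_empty() {
            return Err(ParseError::EmptyName);
        }
        if name.contains(':') {
            return Err(ParseError::NameHasColon);
        }
        Ok(ModelSourceId { scheme, org: org.to_string(), name: name.to_string() })
    }
}

impl ModelSourceId {
    /// The `org/name` half, independent of which scheme or endpoint is used.
    fn repo_path(&self) -> String {
        format!("{}/{}", self.org, self.name)
    }
}

impl fmt::Display for ModelSourceId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &self.scheme {
            Some(s) => write!(f, "{}:{}/{}", s, self.org, self.name),
            None => write!(f, "{}/{}", self.org, self.name),
        }
    }
}
```

Substituting `default_source` for bare ids would then be a matter of filling in `scheme` when it is `None`, which keeps existing `org/name` configs working unchanged.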
rob thijssen	61adff347a	feat(neuron): preflight placement check with structured errors [Some checks failed: CI / Test; build-prerelease / Build neuron-ada, the three neuron RPM packaging jobs, and Publish cancelled] Phase 2 of plan-source-aware-loader-preflight. Adds a one-RTT placement feasibility check that runs before any device allocation, NCCL handshake, or weight fetch. Replaces today's opaque "fetch config.json … 404" failure mode (when an operator points `tensor_parallel = 2` at a GGUF-only repo) with a structured error that names the failure class and points at the fix. What lands: - `crates/neuron/src/harness/preflight.rs` — new module. 
Classifies a repo's siblings listing into `SourceFormat` (Gguf | DenseSafetensors | Mixed | Empty), applies the tp/quant feasibility table, returns a `PlacementPlan` on success or a typed `PreflightError` on rejection. `PreflightError` is `serde::Serialize` so the HTTP layer can emit the structured shape verbatim; it's `thiserror::Error` so log lines get a single-line Display when downcasting from anyhow. Includes best-effort Levenshtein-nearest suggestion for malformed quant names (the second sharp edge the HauhauCS scenario surfaced — operator writes `q6k` against filenames containing `Q6_K_P`, and today's matcher just says "no GGUF file matching quant"). - `CandleHarness::load_model` — calls `preflight(...)` first thing after the "already loaded" guard, before any `ensure_device_worker` or `resolve_*`. Failure wraps the typed error in `anyhow::Error` so the existing trait surface is unchanged; the HTTP handler and the startup logger downcast to recover the structured form. - `crates/neuron/src/api.rs::load_model` handler — maps `PreflightError` to 422 Unprocessable Entity with `{"error": {"kind": "...", "model_id": "...", "suggestion": "..." }}`. Other failures keep the existing 400 + free-form `format!("{e:#}")` shape. - `crates/neuron/src/startup.rs::load_default_models` — when the failure is a preflight rejection, log as `reason=<kind> detail=<msg>` instead of the opaque `error=<chain>`, so journalctl on beast will now show `reason=tp_requires_safetensors detail="repo is GGUF-only (8 .gguf files); TP requires dense safetensors..."` instead of `error=fetch config.json from HauhauCS/...: 404 Not Found`. Tests: - 18 unit tests in `harness/preflight.rs` covering classifier, quant matching, Levenshtein, error serialization, and the full feasibility table (gguf+tp rejected, gguf+bad-quant suggests nearest, gguf+good-quant ok, dense+tp ok, empty rejected, mixed prefers safetensors). 
- 7 integration tests in `tests/preflight.rs` exercising the network path through an axum mock that serves hf-hub-compatible `/api/models/{org}/{name}/revision/main` payloads. Adds `tempfile` as a dev-dependency for per-test cache dirs. Out of scope (deferred to subsequent phases): - Phase 1 (source-aware loader plumbing — `scheme:org/name` parsing, per-scheme `SourceConfig`, cache disambiguation). Preflight runs against the single configured HuggingFace source today; the scheme threading lands cleanly when Phase 1 ships. - Phase 3 (cortex catalogue source field). - GGUF tensor-parallel loading. Preflight rejects this combination with `TpRequiresSafetensors`; the underlying loader gap is the separate `Helexa` curated-registry / heretic-rs conversation. Refs #4-#9 architectural follow-up; no specific issue closed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 13:24:30 +03:00
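The classifier and the nearest-quant suggestion from the preflight module can be approximated as follows. This is a sketch, not the module's code: `SourceFormat` and its four variants come from the commit message, while the function names, the extension-based classification, the case-insensitive comparison, and the string error kinds (other than `tp_requires_safetensors`, which the commit quotes) are assumptions.

```rust
#[derive(Debug, PartialEq)]
enum SourceFormat {
    Gguf,
    DenseSafetensors,
    Mixed,
    Empty,
}

/// Classify a repo's file listing by extension (assumed heuristic).
fn classify(siblings: &[&str]) -> SourceFormat {
    let gguf = siblings.iter().any(|f| f.ends_with(".gguf"));
    let dense = siblings.iter().any(|f| f.ends_with(".safetensors"));
    match (gguf, dense) {
        (true, true) => SourceFormat::Mixed,
        (true, false) => SourceFormat::Gguf,
        (false, true) => SourceFormat::DenseSafetensors,
        (false, false) => SourceFormat::Empty,
    }
}

/// One slice of the feasibility table: GGUF-only repos cannot be loaded
/// tensor-parallel, and empty repos are always rejected.
fn check_tp(format: &SourceFormat, tensor_parallel: usize) -> Result<(), &'static str> {
    match format {
        SourceFormat::Empty => Err("empty_repo"), // hypothetical kind name
        SourceFormat::Gguf if tensor_parallel > 1 => Err("tp_requires_safetensors"),
        _ => Ok(()),
    }
}

/// Classic two-row Levenshtein edit distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == *cb { 0 } else { 1 };
            let best = (prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1);
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Suggest the available quant name closest to what the operator typed.
fn nearest_quant<'a>(wanted: &str, available: &[&'a str]) -> Option<&'a str> {
    let wanted = wanted.to_ascii_lowercase();
    available
        .iter()
        .copied()
        .min_by_key(|c| levenshtein(&wanted, &c.to_ascii_lowercase()))
}
```

For the scenario in the commit, an operator typing `q6k` against a repo offering `Q6_K_P` would get that name back as the suggestion, which is friendlier than a bare "no GGUF file matching quant".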
rob thijssen	435fd10902	fix(neuron): macro-ify CUDA single-GPU route_token so DecodeStream type stays inferred [All checks were successful] Prerelease build (run 270) failed on commit `cb30383` with: error[E0107]: struct takes 5 generic arguments but 0 generic arguments were supplied --> crates/neuron/src/harness/candle.rs:3554:41 | 3554 | decode_stream: &mut tokenizers::DecodeStream<'_>, | ^^^^^^^^^^^^ The Step-2-era refactor for #6's tool-call extraction added a nested `async fn route_token` inside `stream_inference_via_worker` that named `tokenizers::DecodeStream<'_>` as a parameter type. 
`DecodeStream` actually has a lifetime plus five generic type parameters (`'tok, M, N, PT, PP, D`), which makes naming it explicitly painful — the working approach the CPU path uses is a macro, where the body expands inline at the call site and the decoder type stays inferred. This commit replicates the CPU-side macro for the CUDA worker path. Same shape, just with `.await` calls inside (macros tolerate that since they expand inline into the enclosing async context). Control flow uses a labelled-block + `consumer_alive` flag rather than `return` so the macro stays generic over the surrounding return type. The CPU build (default-feature workspace, what `clippy` and `test` jobs exercise) doesn't compile this `#[cfg(feature = "cuda")]` branch, which is why local CI green-lit it. The cuda-check job should catch this category of breakage now that #cb30383+CI-fix landed; this commit just resolves the actual breakage on the prerelease workflow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 08:59:56 +03:00
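The macro-instead-of-nested-fn approach can be shown with a small synchronous analogue. Everything here is hypothetical: `make_decoder` stands in for a decoder whose concrete type is awkward or impossible to write out (an opaque closure type rather than `tokenizers::DecodeStream`), and the real code additionally contains `.await` calls and the `consumer_alive` flag.

```rust
/// Stand-in for a stateful decoder with a hard-to-name type: it buffers
/// token ids and yields a chunk once two have arrived.
fn make_decoder() -> impl FnMut(u32) -> Option<String> {
    let mut pending: Vec<u32> = Vec::new();
    move |tok| {
        pending.push(tok);
        if pending.len() == 2 {
            let chunk = format!("{}-{}", pending[0], pending[1]);
            pending.clear();
            Some(chunk)
        } else {
            None
        }
    }
}

fn collect_chunks(tokens: &[u32]) -> Vec<String> {
    let mut decoder = make_decoder();
    let mut out = Vec::new();

    // A nested `fn route_token(decoder: &mut ???, ...)` would have to spell
    // out the decoder's type in its signature. The macro body expands inline,
    // so `decoder` and `out` keep their inferred types.
    macro_rules! route_token {
        ($tok:expr) => {
            if let Some(chunk) = decoder($tok) {
                out.push(chunk);
            }
        };
    }

    for &t in tokens {
        route_token!(t);
    }
    out
}
```

The trade-off is that a macro body is only type-checked where it is expanded, which matches the commit's observation that a `#[cfg(feature = "cuda")]` branch can break without the default-feature CI jobs noticing.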
rob thijssen	cb303832bc	feat(neuron): render the model's chat_template with chat_template_kwargs [Some checks failed: CI / CUDA type-check; build-prerelease / Build neuron-ampere, Build neuron-blackwell, Build neuron-ada] Closes #9. Replaces the hardcoded `format_qwen3_prompt` ChatML glue with `minijinja`-driven rendering of the model's own `chat_template` from `tokenizer_config.json`. The request's `chat_template_kwargs` flow into the Jinja context so model-specific levers (Qwen3's `enable_thinking: false`, etc.) actually take effect. ## Implementation - New `harness::chat_template` module with three entry points: - `load_chat_template_alongside(tokenizer_json_path)` — probes `tokenizer_config.json` in the same hf-hub snapshot directory. 
Supports both the canonical string-form `chat_template` and the array-form some tokenizers ship (multi-template models). - `render_chat_template(template, messages, tools, kwargs)` — renders via `minijinja`. Messages flatten into the `[{role, content}]` shape HF templates iterate, with per-message extras (`tool_calls`, `tool_call_id`) preserved. `tools` and `kwargs` add into the Jinja context so templates that reference them work without us interpreting their shape. - `chat_templates_enabled()` reads `NEURON_USE_CHAT_TEMPLATE` (default true). Falsy values force the fallback path everywhere — a kill switch for emergency rollback without a rebuild. - `LoadedModel.chat_template: Option<String>` and the TP equivalent are populated once at load time. `None` (no tokenizer_config.json, parse error, missing field) routes the fallback path silently; logs go through `tracing::debug`/`warn` per condition. - New `build_prompt_for_request(chat_template, request)` wraps the decision: when both the template is present AND the kill switch is off, render with kwargs from `request.extra` (looks up `chat_template_kwargs` and `tools` lazily). On render error → warn + fallback to `format_qwen3_prompt`. Wired into all four current prompt-build sites (single-GPU stream + non-stream, TP stream + non-stream). ## Dependency `minijinja = "2"` with the `builtins`, `json`, and `serde` features. Pure-Rust Jinja2 implementation, ~80KB compiled. Used internally by HF's `tokenizers-rs` for its own chat templating; the API surface we touch (`Environment::add_template` + `Template::render(serde_value)`) is stable. ## Validation strategy I can't byte-compare the new path's output against `format_qwen3_prompt` for live models without GPU (CI doesn't have one). The fallback path and kill switch are the mitigations — a deploy can flip `NEURON_USE_CHAT_TEMPLATE=false` in the neuron service env if the chat template renders surprisingly on Qwen3-8B in production. 
The legacy formatter stays the fail-closed default. ## Scope cuts (documented in module header) - Tool-definition lifting from helexa-acp's system-prompt injection into the chat_template's native tools block is deferred. Today the request's `tools` array threads into the Jinja context, but helexa-acp continues to inject Hermes-format tool descriptions into the system prompt for backwards-compat with non-cortex endpoints. ## Tests 9 unit tests in `chat_template`: kill-switch matrix (truthy / falsy / unset), template loading (string form, array form, missing file, unparseable JSON, missing field), rendering (basic conversation threading, kwargs forwarding, message-extras threading for tool_calls). 215 workspace tests pass; clippy + fmt clean across all workspace features (default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 23:43:11 +03:00
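The kill switch and the render-or-fallback decision described for `build_prompt_for_request` can be sketched without minijinja. The names `parse_switch` and `build_prompt`, the closure-based renderer, and the exact set of falsy strings are assumptions for illustration; only `NEURON_USE_CHAT_TEMPLATE`, its default of true, and the fallback-on-error behaviour come from the commit message.

```rust
/// Interpret the raw env value. Unset means enabled; the falsy spellings
/// accepted here ("0", "false", "no", "off") are an assumed set.
fn parse_switch(raw: Option<&str>) -> bool {
    match raw.map(|v| v.trim().to_ascii_lowercase()) {
        None => true,
        Some(v) => !matches!(v.as_str(), "0" | "false" | "no" | "off"),
    }
}

/// Read the switch from the environment.
fn chat_templates_enabled() -> bool {
    parse_switch(std::env::var("NEURON_USE_CHAT_TEMPLATE").ok().as_deref())
}

/// Render with the model's template when one is present and the switch is
/// on; otherwise, or on any render error, use the legacy formatter.
/// `render` and `fallback` are hypothetical stand-ins for the minijinja
/// call and `format_qwen3_prompt`.
fn build_prompt(
    template: Option<&str>,
    enabled: bool,
    render: impl Fn(&str) -> Result<String, String>,
    fallback: impl Fn() -> String,
) -> String {
    match template {
        Some(t) if enabled => render(t).unwrap_or_else(|_err| fallback()),
        _ => fallback(),
    }
}
```

A caller would pass `chat_templates_enabled()` as `enabled`. Separating the string parsing from the environment read keeps the truthy/falsy/unset matrix testable without mutating process-wide state.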
rob thijssen	44008358c5	feat(neuron): emit response.in_progress between created and output_item.added [Some checks failed: CI / Test; CI / CUDA type-check; build-prerelease / Build neuron-blackwell, Build neuron-ampere; remaining neuron build, packaging and publish jobs cancelled] Refs #7. OpenAI's Responses API spec emits `response.in_progress` between `response.created` and the first output-item event to mark "request validated, model is generating". Some Responses-API clients distinguish loading-spinner vs streaming-spinner UI based on which event arrived last; emitting both keeps the wire shape matched. Carries the same shell as `response.created` (status=in_progress, empty output, no usage yet) — both events are payload-light bookkeeping, distinguished only by the event name. 
The hosted-tool event families remaining in #7 (web_search_call, code_interpreter_call, file_search_call, image_generation_call) stay deferred until the underlying tools exist in neuron. Updated `full_stream_emits_expected_event_sequence` to assert the new event lands in position 1; downstream indexing shifted by one across the existing test assertions. CI green, fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 23:30:34 +03:00
rob thijssen	fc9a8c42a3	feat(neuron): extract `<tool_call>` blocks to structured tool_calls deltas [Some checks failed: CI / CUDA type-check; CI / Clippy and CI / Test waiting to run; build-prerelease / Build cortex binary blocked; remaining build, packaging and publish jobs cancelled] Closes #6. Same model-agnostic seam as #8 but for tool-call markers (`<tool_call>` / `</tool_call>` on Qwen3-Coder, Hermes-format, DeepSeek-Coder, gpt-oss, …). Lets Zed's tool-use feature and any other vanilla OpenAI chat client get structured `tool_calls` deltas out of cortex without having to parse markers themselves. ## Implementation 1. Tokenizer probe at load time (`detect_tool_call_token_pair` in `wire::event`) — same shape as the reasoning-marker probe from #8. Both open AND close must resolve to single token ids; non-tool-use models get `None` and pass through unchanged. 
Stored on `LoadedModel.tool_call_tokens` and the TP analogue. 2. New `InferenceEvent::ToolCall` variant — carries `index` (call slot, per-turn counter), generated `id` (`call_<hex>_<idx>`), `name`, and the complete `arguments` JSON string. One event per parsed call. 3. Token-level state machine in all three streaming paths (CPU `run_inference_streaming`, CUDA single-GPU `stream_inference_via_worker`, CUDA TP `chat_completion_tp_stream`) layered on top of #8's reasoning routing: - `<tool_call>` token → enter buffering state, clear buffer. - Tokens while buffering → accumulate into `tool_call_buf` via the decoder (so multi-byte UTF-8 still buffers correctly) without emitting anything visible. - `</tool_call>` token → take the buffer, parse with `parse_tool_call_body` (extract `name` + `arguments`), emit a structured `ToolCall` event with a fresh `call_<hex>` id and the parsed fields. - On parse failure → fall back to re-emitting the original `<tool_call>{buf}</tool_call>` block as plain text content so helexa-acp's existing `ToolCallParser` repair passes still have a chance to recover the call. 4. OpenAI chat projector emits the OpenAI streaming `tool_calls` delta shape on `InferenceEvent::ToolCall` — `{tool_calls: [{index, id, type:"function", function:{name, arguments}}]}`. One chunk per call slot. 5. OpenAI Responses projector drops `ToolCall` events for now (Responses-side function_call event family routing tracked under #7); the chat path is what unblocks Zed's tool use today. ## Acceptance - Vanilla OpenAI chat clients (Zed's tool-use feature, any other OpenAI-compatible tool-call consumer) get structured tool_calls deltas against cortex+neuron without having to parse `<tool_call>` markers in content. - helexa-acp continues to work — when neuron parses cleanly, it consumes the structured deltas through its existing decoder. 
When the model emits malformed JSON, neuron falls back to text pass-through and helexa-acp's `ToolCallParser` recovers via the same path it always did. - Models without tool-call markers in their tokenizer pass through unchanged. - No hardcoded model knowledge — entirely driven by tokenizer metadata. ## Tests 2 new detection tests in `wire::event` (Qwen3-style marker detection, no-marker case). The streaming paths themselves stay covered by the existing chat-completions integration tests; full end-to-end exercise of the new path requires GPU-loaded models and lives outside the CI test surface. 215 workspace tests pass; clippy + fmt clean across the workspace. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 23:26:31 +03:00
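The buffering state machine described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `candle.rs`: `ToolCallRouter`, `Event`, and the toy `parse_body` (which accepts a made-up `name|{json}` body instead of the real tool-call JSON) are all hypothetical names, and the generated `call_<hex>` id is left out.

```rust
/// Events the sketch emits. Shapes are assumed, loosely following the commit text.
#[derive(Debug, PartialEq)]
enum Event {
    Text(String),
    ToolCall { index: usize, name: String, arguments: String },
}

/// Hypothetical stand-in for `parse_tool_call_body`.
/// Accepts a toy `name|{json}` body; the real parser reads the model's JSON.
fn parse_body(body: &str) -> Option<(String, String)> {
    let (name, args) = body.split_once('|')?;
    let (name, args) = (name.trim(), args.trim());
    if name.is_empty() || !args.starts_with('{') || !args.ends_with('}') {
        return None;
    }
    Some((name.to_string(), args.to_string()))
}

struct ToolCallRouter {
    open: u32,
    close: u32,
    buffering: bool,
    buf: String,
    next_index: usize,
}

impl ToolCallRouter {
    fn new(open: u32, close: u32) -> Self {
        ToolCallRouter { open, close, buffering: false, buf: String::new(), next_index: 0 }
    }

    /// Feed one sampled token (id plus decoded text); returns what to emit, if anything.
    fn push(&mut self, id: u32, text: &str) -> Option<Event> {
        if id == self.open {
            // Open marker: start buffering and emit nothing visible.
            self.buffering = true;
            self.buf.clear();
            return None;
        }
        if id == self.close && self.buffering {
            self.buffering = false;
            let body = std::mem::take(&mut self.buf);
            return Some(match parse_body(&body) {
                Some((name, arguments)) => {
                    let index = self.next_index;
                    self.next_index += 1;
                    Event::ToolCall { index, name, arguments }
                }
                // Parse failure: re-emit the original block as plain text.
                None => Event::Text(format!("<tool_call>{body}</tool_call>")),
            });
        }
        if self.buffering {
            self.buf.push_str(text);
            None
        } else {
            Some(Event::Text(text.to_string()))
        }
    }
}
```

Feeding the open token, some body tokens, and the close token yields a single `Event::ToolCall`; a body that fails to parse comes back as one `Event::Text` carrying the original markers, which is the fallback the commit relies on for downstream repair.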
rob thijssen	7733eecba5	feat(neuron): strip reasoning from chat completions by default
Closes #8. Reasoning-capable models (Qwen3, DeepSeek-R1, gpt-oss, Mistral Magistral, …) emit `<think>...</think>` blocks inline in their content stream. The chat-completions wire format has no slot for reasoning, so until this change every consumer either parsed the markers themselves (helexa-acp) or wrote the raw scratchpad content into their UI (Zed's commit-message generator — visible as the leaked reasoning block on every generated commit message against benjy's Qwen3-8B). ## Implementation, model-agnostic by design The neuron side now does token-level routing without any hardcoded model knowledge: 1.
At load time (`detect_reasoning_token_pair` in `wire::event`), probe the tokenizer's vocabulary for a known reasoning-marker pair: `<think>` / `</think>` (Qwen3, DeepSeek-R1, gpt-oss), `[THINK]` / `[/THINK]` (Mistral Magistral), and a couple of derivatives. Each marker must resolve to a single token id; if both open and close resolve, stash on `LoadedModel.reasoning_tokens` (similarly `TpLoadedModel`). Non-reasoning models get `None` and pass through unchanged. 2. At inference time, the three streaming paths (`run_inference_streaming` CPU, `stream_inference_via_worker` CUDA single-GPU, `chat_completion_tp_stream` CUDA TP) now check each sampled token against the pair via the new `handle_reasoning_marker` helper before feeding it to the detokeniser. Open marker → set `in_reasoning = true`, drop the marker. Close marker → unset, drop. Other tokens go through `emit_delta(_blocking)` which now picks `ReasoningDelta` or `TextDelta` based on state. Markers never appear in the streamed output. 3. In `wire::openai_chat`, the projector splits into: - `project_chat_stream` (unchanged signature; default behaviour — drops `ReasoningDelta`) - `project_chat_stream_with(rx, …, ChatProjectionConfig)` — when `include_thinking: true` and `reasoning_markers: Some(_)`, re-wraps reasoning content with the literal open/close marker text and emits as content deltas. Preserves the on-the-wire shape that helexa-acp's `ThinkParser` expects. 4. HTTP handler reads `x-include-thinking: true` (case-insensitive `1`/`true`/`yes`) from the request headers and threads it into the projection config. cortex-gateway already forwards arbitrary headers verbatim, so the opt-in works end-to-end without gateway changes. 5. helexa-acp's `openai_chat` provider sets `x-include-thinking: true` on every request so its existing `ThinkParser` keeps receiving the marked content stream. `ThinkParser` itself is unchanged — needed for endpoints that aren't reasoning-aware (OpenRouter, OpenAI directly, etc.). 
## Acceptance - Zed's commit-message generator (vanilla chat-completions client, no `x-include-thinking`) gets clean commit messages with no `<think>` block. - helexa-acp sessions continue to render thinking in Zed's thought UI via the opt-in path. - Models without reasoning tokens declared in their tokenizer pass through unchanged. - Implementation contains zero references to "qwen3" or any specific model — entirely driven by tokenizer metadata. ## Tests 9 new tests in `wire::event` (token-pair detection across 4 marker conventions, edge cases) and `wire::openai_chat` (default drop, opt-in re-wrap with multi-chunk reasoning, close-marker on Finish, fallback when markers absent, off-switch with markers present). All 213 workspace tests pass; fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 17:55:04 +03:00
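As a rough sketch of the routing and projection this commit describes, assuming synchronous slices in place of the real channels: `ReasoningRouter`, `Delta`, `project`, and `include_thinking` are hypothetical names, and the `<think>` / `</think>` literals are hardcoded here only for brevity (the commit takes them from tokenizer metadata).

```rust
#[derive(Debug, PartialEq)]
enum Delta {
    Reasoning(String),
    Text(String),
}

struct ReasoningRouter {
    open: u32,
    close: u32,
    in_reasoning: bool,
}

impl ReasoningRouter {
    /// Markers flip the state and are dropped; other tokens are tagged by state.
    fn route(&mut self, id: u32, text: &str) -> Option<Delta> {
        if id == self.open {
            self.in_reasoning = true;
            return None;
        }
        if id == self.close {
            self.in_reasoning = false;
            return None;
        }
        Some(if self.in_reasoning {
            Delta::Reasoning(text.to_string())
        } else {
            Delta::Text(text.to_string())
        })
    }
}

/// Truthy values for the `x-include-thinking` header, compared case-insensitively.
fn include_thinking(header: Option<&str>) -> bool {
    match header {
        Some(v) => matches!(v.trim().to_ascii_lowercase().as_str(), "1" | "true" | "yes"),
        None => false,
    }
}

/// Default projection drops reasoning; the opt-in path re-wraps it in marker text.
fn project(deltas: &[Delta], include: bool) -> String {
    let mut out = String::new();
    let mut open = false;
    for delta in deltas {
        match delta {
            Delta::Reasoning(t) if include => {
                if !open {
                    out.push_str("<think>");
                    open = true;
                }
                out.push_str(t);
            }
            Delta::Reasoning(_) => {}
            Delta::Text(t) => {
                if open {
                    out.push_str("</think>");
                    open = false;
                }
                out.push_str(t);
            }
        }
    }
    if open {
        // Stream finished while still reasoning: close the block.
        out.push_str("</think>");
    }
    out
}
```

With `include` false the projected text contains only the visible answer, which is the behaviour a vanilla chat-completions client would see; with `include` true the reasoning comes back wrapped in marker text for a parser such as helexa-acp's `ThinkParser`.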
rob thijssen	fdc0adb738	docs(helexa-acp): README + example config for end-user onboarding
Stage 7. Walks a new user from "never heard of helexa-acp" to "chatting via Zed against helexa or a public API in 10 minutes": - crates/helexa-acp/README.md — install (from source / COPR), quick-start env-var path, multi-endpoint TOML, full Zed setup, endpoint cookbook (cortex/neuron, OpenAI, Anthropic, OpenRouter, LM Studio, multi-cortex), three session modes (Default / Bypass / Plan) with their tool tables, tool surface + path-handling rules, session resume, context compaction, troubleshooting for the five failure modes a new user is likely to hit, and architecture reference for contributors.
- helexa-acp.example.toml — copy-paste-and-edit starter config at the repo root, mirroring the existing cortex.example.toml / neuron.example.toml pattern. No code changes. fmt + clippy clean as a sanity check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 14:25:56 +03:00
rob thijssen	8fa1d1962e	feat(helexa-acp): anthropic-messages provider
Stage 6b. Third provider impl, completing the wire-format trio (openai-chat, openai-responses, anthropic-messages). Lets a helexa-acp endpoint configured with `wire_api = "anthropic-messages"` drive Claude models — either against Anthropic directly or via cortex's /v1/messages translation surface. ## Encoder (CompletionRequest → Anthropic body) - System messages flatten to the top-level `system` field (concatenated with blank lines when there are multiple). - User text → `{role:"user", content:"..."}`.
- User MultiPart (text + images) → `content` array with Anthropic's distinct image shape: `{type:"image", source:{type:"base64", media_type, data}}` — structurally different from OpenAI's `image_url` data URI. - Assistant text → `{role:"assistant", content:"..."}`. - Assistant tool_calls → `content` array with optional `{type:"text"}` block plus one `{type:"tool_use", id, name, input:<parsed json>}` per call. The internal arguments JSON string is parsed back to a Value before encoding (Anthropic requires the parsed form); malformed JSON falls back to a String input so the request body still serialises. - Tool result → `{role:"user", content:[{type:"tool_result", tool_use_id, content}]}` per Anthropic's convention (no separate `tool` role). - `max_tokens` is required by Anthropic; defaults to 8192 when the request doesn't specify. ## Decoder (Anthropic SSE → CompletionEvent) Named SSE events: - `message_start` → captures input_tokens from `usage` for the eventual UsageStats. - `content_block_start` (type=text) → TextDelta (initial text, if any). - `content_block_start` (type=tool_use) → ToolCallStart; if a pre-buffered `input` is present, also emits a single ToolCallArgsDelta. - `content_block_start` (type=thinking, for extended-thinking models) → ReasoningDelta. - `content_block_delta` (text_delta) → TextDelta. - `content_block_delta` (input_json_delta) → ToolCallArgsDelta, correlated by block index. - `content_block_delta` (thinking_delta) → ReasoningDelta. - `message_delta` → Usage (final output_tokens) + Finish with stop_reason mapped: end_turn/stop_sequence → "stop", max_tokens → "length", tool_use → "tool_calls". - `message_stop` → stream terminates. - `ping` ignored (Anthropic's keep-alive). - `error` → yields Err and ends the stream. ## Wiring - Authentication: `x-api-key` + `anthropic-version: 2023-06-01` headers (not Bearer). Both ship when api_key is configured; servers that don't care (cortex) ignore them. 
- `WireApi::AnthropicMessages` in build_provider now constructs the provider instead of erroring "reserved for future". - `provider::mod.rs` registers the new module. 18 new unit tests: encoder (system collapse, multi-system concat, default max_tokens, multipart with image, tool_use blocks, tool results, malformed JSON arg fallback), decoder (text streaming, tool_use lifecycle, max_tokens→length mapping, empty deltas, ping events, error events, cancellation, malformed payload skip, thinking blocks). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 14:01:59 +03:00
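The smaller mappings in this commit (stop reasons, the `max_tokens` default, and the auth headers) can be illustrated as plain functions. The function names are hypothetical, and returning `None` for an unrecognised `stop_reason` is an assumption; the commit does not say how unknown values are handled.

```rust
/// Map Anthropic's `stop_reason` onto the finish-reason strings listed in the commit.
/// Unknown values return `None` here (an assumption, not documented behaviour).
fn map_stop_reason(stop_reason: &str) -> Option<&'static str> {
    match stop_reason {
        "end_turn" | "stop_sequence" => Some("stop"),
        "max_tokens" => Some("length"),
        "tool_use" => Some("tool_calls"),
        _ => None,
    }
}

/// Anthropic requires `max_tokens`; the commit defaults it to 8192 when unset.
fn effective_max_tokens(requested: Option<u32>) -> u32 {
    requested.unwrap_or(8192)
}

/// Both headers ship only when an API key is configured.
fn auth_headers(api_key: Option<&str>) -> Vec<(&'static str, String)> {
    match api_key {
        Some(key) => vec![
            ("x-api-key", key.to_string()),
            ("anthropic-version", "2023-06-01".to_string()),
        ],
        None => Vec::new(),
    }
}
```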
rob thijssen	1818dfb337	feat(helexa-acp): openai-responses provider
Stage 6a. Implements the `Provider` trait for OpenAI's Responses API surface, parallel to the existing `OpenAIChatProvider`. Lets a helexa-acp endpoint configured with `wire_api = "openai-responses"` drive a `/v1/responses` server (today: neuron through cortex; later: OpenAI directly) using the same agent-loop machinery the chat provider already supports. ## Encoder (CompletionRequest → Responses body) - System messages collapse into a single top-level `instructions` string. Multiple system messages concatenate with blank lines so ordering is preserved. - User messages become `{type:"message", role:"user", content:…}` input items.
Text content stays a bare string; MultiPart content (text + images, post-Stage 5) becomes a `[{type:"input_text"}, {type:"input_image"}]` array with images encoded as `data:{mime};base64,{data}` URIs — exactly the shape neuron's `wire::openai_responses::request_to_chat` accepts. - Assistant text turns become an `output_text` content part inside a `message` item. - Assistant tool-call turns become `function_call` input items. - Tool result turns become `function_call_output` input items. - `max_tokens` translates to `max_output_tokens`. ## Decoder (Responses SSE → CompletionEvent) Reads named events on the SSE `event:` line: - `response.output_text.delta` → `CompletionEvent::TextDelta` - `response.output_item.added` with `type:"function_call"` → `CompletionEvent::ToolCallStart` (and, when the upstream pre-buffers fully, a single `ToolCallArgsDelta`) - `response.function_call_arguments.delta` → `CompletionEvent::ToolCallArgsDelta`, correlated back to the tool-call slot by output_index. - `response.completed` → `CompletionEvent::Usage` (if present) + `CompletionEvent::Finish` with reason mapped from `status`: `"completed"` → `"stop"`, `"incomplete"` → `"length"`. - Bookkeeping events (`response.created`, `response.in_progress`, `.content_part.`, `.output_text.done`, `.output_item.done`, `.function_call_arguments.done`, reasoning_) are skipped. ## Wiring - `EndpointConfig::responses_url()` joins `{base_url}/responses`. - `WireApi::OpenAiResponses` in `build_provider` constructs the new provider (was previously a "reserved for future" error). - `provider::mod.rs` registers the new module. ## Cuts (carried over from neuron-side issues) - The decoder's `ToolCall` handling fires correctly when the upstream emits `function_call` items, but the neuron candle harness doesn't yet (Refs #6). Real tool-call testing against cortex+neuron stays on the chat path until #6 lands. 
- Reasoning events (`response.reasoning_`) are deliberately dropped today; once neuron emits `InferenceEvent::ReasoningDelta` (Refs #5) the projector on the neuron side will start firing the reasoning event family and this decoder will need a matching case to route them to `CompletionEvent::ReasoningDelta`. 13 new unit tests cover encoder (system collapse, multipart user input, assistant output_text encoding, tool-call round-trip via function_call items) and decoder (text streaming, empty deltas dropped, length finish, function_call lifecycle, inline-arguments shape, cancellation, malformed payload skip). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 11:30:25 +03:00
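Two of the encoder/decoder rules above are simple enough to sketch directly: collapsing system messages into `instructions`, and mapping the final `status` to a finish reason. The `Message` enum and both function names are hypothetical, and joining with `"\n\n"` is an assumed reading of "concatenate with blank lines".

```rust
/// Minimal stand-in for the provider's message type (hypothetical).
enum Message {
    System(String),
    User(String),
}

/// Collapse all system messages, in order, into one `instructions` string.
/// Returns `None` when there are no system messages.
fn collapse_instructions(messages: &[Message]) -> Option<String> {
    let systems: Vec<&str> = messages
        .iter()
        .filter_map(|m| match m {
            Message::System(text) => Some(text.as_str()),
            Message::User(_) => None,
        })
        .collect();
    if systems.is_empty() {
        None
    } else {
        // Assumption: "blank lines" means a double newline separator.
        Some(systems.join("\n\n"))
    }
}

/// Map the `status` on `response.completed` to a chat-style finish reason.
/// Statuses outside the two named in the commit return `None` (an assumption).
fn finish_reason_from_status(status: &str) -> Option<&'static str> {
    match status {
        "completed" => Some("stop"),
        "incomplete" => Some("length"),
        _ => None,
    }
}
```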
rob thijssen	5ed1140c97	feat(cortex-gateway): proxy /v1/responses to neuron
Step 3 of the Responses rollout: plain proxy route on the gateway, no translation. Neuron speaks the Responses API natively after Step 2 (commit `957f704`), so the gateway just needs the same routing shape it uses for /v1/chat/completions — extract `model`, resolve via router::resolve, forward verbatim. - New `POST /v1/responses` handler in handlers.rs::responses. - Mock neuron under tests/common/mod.rs gains a `/v1/responses` endpoint that mirrors the ResponsesResponse shape neuron emits. - New integration test file `tests/responses.rs` exercises: - Happy path (200, body round-trips, ResponsesUsage shape).
- Unknown model → 404 (matches chat-completions error shape). - Missing `model` field → 400 (same extract_model helper). Streaming proxy works through the same path as chat completions — the upstream Content-Type (`text/event-stream` for stream:true, `application/json` otherwise) propagates through proxy_with_metrics unchanged. Live-stream integration tests against a streaming mock deferred until we exercise the path against a real neuron, since the chat-completions streaming test already covers the proxy's SSE forwarding mechanics. Three new tests; clippy + fmt clean across the workspace. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 11:21:43 +03:00
rob thijssen	957f704efa	feat(neuron): OpenAI Responses API + ci cuda-check runner label
Step 2 of the Responses rollout: native `/v1/responses` endpoint on neuron that consumes the same InferenceEvent stream as `/v1/chat/completions` but emits it as the Responses API's named SSE event family. No gateway-side translation. ## Surface - `cortex-core::responses` envelope types: `ResponsesRequest`, `ResponsesInput` (text | items), `ResponsesInputItem` (message | function_call | function_call_output | reasoning), `ResponsesContentPart` (input_text | input_image | output_text), `ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`.
Plus a `events::*` constant module so the projector and the wire shape stay in sync without string-typos. - `neuron::wire::openai_responses`: - `request_to_chat(req)` flattens Responses input + instructions into a `ChatCompletionRequest` the candle harness already understands. Text-only Parts collapse to a string; mixed text+image Parts go to chat's content-array shape; reasoning items drop; function_call / function_call_output round-trip via tool_calls / tool_call_id metadata so the surface is consistent for the day the harness emits tool calls. - `project_responses_stream(rx, meta)` reads InferenceEvents and emits the eight named events that compose a Responses stream: response.created → output_item.added → content_part.added → output_text.delta×N → output_text.done → content_part.done → output_item.done → response.completed. Synthesises start frames if the producer skips Start (poisoned model, early disconnect) so the stream stays coherent. - `build_response(meta, text, reason, usage)` for the non-streaming path. - `CandleHarness::inference_stream(req)` extracted from `chat_completion_stream`, returning a typed `InferenceStream` (event receiver + id/created/model_id metadata). Both `chat_completion_stream` and the new `responses_stream` are now thin wrappers that pick their wire projection. TP path got the same treatment (`chat_completion_tp_stream` → `inference_tp_stream`). - `POST /v1/responses` route on neuron. Non-streaming returns one buffered `ResponsesResponse`; streaming returns axum SSE with both event names and JSON data per frame (Responses, unlike chat completions, uses named `event:` lines). Reused `inference_error_response` helper hoisted out so the chat and responses handlers share the InferenceError → HTTP mapping. ## CI Also bundles the `cuda-check` runner-label fix from feedback on commit `1859777`: `runs-on: rpm` doesn't ship the CUDA toolkit so cudarc's nvcc-version build script blew up. 
Switched to `runs-on: cuda-13.0` per the existing labels. ## Scope cuts (documented in the modules) - `previous_response_id` rejected at translate time with 400 (`code: chained_conversation_not_supported`) — stateful chained conversations need a persistence layer we haven't built. - Reasoning items dropped (no Qwen3 `<think>` routing yet). - Single output item per response (one `"message"` carrying text); `function_call` items reserved but not synthesised. - Streaming events cover the core set; `response.in_progress` and the web_search / image_generation event families are out-of-scope. 22 new tests: 5 in cortex-core (envelope round-trips), 13 in neuron::wire (request translator + projector + non-streaming builder), 4 in neuron's tests/api.rs (route surface — 503 when no candle, 400 on previous_response_id, 404 on missing model for both stream and non-stream). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 11:13:44 +03:00
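The ordering of the eight named events can be written down as a small function. This only illustrates the sequence listed in the commit, using the fully prefixed Responses event names; `responses_event_sequence` is a hypothetical helper, not part of `project_responses_stream`.

```rust
/// Event names, in order, for a single-message Responses stream
/// that carries `delta_count` text deltas.
fn responses_event_sequence(delta_count: usize) -> Vec<&'static str> {
    let mut events = vec![
        "response.created",
        "response.output_item.added",
        "response.content_part.added",
    ];
    // One delta event per text chunk produced by the harness.
    events.extend(std::iter::repeat("response.output_text.delta").take(delta_count));
    events.extend_from_slice(&[
        "response.output_text.done",
        "response.content_part.done",
        "response.output_item.done",
        "response.completed",
    ]);
    events
}
```

Even with zero deltas the seven framing events are still produced, which mirrors the commit's point that the stream should stay coherent when the producer ends early.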
rob thijssen	6927286cab	fix(neuron): clone id/model_id before TP spawn so wire projector can use them
The Step 1 refactor moved the InferenceEvent receiver wrap to after the orchestration spawn in chat_completion_tp_stream, but the spawn moves both `id` and `model_id` into its async closure (used heavily by acquire_pool_lock, NCCL ops, and tracing). Result: borrowck error E0382 use-of-moved-value on the wire_chat::project_chat_stream call. The non-CUDA build doesn't exercise this branch (it lives behind `#[cfg(feature = "cuda")]`) which is why the workspace clippy/test gate passed locally and on the regular CI workflow.
The RPM build workflow, which compiles with --features cuda, caught it (run 244 jobs 2/3/4 against beast / ampere / ada respectively, all the same error). Fix: snapshot `id` and `model_id` into `projector_id` / `projector_model_id` before the spawn, use those at the projector call site. The originals stay free to be moved into the closure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 09:37:10 +03:00
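The shape of this fix can be shown with a plain thread in place of the async orchestration task. This is a sketch under that substitution: `run_with_projector_snapshot` is a hypothetical function, and only the `projector_id` / `projector_model_id` names come from the commit.

```rust
use std::thread;

/// Clone the values the projector needs before the closure takes ownership.
/// Without the clones, using `id` after the `move` closure is error E0382.
fn run_with_projector_snapshot(id: String, model_id: String) -> (String, String, usize) {
    // Snapshot for the later projector call site.
    let projector_id = id.clone();
    let projector_model_id = model_id.clone();

    // `id` and `model_id` are moved into the worker from here on.
    let worker = thread::spawn(move || format!("{id}:{model_id}").len());
    let worker_output = worker.join().expect("worker thread panicked");

    // The snapshots are still owned by this scope and free to use.
    (projector_id, projector_model_id, worker_output)
}
```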
rob thijssen	302ccfb982	refactor(neuron): introduce InferenceEvent + wire projection layer
Step 1 of the OpenAI Responses API rollout. Pure refactor — no new endpoints, no behaviour change on the wire. Lays the seam for emitting Responses-shaped streaming events from the same harness output as chat completions in Step 2. - New `neuron::wire` module tree: - `wire::event::InferenceEvent` — format-agnostic enum (Start, TextDelta, ReasoningDelta, Finish) the candle harness now emits as its native streaming currency. - `wire::event::FinishReason` — typed reason that maps cleanly onto OpenAI `finish_reason`, OpenAI Responses `status`, and Anthropic `stop_reason` strings.
- `wire::openai_chat::project_chat_stream` — async task that consumes an InferenceEvent receiver and produces a ChatCompletionChunk receiver, stamping per-request metadata (id, created, model_id) onto every chunk. Output matches the pre-refactor wire shape bit-for-bit. - candle.rs refactored to emit InferenceEvent on its internal channel through all three streaming paths (CPU run_inference_streaming, CUDA single-GPU stream_inference_via_worker, CUDA TP chat_completion_tp_stream). The streaming functions lost their id/created/model_id parameters since wire-format metadata now lives in the projector. - emit_delta + emit_delta_blocking simplified to single-purpose TextDelta emitters with no wire-format coupling. - chat_completion_stream wraps the InferenceEvent receiver in wire_chat::project_chat_stream before returning so the /v1/chat/completions HTTP handler keeps consuming ChatCompletionChunks unchanged. External signature preserved. Also fixes a pre-existing helexa-acp test race (three modules each declared their own static LOCK for HOME mutation, so cross-module parallelism flaked tests that read HOME at runtime). Consolidated onto a single crate-wide path_util::ENV_LOCK. 122 helexa-acp tests + 44 neuron tests pass (5 new wire projection tests). fmt + clippy --workspace -- -D warnings clean. Ran helexa-acp suite 3x to confirm the env race is closed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:30:17 +03:00
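A compact sketch of the event/projection split, using a slice instead of the async receiver. The four `InferenceEvent` variants are named in the commit, but their payloads, the `FinishReason` variants (`Stop`, `Length`), the `ChatChunk` struct, and `project_chat` are assumptions for illustration. Skipping `Start` and `ReasoningDelta` is also a simplification of this sketch, not a statement about the real projector.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum FinishReason {
    Stop,
    Length,
}

impl FinishReason {
    fn openai_finish_reason(self) -> &'static str {
        match self {
            FinishReason::Stop => "stop",
            FinishReason::Length => "length",
        }
    }
    fn responses_status(self) -> &'static str {
        match self {
            FinishReason::Stop => "completed",
            FinishReason::Length => "incomplete",
        }
    }
    fn anthropic_stop_reason(self) -> &'static str {
        match self {
            FinishReason::Stop => "end_turn",
            FinishReason::Length => "max_tokens",
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
enum InferenceEvent {
    Start,
    TextDelta(String),
    ReasoningDelta(String),
    Finish(FinishReason),
}

#[derive(Debug, PartialEq)]
struct ChatChunk {
    id: String,
    created: u64,
    model: String,
    content: Option<String>,
    finish_reason: Option<&'static str>,
}

/// Stamp per-request metadata onto every chunk; the harness never sees it.
fn project_chat(events: &[InferenceEvent], id: &str, created: u64, model: &str) -> Vec<ChatChunk> {
    events
        .iter()
        .filter_map(|event| {
            let (content, finish_reason) = match event {
                InferenceEvent::TextDelta(t) => (Some(t.clone()), None),
                InferenceEvent::Finish(r) => (None, Some(r.openai_finish_reason())),
                // Simplification: this sketch emits nothing for these.
                InferenceEvent::Start | InferenceEvent::ReasoningDelta(_) => return None,
            };
            Some(ChatChunk { id: id.to_string(), created, model: model.to_string(), content, finish_reason })
        })
        .collect()
}
```

The point of the split is visible in the signatures: the event enum carries no ids or timestamps, so a second projector for another wire format can consume the same events unchanged.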
rob thijssen	df0abfe4d4	feat(helexa-acp): image input for vision-capable models
Stage 5. Zed clipboard/DnD images get forwarded as OpenAI content-array messages on user turns. - New MessageContent::MultiPart variant + MessagePart (Text|Image) + ImageData struct (mime_type, base64 data, optional uri). - flatten_prompt now produces structured content: collapses to Text when every block is text (some upstreams treat array-form as vision-only and refuse on text-only models), otherwise produces MultiPart preserving block order. - OpenAI encoder emits `[{type:"text",text:…}, {type:"image_url", image_url:{url:"data:{mime};base64,{data}"}}]` for MultiPart user messages.
Data URIs are used over remote `uri` because they round-trip through every upstream we care about. - prompt_capabilities.image = true at initialize so Zed actually sends image blocks. - compaction estimates ~512 tokens per image (the middle of the Qwen3-VL / OpenAI detail range) so the budget tracker doesn't pretend images are free. - session/load replays image-bearing user turns by surfacing the text parts verbatim and rendering each image as a "[image: {mime} ({n} bytes)]" placeholder chunk — Zed can show the prior text context even though re-uploading the bytes through ACP isn't meaningful for resume. - 4 new tests: flatten produces MultiPart in block order, image-only prompts still flatten to MultiPart, encoder emits the correct array shape, text-only encoding stays as the string form. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 09:43:00 +03:00
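The collapse rule in `flatten_prompt` and the data-URI shape can be sketched like this. The `Part` and `Content` types are simplified stand-ins for `MessagePart` / `MessageContent`, `data_uri` is a hypothetical helper, and joining text blocks with a newline is an assumption (the commit does not state the separator).

```rust
#[derive(Debug, Clone, PartialEq)]
enum Part {
    Text(String),
    Image { mime_type: String, data: String },
}

#[derive(Debug, PartialEq)]
enum Content {
    Text(String),
    MultiPart(Vec<Part>),
}

/// Collapse to plain text when every block is text; otherwise keep block order.
fn flatten_prompt(blocks: Vec<Part>) -> Content {
    if blocks.iter().all(|b| matches!(b, Part::Text(_))) {
        let text = blocks
            .iter()
            .filter_map(|b| match b {
                Part::Text(t) => Some(t.as_str()),
                Part::Image { .. } => None,
            })
            .collect::<Vec<_>>()
            .join("\n"); // separator is an assumption
        Content::Text(text)
    } else {
        Content::MultiPart(blocks)
    }
}

/// Build the `image_url.url` value from already base64-encoded image data.
fn data_uri(mime_type: &str, base64_data: &str) -> String {
    format!("data:{mime_type};base64,{base64_data}")
}
```

Keeping text-only prompts as a bare string avoids the array form for upstreams that, as the commit notes, treat it as vision-only.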
rob thijssen	b9016571f6	feat(helexa-acp): expand ~ / $HOME and fall back to local fs on ACP read errors
Some checks failed (neuron RPM packaging blocked by required conditions; publish to rpm.lair.cafe cancelled).

Two related polish fixes for daily use:

- New `path_util` module expands `~`, `~/…`, `$HOME`, and `$HOME/…` prefixes in every tool that takes a path (read_file, write_file, edit_file, list_dir, bash cwd). The expansion is also applied to the plan-mode write gate so `~/.local/share/helexa-acp/plans/…` comparisons behave correctly regardless of which form the model emits.
- `read_file` now falls back to `std::fs::read_to_string` when ACP's `fs/read_text_file` errors out. Zed's workspace-scoped read was the source of "model can't see ~/git/architecture/generic.md" when the session cwd is a different project; the fallback lets the agent pull in shared material that lives outside the active workspace, the same way `list_dir` already does via local `std::fs::read_dir`. Local fallback honours line/limit args. The fallback also produces a combined error message when both ACP and local-fs reads fail, so the model sees what actually broke rather than just the ACP-side error.

14 new unit tests cover path_util's prefix matrix, fallback success/failure paths, and the line/limit slicing in fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 09:28:58 +03:00
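A minimal sketch of the four-prefix expansion from b9016571f6. The function name `expand_home` and the choice to pass the home directory as a parameter (rather than reading the environment) are assumptions made to keep the example testable; the real `path_util` API is not shown in the log.

```rust
use std::path::PathBuf;

/// Expand a leading `~`, `~/…`, `$HOME`, or `$HOME/…` against `home`.
/// Anything else (including `~user/…`) is returned unchanged.
pub fn expand_home(input: &str, home: &str) -> PathBuf {
    for prefix in ["~", "$HOME"] {
        if let Some(rest) = input.strip_prefix(prefix) {
            if rest.is_empty() {
                // Bare `~` or `$HOME`.
                return PathBuf::from(home);
            }
            if let Some(tail) = rest.strip_prefix('/') {
                // `~/…` or `$HOME/…`.
                return PathBuf::from(home).join(tail);
            }
            // e.g. `~user` or `$HOMEDIR`: not one of the four forms.
        }
    }
    PathBuf::from(input)
}
```

Applying the same function to both sides of the plan-mode write gate comparison is what makes `~/…` and absolute forms compare equal.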
rob thijssen	adbc52bfcd	feat(helexa-acp): model picker + session/set_model handler
All checks were successful.

Stage 4. Zed's model dropdown now lists every model from every configured endpoint, and switching it routes the next prompt to a new endpoint+model.

- Enable `unstable_session_model` on the agent-client-protocol dep so SessionModelState / SetSessionModelRequest / ModelInfo are available.
- Agent::new becomes async and calls Provider::list_models on every provider at startup; per-endpoint failures warn-and-skip instead of aborting the agent.
- With a single endpoint configured, model ids appear bare; with multiple endpoints every id carries the `endpoint:` prefix so the picker is unambiguous and parse_model_selector routes correctly.
- NewSessionResponse and LoadSessionResponse attach SessionModelState with the session's current model id + the aggregated catalogue.
- session/set_model: validates the requested model id against resolve_provider, mutates session.model_id, and persists so the on-disk transcript reflects the new model.

Three new aggregate_models tests cover the prefixing rule (bare vs multi-endpoint) and warn-and-skip on a failing endpoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 09:10:16 +03:00
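The prefixing and warn-and-skip rules from adbc52bfcd might look something like this. The function name, the tuple-based input, and the use of `eprintln!` in place of a tracing warning are all illustrative assumptions; the sketch also assumes the bare-vs-prefixed decision is based on the number of configured endpoints, not the number that responded.

```rust
/// Build picker ids from per-endpoint listings.
/// Each entry is (endpoint name, result of listing that endpoint's models).
pub fn aggregate_model_ids(endpoints: &[(&str, Result<Vec<&str>, String>)]) -> Vec<String> {
    // Bare ids with one configured endpoint, `endpoint:model` otherwise.
    let prefixed = endpoints.len() > 1;
    let mut ids = Vec::new();
    for (endpoint, listing) in endpoints {
        match listing {
            Ok(models) => {
                for model in models {
                    if prefixed {
                        ids.push(format!("{endpoint}:{model}"));
                    } else {
                        ids.push(model.to_string());
                    }
                }
            }
            // A failing endpoint is reported and skipped, not fatal.
            Err(err) => eprintln!("warning: skipping endpoint {endpoint}: {err}"),
        }
    }
    ids
}
```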
rob thijssen	537a0fe7f2	feat(helexa-acp): context compaction for small-context local models
All checks were successful.

A new src/compaction.rs module projects rolling conversation history into a token budget before each completion. Older tool results and assistant prose get elided to one-line markers; system prompts, user turns, and the last KEEP_TAIL=4 messages stay verbatim. tool_call_id pairing is preserved so OpenAI strict-schema providers keep working.

Driven by a new per-endpoint `context_window` config field (also HELEXA_ACP_CONTEXT_WINDOW for the env-only single-endpoint case). When set, prompt budget = context_window - max_tokens - 512_safety; when unset, behaviour is unchanged.

Without this, a 32 K Qwen3 dies with `prompt_too_long` after the first few read_file results pile up in history, the symptom seen in plan-mode dogfooding on beat.

10 new unit tests cover the compaction strategy and the prompt budget arithmetic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 08:22:01 +03:00
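The budget arithmetic and the keep-verbatim rule from 537a0fe7f2 can be sketched as below. The function names and the `Role` enum are hypothetical, and clamping at zero with `saturating_sub` is an assumption about how an undersized context window would be handled.

```rust
pub const KEEP_TAIL: usize = 4;
pub const SAFETY_MARGIN: u32 = 512;

#[derive(Debug, Clone, PartialEq)]
pub enum Role {
    System,
    User,
    Assistant,
    Tool,
}

/// prompt budget = context_window - max_tokens - 512 safety margin.
/// `None` means no context_window is configured, so nothing is compacted.
pub fn prompt_budget(context_window: Option<u32>, max_tokens: u32) -> Option<u32> {
    context_window.map(|w| w.saturating_sub(max_tokens).saturating_sub(SAFETY_MARGIN))
}

/// System prompts, user turns, and the last KEEP_TAIL messages stay verbatim;
/// everything else is a candidate for elision to a one-line marker.
pub fn keep_verbatim(index: usize, total: usize, role: &Role) -> bool {
    let in_tail = index + KEEP_TAIL >= total;
    in_tail || matches!(role, Role::System | Role::User)
}
```

For a 32 K window (32768 tokens) and the 8192 max_tokens default, this leaves 32768 - 8192 - 512 = 24064 tokens for the prompt.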
rob thijssen	cbadfcf112	feat(helexa-acp): plan mode — third session mode for read-and-plan-only flows
Some checks failed (neuron RPM packaging blocked by required conditions; publish to rpm.lair.cafe cancelled).

Plan mode is the most restrictive of the three session modes: bash is disabled outright, writes are confined to a per-project plan directory under $XDG_DATA_HOME/helexa-acp/plans/<basename>-<8hex>/, and reads / list_dir are unrestricted.

The system prompt is rebuilt at the top of every round so a mid-turn switch into (or out of) plan mode takes effect on the next streaming round, and plan mode appends a 3-option menu instructing the model to stop and let the user pick how to proceed once the plan is complete.

The project id is basename + FNV-1a-32 of the cwd so it stays stable across runs (SipHash's DefaultHasher reseeds per process), while still disambiguating multiple checkouts that share a final path component.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 08:06:25 +03:00
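A sketch of the `<basename>-<8hex>` project id from cbadfcf112, using the standard 32-bit FNV-1a constants. The function names and the `"root"` fallback for a cwd with no final component are assumptions; the log only fixes the hash choice and the output shape.

```rust
use std::path::Path;

/// 32-bit FNV-1a: xor each byte into the hash, then multiply by the FNV prime.
pub fn fnv1a32(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV-1a 32-bit offset basis
    for &b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(0x0100_0193); // FNV 32-bit prime
    }
    hash
}

/// `<basename>-<8 hex digits>`, stable across processes because FNV-1a is unseeded.
pub fn project_id(cwd: &str) -> String {
    let base = Path::new(cwd)
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or("root"); // fallback name is an assumption
    format!("{base}-{:08x}", fnv1a32(cwd.as_bytes()))
}
```

Two checkouts such as `/a/cortex` and `/b/cortex` share the `cortex-` prefix but get different suffixes, because every FNV-1a step is a bijection on the 32-bit state and the inputs differ in one byte.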
rob thijssen	3ecbb21ece	fix(helexa-acp): persist per round, cancel previous prompt, log loop
All checks were successful.

Three changes addressing "session stops mid-turn and disk store doesn't update":

1. Per-round persistence. drive_prompt previously called store::save() once at the very end of the turn. If the loop stalled in a later round (long-running bash, upstream SSE that never finished, wedged ACP roundtrip), earlier successful rounds lived only in the spawned task's `new_turns` and never reached disk. Move the extend-history + save into a helper (extend_and_persist) and call it at the end of every loop iteration. The post-loop save catches whatever the break paths leave behind. Failure is logged, not propagated.
2. Cancel previous in-flight prompt on new session/prompt. The handler used to overwrite SessionState.cancel with a fresh token without firing the old one. A wedged prior prompt would then live forever, holding session-state references and never persisting. Now we fire the existing cancel under the lock before installing the new token; the old task observes is_cancelled() on its next .await and unwinds.
3. Per-round and per-tool log lines. drive_prompt now emits:
   - INFO prompt round: streaming { round, of, history_turns }
   - INFO dispatch tool { tool, tool_call_id }
   - INFO dispatch tool complete { tool_call_id, is_error }
   - INFO prompt round complete; persisting { round, turns }
   - INFO prompt complete { stop_reason }
   so the next hang shows up by line number in /tmp/helexa-acp.log instead of as silence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 16:29:22 +03:00
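The cancel-then-install step from change 2 of 3ecbb21ece can be illustrated with standard-library primitives. `CancelToken` here is a hypothetical stand-in built on `AtomicBool` (the real agent presumably uses an async-aware token), and `begin_prompt` is an invented name.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for an async cancellation token.
#[derive(Clone, Default)]
pub struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Default)]
pub struct SessionState {
    pub cancel: Option<CancelToken>,
}

/// Fire the previous prompt's token (if any) under the lock, then install
/// and return a fresh one for the new prompt.
pub fn begin_prompt(session: &Mutex<SessionState>) -> CancelToken {
    let mut state = session.lock().unwrap();
    if let Some(previous) = state.cancel.take() {
        previous.cancel();
    }
    let fresh = CancelToken::default();
    state.cancel = Some(fresh.clone());
    fresh
}
```

Doing both steps under one lock acquisition means there is no window in which a second caller could overwrite the token before the old one has been fired.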
rob thijssen	0d841a4981	feat(helexa-acp): replay session history on session/load
Some checks failed (CI / Test failing after 1m19s).

session/list and session/load were both implemented but clicking a session in Zed's thread picker still left the agent panel empty. Zed (and ACP clients in general) doesn't cache the transcript for custom agent_servers entries; it only owns conversation state for first-party agents. For custom agents the expectation is that session/load returns successfully and the agent then re-emits the conversation as a stream of session/update notifications so the client can rebuild its view.

Implement that replay path:

- handle_load_session now returns (LoadSessionResponse, Vec<Message>) so the caller has the history available after the in-memory hydration finishes.
- The session/load closure responds to the request first, then spawns a task that calls replay_history off the dispatch loop.
- replay_history walks the persisted history and emits one session/update per turn:
    Role::User → UserMessageChunk(text)
    Role::Assistant text → AgentMessageChunk(text)
    Role::Assistant tool → AgentMessageChunk for any accompanying text + one ToolCall card per call (with kind/title/raw_input rendered the same way as the live dispatch path)
    Role::Tool result → ToolCallUpdate matching the assistant's call id, status: Completed, content set to the result text
    Role::System → skipped (system prompts aren't shown)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 16:02:00 +03:00
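The role-to-update mapping in 0d841a4981 can be sketched as a pure function. `Turn` and `Update` are deliberately simplified, hypothetical types; the real code works with the agent's `Message` history and the ACP crate's session/update types, whose exact shapes are not shown in the log.

```rust
// Hypothetical, simplified history and update types.
#[derive(Debug, PartialEq)]
pub enum Turn {
    System(String),
    User(String),
    /// Optional prose plus (call id, tool name) pairs.
    Assistant { text: Option<String>, tool_calls: Vec<(String, String)> },
    ToolResult { call_id: String, text: String },
}

#[derive(Debug, PartialEq)]
pub enum Update {
    UserMessageChunk(String),
    AgentMessageChunk(String),
    ToolCall { id: String, title: String },
    ToolCallCompleted { id: String, content: String },
}

/// Turn persisted history back into the update stream a client needs
/// to rebuild its view. System prompts are skipped.
pub fn replay_history(history: &[Turn]) -> Vec<Update> {
    let mut out = Vec::new();
    for turn in history {
        match turn {
            Turn::System(_) => {}
            Turn::User(text) => out.push(Update::UserMessageChunk(text.clone())),
            Turn::Assistant { text, tool_calls } => {
                if let Some(text) = text {
                    out.push(Update::AgentMessageChunk(text.clone()));
                }
                for (id, name) in tool_calls {
                    out.push(Update::ToolCall { id: id.clone(), title: name.clone() });
                }
            }
            Turn::ToolResult { call_id, text } => out.push(Update::ToolCallCompleted {
                id: call_id.clone(),
                content: text.clone(),
            }),
        }
    }
    out
}
```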
rob thijssen	0bbb9b752d	feat(helexa-acp): session/list so Zed can discover sessions to resume
All checks were successful.

Stage 3b only implemented the trailing half of resume: write sessions to disk + handle session/load. But Zed (and any ACP client) needs `session/list` to discover which session belongs to the workspace it's reopening. Without it, the client only knows how to mint new sessions and resume never fires even though the JSON sits ready on disk.

Add the missing pieces:

- store::list / list_in_dir: enumerate {id}.json under sessions_dir(), optionally filter by cwd, sort recent-first. Skips unparseable files with a warn rather than aborting.
- store::unix_to_iso8601: RFC 3339 formatter for SessionInfo.updated_at; pulls chrono in directly (already in the dep tree transitively).
- agent::handle_list_sessions: wires the request to the store, builds SessionInfo entries with derived titles (first user turn, truncated to 60 chars).
- agent::initialize_response: advertise session_capabilities.list = {} alongside the existing load_session: true.

Verified end-to-end against the user's real hxa-1.json (60-turn beat conversation): `session/list` returns the entry with cwd, derived title, and ISO 8601 timestamp.

4 new store unit tests for list filtering, missing-dir handling, unparseable-file skipping, and ISO 8601 formatting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 14:34:41 +03:00
rob thijssen	5aac1ffc59	feat(helexa-acp): session resume via session/load
All checks were successful.

Zed restarts (frequent during helexa-acp dogfooding) used to lose every conversation because we'd ignore the load_session capability and treat every project-reopen as a fresh session/new. Persist sessions to disk and honour session/load so the agent panel comes back where it left off.

Storage layout: $XDG_DATA_HOME/helexa-acp/sessions/{session_id}.json

Each file holds session_id, cwd, model_id, mode_id, full Message history, plus created/updated timestamps. Atomic save via tempfile+rename so a crash mid-write can't corrupt the store.

Touch points:

- src/store.rs (new): sessions_dir() resolution, save/load via default and explicit-dir entry points (so unit tests don't have to race on XDG_DATA_HOME). 5 unit tests cover round-trip, not-found errors, atomic overwrite, tool-call/result preservation, and the filename sanitiser's path-traversal handling.
- src/provider/mod.rs: Serialize/Deserialize on Role, Message, MessageContent, ToolCall. MessageContent::Text turned into a struct variant ({text: ...}) so internally-tagged JSON works.
- src/agent.rs: initialize_response advertises load_session: true; handle_load_session reads the file, snapshots in-memory state, returns LoadSessionResponse with the persisted mode preselected; drive_prompt persists at the end of every prompt round under the session lock with the I/O outside the lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 13:34:42 +03:00
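The tempfile+rename save from 5aac1ffc59 follows a common pattern, sketched here with the standard library only. The function name, the temp-file naming scheme, and the `sync_all` call are assumptions; the sketch also assumes `session_id` has already passed through the filename sanitiser the commit mentions.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Write `json` to `{dir}/{session_id}.json` without ever exposing a
/// half-written file: write a sibling temp file, then rename over the target.
pub fn save_atomic(dir: &Path, session_id: &str, json: &str) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let final_path = dir.join(format!("{session_id}.json"));
    // Same directory as the target so the rename stays on one filesystem.
    let tmp_path = dir.join(format!(".{session_id}.json.tmp"));
    {
        let mut file = fs::File::create(&tmp_path)?;
        file.write_all(json.as_bytes())?;
        file.sync_all()?; // flush to disk before the rename makes it visible
    }
    fs::rename(&tmp_path, &final_path)
}
```

A crash before the rename leaves at most a stray `.tmp` file; the previous `{session_id}.json` stays intact.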
rob thijssen	ec2b6450b2	feat(helexa-acp): infer tool name from arg shape when model omits it
Some checks are pending (publish to rpm.lair.cafe blocked by required conditions).

Qwen3.6-27B occasionally emits a <tool_call> body with the right arguments but no top-level `name` field, observed in the field as mkdir-style bash calls like {"arguments":{"command":"mkdir -p .../doc/plan/{01-discovery,...}"}} with no `name`. The agent had no tool to dispatch and surfaced a Failed card; the model would then hang or retry the same shape.

Add a shape-based inference layer:

- tools::infer_tool_name(arguments): given an `arguments` object alone, return Some(name) when the key set uniquely identifies one tool: `{command}` or `{command,cwd}` → bash, `{path,content}` → write_file, `{path,old_text,new_text}` → edit_file. Ambiguous shapes (`{path}` alone, which could be read_file or list_dir) return None so the agent still emits a Failed card rather than guessing.
- agent::try_repair_missing_name(raw): parses a malformed body, applies infer_tool_name, returns (name, args_json) on success.
- drive_prompt sweeps malformed_calls through this repair before the Failed-card path. Recovered calls go into tool_buckets at the next free index and dispatch through the normal tool loop.

10 new unit tests in tools::tests cover the inference table plus the verbatim mkdir failure from the field log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 13:14:50 +03:00
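The inference table in ec2b6450b2 can be written as an exact key-set match. This sketch takes the argument keys as a slice of strings rather than a JSON value, which is a simplification; the real `tools::infer_tool_name` signature is not shown in the log.

```rust
use std::collections::BTreeSet;

/// True when `actual` contains exactly the keys in `expected`.
fn same_keys(actual: &BTreeSet<&str>, expected: &[&str]) -> bool {
    actual.len() == expected.len() && expected.iter().all(|k| actual.contains(*k))
}

/// Map an `arguments` key set to a tool name when the shape is unambiguous.
pub fn infer_tool_name(keys: &[&str]) -> Option<&'static str> {
    let keys: BTreeSet<&str> = keys.iter().copied().collect();
    if same_keys(&keys, &["command"]) || same_keys(&keys, &["command", "cwd"]) {
        Some("bash")
    } else if same_keys(&keys, &["path", "content"]) {
        Some("write_file")
    } else if same_keys(&keys, &["path", "old_text", "new_text"]) {
        Some("edit_file")
    } else {
        // e.g. `{path}` alone could be read_file or list_dir: refuse to guess.
        None
    }
}
```

Matching on the whole key set rather than on the presence of a single key is what keeps `{path}` ambiguous instead of silently resolving to one of the path-taking tools.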
rob thijssen	a494c8d43c	feat(helexa-acp): repair malformed tool calls and render failures as cards
Some checks failed (CI / Test failing after 1m2s; helexa-neuron-blackwell RPM packaging blocked by required conditions; publish to rpm.lair.cafe cancelled).

Two related fixes for cases where Qwen3 sometimes emits slightly-off JSON inside <tool_call> blocks:

1. JSON repair pass in qwen3::parse_tool_call_body: strip up to three trailing extra `}` characters (model overshoots its closing braces), and hoist `name` out of `arguments` when it lands nested instead of as a sibling. Both observed in the field; both trivially repairable; both now dispatch as normal tool calls instead of falling back to the malformed path.
2. New CompletionEvent::MalformedToolCall variant for the cases repair can't fix. decode_stream now emits it instead of wrapping the raw body in a TextDelta, and agent.rs surfaces each one as a Failed SessionUpdate::ToolCall card (so Zed renders it as a structured failure UI element rather than dumping the body inline) plus a synthetic tool-call/tool-result history pair so the model gets clear feedback for self-correction on the next round.

Empty <tool_call></tool_call> blocks are now a no-op too (no Malformed event), matching the existing empty-<think> behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:58:51 +03:00
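One way to implement the trailing-brace half of the repair pass in a494c8d43c is to count brace depth outside string literals and strip only as many closing braces as the body overshoots by. This is a sketch under that assumption, not the actual `qwen3::parse_tool_call_body` logic, and both function names are hypothetical.

```rust
/// Net `{` minus `}` count, ignoring braces inside JSON string literals.
fn brace_balance(body: &str) -> i32 {
    let (mut depth, mut in_string, mut escaped) = (0i32, false, false);
    for c in body.chars() {
        if in_string {
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
        } else {
            match c {
                '"' => in_string = true,
                '{' => depth += 1,
                '}' => depth -= 1,
                _ => {}
            }
        }
    }
    depth
}

/// Strip one to three surplus trailing `}`; return the input unchanged
/// when it is balanced or overshoots by more than three.
pub fn strip_extra_closing_braces(body: &str) -> &str {
    let trimmed = body.trim_end();
    let extra = -brace_balance(trimmed);
    if !(1..=3).contains(&extra) {
        return body;
    }
    let mut out = trimmed;
    for _ in 0..extra {
        match out.strip_suffix('}') {
            Some(rest) => out = rest.trim_end(),
            None => return body, // surplus is not at the tail: leave for the malformed path
        }
    }
    out
}
```

The cap of three mirrors the commit message; anything beyond that falls through to the MalformedToolCall path rather than being guessed at.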
rob thijssen	abbedf8d8a	chore(neuron): bump default max_tokens from 512 to 8192
All checks were successful.

512 is too low for any modern coding model: clients that don't explicitly set max_tokens get clipped responses with no diagnostic. Bump the fallback at all four inference call sites (single-GPU streaming + non-streaming, TP leader + non-leader) to 8192, which fits comfortably within Qwen3-class context windows after a typical agent prompt and lines up with what helexa-acp / a0 / curl clients reasonably expect.

Clients that explicitly set max_tokens (now including helexa-acp via HELEXA_ACP_MAX_TOKENS / per-endpoint TOML) override this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:38:28 +03:00
rob thijssen	6cc14e925c	feat(helexa-acp): per-endpoint max_tokens config
Some checks failed (CI / Clippy failing after 1m3s; CI / Test failing after 1m4s; build, package, and publish jobs cancelled).

The agent was sending max_tokens: None, letting cortex/neuron pick its own default, which trips Zed's "Output Limit Reached" on long turns. Add a per-endpoint max_tokens option in EndpointConfig (TOML key and HELEXA_ACP_MAX_TOKENS env var for the single-endpoint fallback) that the agent threads into every CompletionRequest by endpoint name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:34:23 +03:00
rob thijssen	1c16732668	feat(helexa-acp): route Qwen3 inline <think> blocks to reasoning
Some checks failed (CI / Test waiting to run; build, package, and publish jobs cancelled or blocked).

Qwen3 emits chain-of-thought as literal <think>...</think> tags inside delta.content rather than via the separate reasoning_content field, so without parsing the markers, the thinking shows up in the message pane as ordinary text.

Add a small ThinkParser in qwen3.rs (same chunk-boundary discipline as ToolCallParser) and stage it after the tool-call parser in decode_stream: text events from the tool-call parser are fed in and split into TextDelta / ReasoningDelta. Zed now renders thinking in its dedicated thought UI; visible answer text stays in the message pane.

The parking-lot entry from the plan is now closed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:30:25 +03:00
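The "chunk-boundary discipline" mentioned in 1c16732668 means a tag split across two stream chunks must still be recognised. A minimal sketch of one way to do that is below: buffer incoming text and hold back any suffix that could still grow into the tag being looked for. The `Event` enum and the `feed` method are hypothetical; the real ThinkParser emits TextDelta / ReasoningDelta events and is not shown in the log.

```rust
#[derive(Debug, PartialEq)]
pub enum Event {
    Text(String),
    Reasoning(String),
}

const OPEN: &str = "<think>";
const CLOSE: &str = "</think>";

#[derive(Default)]
pub struct ThinkParser {
    buf: String,
    in_think: bool,
}

impl ThinkParser {
    /// Feed one stream chunk; returns the events that are safe to emit now.
    pub fn feed(&mut self, chunk: &str) -> Vec<Event> {
        self.buf.push_str(chunk);
        let mut events = Vec::new();
        loop {
            let tag = if self.in_think { CLOSE } else { OPEN };
            if let Some(pos) = self.buf.find(tag) {
                // Emit everything before the tag, drop the tag, flip state.
                let before: String = self.buf.drain(..pos).collect();
                self.buf.drain(..tag.len());
                self.emit(before, &mut events);
                self.in_think = !self.in_think;
            } else {
                // Hold back the longest suffix that is a proper prefix of `tag`.
                let hold = (1..tag.len())
                    .rev()
                    .find(|&n| self.buf.ends_with(&tag[..n]))
                    .unwrap_or(0);
                let cut = self.buf.len() - hold;
                let ready: String = self.buf.drain(..cut).collect();
                self.emit(ready, &mut events);
                return events;
            }
        }
    }

    fn emit(&self, text: String, events: &mut Vec<Event>) {
        if text.is_empty() {
            return; // empty <think></think> produces no event
        }
        events.push(if self.in_think {
            Event::Reasoning(text)
        } else {
            Event::Text(text)
        });
    }
}
```

A production parser would also need an end-of-stream flush for any held-back bytes that never completed a tag; that step is omitted here.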
rob thijssen	5a0861d639	fix(helexa-acp): forward Dispatch::Response to its awaiting router Some checks failed build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions Details build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions Details build-prerelease / Resolve version stamps (push) Successful in 39s Details CI / Format (push) Successful in 41s Details CI / Clippy (push) Successful in 2m31s Details build-prerelease / Build cortex binary (push) Successful in 4m36s Details CI / Test (push) Successful in 5m31s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 5m51s Details build-prerelease / Package cortex RPM (push) Successful in 1m29s Details build-prerelease / Build neuron-ampere (push) Successful in 7m18s Details build-prerelease / Build neuron-ada (push) Successful in 5m6s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details The catch-all on_receive_dispatch handler was applying respond_with_error to every Dispatch variant, including Response. For Response variants, that call routes the error to the ResponseRouter for the outgoing request — silently overwriting the real reply from Zed with "Internal error: not implemented yet". Every ACP roundtrip we issue (fs/read_text_file, fs/write_text_file, session/request_permission, terminal/*) was therefore returning an error to the tool runner regardless of what Zed actually responded. The model saw uniformly-failing tools, gave up, and confabulated plausible explanations. Fix: pattern-match the Dispatch. 
Response → forward to its router via respond_with_result. Request / Notification → keep the "not implemented yet" error response as before. Found via debug logs showing WARN helexa_acp::agent: unhandled ACP message method="fs/read_text_file" right before every tool failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:16:21 +03:00
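The pattern match described in this commit can be sketched as follows. This is a minimal illustration, not the crate's code: the `Dispatch` and `Outcome` types, their fields, and the `handle_unclaimed` function are hypothetical stand-ins, and the real handler calls `respond_with_result` / `respond_with_error` rather than returning a value.

```rust
// Hypothetical stand-in for the ACP dispatch type; the real crate's
// variants and payloads may differ.
#[derive(Debug, PartialEq)]
enum Dispatch {
    Request { method: String },
    Notification { method: String },
    Response { id: u64, result: Result<String, String> },
}

// Hypothetical summary of what the catch-all handler decides to do.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The peer's reply is handed, unchanged, to the router awaiting request `id`.
    Forwarded { id: u64, result: Result<String, String> },
    /// An unhandled inbound message gets the "not implemented yet" error.
    NotImplemented { method: String },
}

/// Catch-all handler: only inbound requests and notifications are answered
/// with an error; responses to our own outgoing requests pass through.
fn handle_unclaimed(dispatch: Dispatch) -> Outcome {
    match dispatch {
        Dispatch::Response { id, result } => Outcome::Forwarded { id, result },
        Dispatch::Request { method } | Dispatch::Notification { method } => {
            Outcome::NotImplemented { method }
        }
    }
}
```

The point of the sketch is that the `Response` arm never constructs an error, so a reply from the editor cannot be overwritten on its way back to the tool runner.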
rob thijssen	33652ac651	feat(helexa-acp): HELEXA_ACP_LOG_FILE env for editor-host logging Editors that launch ACP agents (Zed today) don't reliably surface the child's stderr — and `args` in an `agent_servers` config is exec-args, not shell, so the usual `&>>` redirect trick doesn't work. Add a HELEXA_ACP_LOG_FILE env var that, when set to an absolute path, routes the tracing subscriber to append-write that file (ANSI off) instead of stderr. RUST_LOG still controls levels. Unopenable paths fall back to stderr with a warning so a typo doesn't silence the agent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 11:47:28 +03:00
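A sketch of the fallback behaviour described above, using only the standard library. The function name `open_log_sink` and its return shape are hypothetical, the tracing-subscriber wiring is left out, and treating a relative path as "fall back to stderr" is an assumption (the commit only specifies the absolute-path and unopenable-path cases).

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Picks the log destination. Returns the writer plus a flag that is true
/// only when the requested file was opened. `log_file` stands in for the
/// value of HELEXA_ACP_LOG_FILE.
fn open_log_sink(log_file: Option<&str>) -> (Box<dyn Write + Send>, bool) {
    if let Some(raw) = log_file {
        let path = Path::new(raw);
        if path.is_absolute() {
            // Append mode, creating the file if needed, so restarts keep history.
            match OpenOptions::new().create(true).append(true).open(path) {
                Ok(file) => return (Box::new(file), true),
                Err(err) => eprintln!("warning: cannot open {raw}: {err}; logging to stderr"),
            }
        } else {
            // Assumption: a relative path is treated like an unopenable one.
            eprintln!("warning: {raw} is not an absolute path; logging to stderr");
        }
    }
    (Box::new(io::stderr()), false)
}
```

A caller would read the variable with `std::env::var("HELEXA_ACP_LOG_FILE").ok()` and pass it as `log_file.as_deref()`.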
rob thijssen	c297a54074	chore(helexa-acp): log raw bash output and tool result snippets Diagnostic for "the tool ran but the model thinks it failed" cases. Logs at debug level: - exec_bash: terminal/create command + cwd, terminal/exit code/signal, terminal/output bytes + truncated flag + 200-char snippet. - dispatch_tool_call: 200-char snippet of every successful result before it's folded back into history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 11:15:26 +03:00
rob thijssen	0121a1930f	feat(helexa-acp): inject and parse Qwen3 Hermes tool format The OpenAI `tools` API field isn't load-bearing in this stack — neuron's chat template renders only message.content, so tool definitions sent that way never reach the model. Move both sides of the tool conversation into the Qwen3 Hermes wire format the model is actually trained on: - Append a `# Tools` block to the system prompt describing every available function (qwen3::render_tool_block). - Parse `<tool_call>{json}</tool_call>` markers out of the streamed content via a chunk-boundary-safe state machine (qwen3::ToolCallParser), surfacing them as the existing CompletionEvent::ToolCall* events so the agent loop doesn't change. 
- Re-serialise assistant turns that called tools with inline `<tool_call>` blocks and tool results as user turns wrapped in `<tool_response>` (qwen3::render_assistant_with_tool_calls, render_tool_response). Verified against cortex+Qwen3.6-27B: the model produces a well-formed `<tool_call>{"name":"list_dir","arguments":{"path":"/tmp"}}</tool_call>` in response to a Hermes-formatted prompt. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 11:06:38 +03:00
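One way a chunk-boundary-safe parser for `<tool_call>…</tool_call>` markers can work is sketched below. This is an illustrative state machine, not the contents of `qwen3::ToolCallParser`: the `Event` type, the `push` method, and the `partial_marker_len` helper are hypothetical, and the payload is returned as a raw string rather than parsed JSON.

```rust
#[derive(Debug, PartialEq)]
enum Event {
    /// Plain assistant text that can be streamed to the user.
    Text(String),
    /// Raw payload found between the markers (JSON parsing omitted here).
    ToolCall(String),
}

const OPEN: &str = "<tool_call>";
const CLOSE: &str = "</tool_call>";

#[derive(Default)]
struct ToolCallParser {
    buf: String,
    in_call: bool,
}

impl ToolCallParser {
    /// Feeds one streamed chunk and returns whatever is now unambiguous.
    fn push(&mut self, chunk: &str) -> Vec<Event> {
        self.buf.push_str(chunk);
        let mut events = Vec::new();
        loop {
            if self.in_call {
                // Inside a call: wait until the closing marker has fully arrived.
                let Some(end) = self.buf.find(CLOSE) else { break };
                let body = self.buf[..end].trim().to_string();
                self.buf.drain(..end + CLOSE.len());
                self.in_call = false;
                events.push(Event::ToolCall(body));
            } else if let Some(start) = self.buf.find(OPEN) {
                if start > 0 {
                    events.push(Event::Text(self.buf[..start].to_string()));
                }
                self.buf.drain(..start + OPEN.len());
                self.in_call = true;
            } else {
                // No full marker yet: emit text, but hold back a tail that
                // could still turn out to be the start of `<tool_call>`.
                let keep = partial_marker_len(&self.buf, OPEN);
                let emit = self.buf.len() - keep;
                if emit > 0 {
                    events.push(Event::Text(self.buf[..emit].to_string()));
                    self.buf.drain(..emit);
                }
                break;
            }
        }
        events
    }
}

/// Length of the longest suffix of `buf` that is a proper prefix of `marker`.
fn partial_marker_len(buf: &str, marker: &str) -> usize {
    (1..marker.len())
        .rev()
        .find(|&n| buf.ends_with(&marker[..n]))
        .unwrap_or(0)
}
```

Feeding the chunks `Listing <tool`, `_call>{"name":"list_dir"}</tool`, and `_call> done` yields the text `Listing `, then nothing, then the tool call followed by the text ` done`. A complete implementation would also need an end-of-stream flush for any held-back tail.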
rob thijssen	13f4c36aeb	chore(helexa-acp): log outgoing chat-completion body at debug level Useful for diagnosing "the model isn't using tools" — confirming that helexa-acp is in fact sending the `tools` array (and what messages, system prompt, etc. accompany it) without having to attach a packet capture upstream of cortex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 10:38:10 +03:00
rob thijssen	4a51a54554	fix(helexa-acp): describe Stage 3 tools in the default system prompt The Stage 2 prompt told the model it had no tools, which models trained for caution then dutifully repeat back ("Stage 2 build: no tools available — I can't read files…"). Stage 3 ships tools in the CompletionRequest.tools array, but the system message was still overriding that. Update the default prompt to list the five tools and instruct the model to use them rather than asking the user to paste contents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 10:33:17 +03:00
rob thijssen	0609f1ac5d	feat(helexa-acp): add tools, session modes, and permission gating Stage 3 introduces five tools (read_file, write_file, edit_file, list_dir, bash) backed by ACP fs/* and terminal/* calls, a ClientOps trait so the runner is mock-testable, two session modes (default + bypassPermissions) with session/set_mode honouring them, and a tool-call loop in the agent that streams the model, dispatches each call, feeds results back into history, and re-enters until the model finishes or MAX_TOOL_ROUNDS is hit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 10:01:32 +03:00
rob thijssen	96fc379893	feat(helexa-acp): wire ACP agent loop for text-only conversations Stage 2 lands the agent loop on top of the Stage 1 scaffold: session state with per-session cancellation, a system-prompt builder honouring HELEXA_ACP_SYSTEM_PROMPT_PATH / system_prompt_path TOML, and handlers for initialize / session/new / session/prompt / session/cancel that stream provider output back as session/update notifications. Verified end-to-end against cortex from Zed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 09:46:22 +03:00
rob thijssen	e267f583e1	chore(neuron): rustfmt drift in is_device_fault test One assert! call grew past the line limit after the previous commits; cargo fmt --all picked it up. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:13:55 +03:00
rob thijssen	e23d5011d0	feat(helexa-acp): scaffold ACP bridge with provider trait + OpenAI chat Adds a new workspace crate `helexa-acp` (binary, Apache-2.0) — the start of "the missing ACP binary" for multi-endpoint LLM setups mixing public APIs, private LAN deployments, and various wire formats. Today it speaks OpenAI /v1/chat/completions; the Provider trait is the seam that lets OpenAI Responses, Anthropic /v1/messages, and other wire formats slot in later without touching the agent loop. The crate is intentionally self-contained — no dependencies on the other workspace crates (cortex-core, cortex-gateway, neuron) — so a future migration to a dedicated GitHub repo is a Cargo.toml-only change. All deps come from crates.io. This commit lands: * `config.rs` — TOML config at $XDG_CONFIG_HOME/helexa-acp/config.toml with multi-endpoint support (each `[[endpoints]]` declares its name, base_url, wire_api, default_model, optional API key / api_key_env). Falls back to env-only single-endpoint config when no TOML exists (HELEXA_ACP_BASE_URL, HELEXA_ACP_MODEL, etc.). The `endpoint:model` selector syntax is validated and tested. * `provider/mod.rs` — `Provider` trait + provider-agnostic types (`CompletionRequest`, `CompletionEvent`, `Message`, `ToolCall`, `ToolSpec`, `Role`, `UsageStats`). Agent loop consumes these without knowing the wire format on the other side. * `provider/openai_chat.rs` — `OpenAIChatProvider` impl. Compatible with cortex, LM Studio, Ollama (compat mode), OpenRouter, OpenAI itself. Streams via reqwest + eventsource-stream + async-stream. Surfaces text deltas, reasoning deltas (for models that emit `reasoning_content`), tool-call lifecycle (start, args-delta, completion), usage, finish reason. Cancellation-token aware. * `main.rs` — tokio + stderr-only tracing-subscriber + Stdio transport. Builds a provider per configured endpoint at startup, surfacing config mistakes before the editor even initializes. 
Currently responds to `initialize`; everything else stubs to `not implemented yet` until the agent loop lands in the next commit. 12 unit tests pass — encoder shape, decoder shape (text-only, tool-call progressive, cancellation, malformed-chunk recovery), config parsing (multi-endpoint TOML, env fallback, validation). The `#![allow(dead_code)]` on `provider/mod.rs` is temporary — the agent loop in the next commit reads every field. It's noted in the module-level docstring so the next reader knows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:13:47 +03:00
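For illustration, validating an `endpoint:model` selector could look like the sketch below. The function name `parse_selector` and the specific rules (split at the first colon, reject empty halves) are assumptions; the commit only states that the syntax is validated and tested.

```rust
/// Splits an `endpoint:model` selector at the first colon.
/// Assumption: later colons belong to the model name (for example `qwen3:8b`),
/// and both halves must be non-empty.
fn parse_selector(selector: &str) -> Result<(&str, &str), String> {
    match selector.split_once(':') {
        Some((endpoint, model)) if !endpoint.is_empty() && !model.is_empty() => {
            Ok((endpoint, model))
        }
        _ => Err(format!("expected `endpoint:model`, got `{selector}`")),
    }
}
```

Under these assumed rules, `cortex:Qwen/Qwen3.6-27B` resolves to the endpoint `cortex` and the model `Qwen/Qwen3.6-27B`, while a bare `cortex` is rejected.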
rob thijssen	249b2e5c98	fix(neuron): only poison the model on actual device faults Previously every inference Err — shape mismatch, NaN logits, tokenizer error, missing handle — marked the model poisoned and rejected every subsequent request until an operator unload+reloaded. The benjy incident on 2026-05-27 showed how this misfires: a concurrency bug produced a `broadcast_add: shape mismatch` error that had nothing to do with CUDA, but the model was taken down anyway. Add `is_device_fault(err_chain: &str)` — a conservative classifier that returns false only for errors we know are pre-kernel / CPU-side (shape mismatches, NaN logits, tokenize/detokenize, missing handle, DecodeStream, empty prompt). 
Everything else defaults to true so a genuine driver fault still poisons. Applied at all six poisoning sites: - chat_completion CUDA worker path - chat_completion CPU spawn_blocking path - chat_completion_stream CUDA worker path - chat_completion_stream CPU spawn_blocking path - chat_completion_tp non-streaming wrapper - chat_completion_tp_stream spawned task Each site now logs either "model marked poisoned" (device fault) or "model NOT marked poisoned" (non-device) so the journal makes the classification visible. Tests cover the known non-device patterns and a couple of real CUDA driver messages. Pairs with the inference_lock commit (`c59da83`): together they eliminate both the cause of the spurious-poisoning we just observed (the shape mismatch) AND the over-reaction to it (the unconditional poison). Each fix is independently useful but the combination is what makes the system actually robust to concurrent agent workloads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:57:48 +03:00
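A conservative classifier of this shape might look like the following sketch. The signature `is_device_fault(err_chain: &str)` comes from the commit message; the pattern strings and the case-insensitive substring matching are illustrative guesses, not the patterns neuron actually checks.

```rust
/// Illustrative substrings for errors known to be CPU-side or pre-kernel.
/// These exact strings are assumptions made for the sketch.
const NON_DEVICE_PATTERNS: &[&str] = &[
    "shape mismatch",
    "nan logits",
    "tokeniz", // matches both tokenize and detokenize failures
    "missing handle",
    "decodestream",
    "empty prompt",
];

/// Returns false only when the error chain matches a known non-device
/// pattern; anything unrecognised is treated as a device fault, so a
/// genuine driver error still poisons the model.
fn is_device_fault(err_chain: &str) -> bool {
    let lower = err_chain.to_lowercase();
    !NON_DEVICE_PATTERNS.iter().any(|pattern| lower.contains(pattern))
}
```

Defaulting to `true` keeps the failure mode safe: a new, unclassified error costs an unnecessary reload at worst, rather than leaving a corrupted CUDA context in service.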
rob thijssen	c59da83636	fix(neuron): serialise single-GPU inference per loaded model Two concurrent chat_completion requests against the same single-GPU model could interleave their `clear_kv_cache → forward(chunk0) → forward(chunk1) → ...` sequences. The device-worker channel serialises individual jobs but not the sequence boundary, so the cache could end up holding tokens from one request while another's mask was sized for its own prompt — producing a shape mismatch mid-prefill. Observed on benjy 2026-05-27 18:41:05: agent-zero's `memorize memories` and `memorize solutions` extensions fired 4ms apart against Qwen/Qwen3-8B (a0's utility model). Both prefilled into the same KV cache, and request a08b4a's chunk 0 forward produced scores of shape [1, 32, 512, 1024] against a mask of [1, 1, 512, 512] — broadcast_add failed, both requests bubbled the error up, both flipped the model to poisoned. Add `LoadedModel.inference_lock: tokio::sync::Mutex<()>`, mirroring the TpLoadedModel.pool lock that the TP path already held. Acquire it at the start of `chat_completion` and inside the spawned task of `chat_completion_stream` (so the role chunk goes out immediately and only the inference work queues behind the lock). The CPU branch uses `blocking_lock` from inside spawn_blocking; the CUDA branch uses async `.lock().await` inside tokio::spawn. Throughput impact: zero. The GPU was already serialised at the device-worker channel — multiple requests just produced corrupt KV cache state instead of clean serial throughput. The lock makes the existing serialisation honest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:54:04 +03:00
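The idea of holding one lock across the whole `clear_kv_cache → forward(chunk0) → forward(chunk1) → ...` sequence can be sketched with the standard library alone. This swaps in `std::sync::Mutex` and OS threads for the `tokio::sync::Mutex` and async tasks named in the commit, and the `kv_cache` vector of request ids is a hypothetical stand-in for real cache state.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

struct LoadedModel {
    /// Held for the full clear-then-prefill sequence of one request.
    inference_lock: Mutex<()>,
    /// Stand-in for the KV cache: each entry records which request wrote it.
    kv_cache: Mutex<Vec<u32>>,
}

/// Runs one request and reports whether the cache holds only its own entries.
fn run_request(model: &LoadedModel, request_id: u32, chunks: usize) -> bool {
    let _guard = model.inference_lock.lock().unwrap();
    model.kv_cache.lock().unwrap().clear();
    for _ in 0..chunks {
        // Each push is individually serialised (like a device-worker job),
        // but only the outer lock protects the sequence as a whole.
        model.kv_cache.lock().unwrap().push(request_id);
        thread::yield_now();
    }
    let cache = model.kv_cache.lock().unwrap();
    let clean = cache.len() == chunks && cache.iter().all(|&id| id == request_id);
    clean
}

/// Fires `requests` concurrent requests at one model.
fn run_concurrently(requests: u32, chunks: usize) -> bool {
    let model = Arc::new(LoadedModel {
        inference_lock: Mutex::new(()),
        kv_cache: Mutex::new(Vec::new()),
    });
    let handles: Vec<_> = (0..requests)
        .map(|id| {
            let model = Arc::clone(&model);
            thread::spawn(move || run_request(&model, id, chunks))
        })
        .collect();
    let results: Vec<bool> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    results.into_iter().all(|ok| ok)
}
```

Removing the `_guard` line leaves each individual cache operation serialised but lets two requests interleave their sequences, which is the class of corruption the commit describes.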
rob thijssen	f05882369d	fix(neuron): don't poison the model on tokio JoinError panics CUDA driver failures propagate as Err through `?` and become `Ok(Err(InferenceError::Other(_)))` from the spawned task — those are real device faults and still poison the model. Tokio JoinError is different: it fires on Rust-level panic (tokenizer bug, sampler bug, serialisation, the UTF-8 slice that landed in commit `bd04d7f` before the fix) or task cancellation. Those don't touch the device context, so failing the one request without tearing down the model is correct. Two sites changed: - chat_completion's CPU spawn_blocking handler — JoinError no longer sets loaded.poisoned. 
- chat_completion_tp's tokio::spawn wrapper — JoinError no longer sets tp_for_marker.poisoned. The inner-Err case still does. Each path logs the cause (panicked / was cancelled / ended abnormally) explicitly so the journal makes the new behaviour obvious — search for "model NOT marked poisoned" to find these events. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:02:52 +03:00
rob thijssen	bd04d7f580	fix(neuron): stream tokens via DecodeStream to avoid UTF-8 panic When BPE byte-fallback splits a multi-byte UTF-8 char (e.g. an emoji) across multiple tokens, the previous "decode the cumulative token list, byte-slice the delta against a stored prefix" pattern would panic with 'start byte index N is not a char boundary; it is inside <emoji>'. The race: at step N the tokenizer renders the partial bytes as U+FFFD (3 bytes); at step N+1 it can decode the complete codepoint (e.g. 4 bytes for 🌫). `decoded_prefix.len()` from step N then lands inside the codepoint in step N+1's `full` string, and `&str[start..]` panics. Replace with tokenizers' `DecodeStream::step(id)` which maintains an internal byte buffer across token boundaries and only emits when a clean codepoint completes. Applied at all three SSE emission sites: - stream_inference_via_worker (single-GPU CUDA stream) - chat_completion_tp_stream's spawned task (TP stream) - run_inference_streaming (CPU stream) The shared emit helper splits into emit_delta (async, mpsc::send) and emit_delta_blocking (sync, mpsc::blocking_send) so each path keeps its existing send semantics. The old emit_chunk helper that did the unsafe full-decode-and-slice is removed entirely. Observed on beast 2026-05-27 17:49:55 — model emitted 🌫 in a tool-call response after a long agent-zero session; the spawned TP stream task panicked at candle.rs:2648. The model itself stayed healthy (no CUDA fault), only the one streaming request died. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:01:24 +03:00
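The buffering idea behind this fix can be shown without the tokenizers crate. The sketch below is not the `DecodeStream` API (which steps over token ids); `Utf8Stream` is a hypothetical type that steps over raw bytes and only emits text once a codepoint is complete. It assumes the input is eventually valid UTF-8 and does not handle genuinely invalid sequences.

```rust
#[derive(Default)]
struct Utf8Stream {
    /// Bytes received so far that do not yet form complete codepoints.
    pending: Vec<u8>,
}

impl Utf8Stream {
    /// Appends the bytes of one token and returns the longest complete
    /// UTF-8 prefix, or None when only a partial codepoint is buffered.
    fn step(&mut self, bytes: &[u8]) -> Option<String> {
        self.pending.extend_from_slice(bytes);
        let valid_len = match std::str::from_utf8(&self.pending) {
            Ok(text) => text.len(),
            // An incomplete trailing sequence reports how much is valid so far.
            Err(err) => err.valid_up_to(),
        };
        if valid_len == 0 {
            return None;
        }
        let ready: Vec<u8> = self.pending.drain(..valid_len).collect();
        Some(String::from_utf8(ready).expect("prefix was validated above"))
    }
}
```

Because nothing is ever sliced out of a previously decoded string, there is no stored byte offset that can land inside a codepoint, which is the failure mode the old decode-and-slice pattern had.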
rob thijssen	1e13889392	feat(neuron): chunked prefill + VRAM/prompt-length pre-flight checks Prevents the OOM-during-prefill → poisoned-context → 5-minute-reload cycle observed on beast under agent-zero workloads. Three changes, all keyed off env-driven knobs so an operator can tune without a rebuild: 1. Chunked prefill (NEURON_PREFILL_CHUNK_TOKENS, default 512). The initial forward is split into N-token windows, each with a monotonically growing offset. KV cache accumulates across chunks exactly as it would under one big prefill; only the final chunk's logits are kept for sampling. Activation memory now scales with chunk size instead of prompt length, so a 13 k-token prompt stops holding tens of GB of intermediate activations live at once. 
Wired into all six prefill call sites: - run_inference / run_inference_streaming (CPU path) - run_inference_via_worker / stream_inference_via_worker (CUDA single-GPU through device worker) - chat_completion_tp_inner / chat_completion_tp_stream (TP via WorkerPool) Three helpers — chunked_prefill_local, chunked_prefill_via_worker, chunked_prefill_tp — own the loop shape so the chunking semantics stay identical across paths. Per-chunk debug log shows progress. 2. Max prompt length (NEURON_MAX_PROMPT_TOKENS, default 16384). Requests above the cap return a structured 400 with `code: prompt_too_long` rather than going through the prefill and discovering the limit by OOMing partway through. New InferenceError::PromptTooLong variant. 3. Minimum free VRAM gate (NEURON_MIN_FREE_VRAM_MB, default 1500). If `vram_free_mb` is below the threshold at request start (e.g. another concurrent request is mid-prefill), reject with a clean 503 + `code: insufficient_vram` rather than starting work that will OOM. New InferenceError::InsufficientVram variant. CPU loads (vram=0 sentinel) skip this check. All three gates fire BEFORE any device work, so a rejected request costs ~one tokenisation pass and never touches the worker thread — poison cascades from rejected work are now impossible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 13:46:54 +03:00
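The three gates can be sketched as plain functions. The defaults (512, 16384, 1500) and the variant names `PromptTooLong` and `InsufficientVram` come from the commit message; the `Limits` struct, the variant fields, and the function names are hypothetical, and treating `vram_free_mb == 0` as the CPU sentinel is an assumption about where that sentinel lives.

```rust
#[derive(Debug, PartialEq)]
enum InferenceError {
    PromptTooLong { tokens: usize, max: usize },
    InsufficientVram { free_mb: u64, min_mb: u64 },
}

/// Hypothetical holder for the three env-driven knobs.
struct Limits {
    chunk_tokens: usize,      // NEURON_PREFILL_CHUNK_TOKENS
    max_prompt_tokens: usize, // NEURON_MAX_PROMPT_TOKENS
    min_free_vram_mb: u64,    // NEURON_MIN_FREE_VRAM_MB
}

impl Default for Limits {
    fn default() -> Self {
        Limits { chunk_tokens: 512, max_prompt_tokens: 16384, min_free_vram_mb: 1500 }
    }
}

/// Runs before any device work, so a rejected request never reaches the worker.
fn preflight(prompt_tokens: usize, vram_free_mb: u64, limits: &Limits) -> Result<(), InferenceError> {
    if prompt_tokens > limits.max_prompt_tokens {
        return Err(InferenceError::PromptTooLong {
            tokens: prompt_tokens,
            max: limits.max_prompt_tokens,
        });
    }
    // Assumption: 0 is the "CPU load" sentinel and skips the VRAM gate.
    if vram_free_mb != 0 && vram_free_mb < limits.min_free_vram_mb {
        return Err(InferenceError::InsufficientVram {
            free_mb: vram_free_mb,
            min_mb: limits.min_free_vram_mb,
        });
    }
    Ok(())
}

/// Splits a prompt into (offset, length) windows with growing offsets;
/// only the last window's logits would be kept for sampling.
fn prefill_windows(prompt_tokens: usize, chunk_tokens: usize) -> Vec<(usize, usize)> {
    let chunk = chunk_tokens.max(1); // guard against a zero chunk size
    (0..prompt_tokens)
        .step_by(chunk)
        .map(|offset| (offset, chunk.min(prompt_tokens - offset)))
        .collect()
}
```

With the default chunk size, a 1300-token prompt becomes the windows `(0, 512)`, `(512, 512)`, and `(1024, 276)`.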
rob thijssen	35876954cd	chore(neuron): default tracing filter to info (was info,neuron=debug) Production deployments that want neuron-internal debug detail (e.g. trim_device_pool's per-clear-kv line, slab inserts/drops) override RUST_LOG explicitly via systemd. Defaulting to debug for the whole neuron target produced a lot of journal volume that wasn't useful in the common case. beast already sets RUST_LOG=debug in /etc/systemd/system/neuron.service.d/local.conf, so beast's verbosity is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:47:30 +03:00
rob thijssen	cdf0f4e66d	fix(neuron): trim cudarc mempool after clear_kv_cache to release VRAM cudarc's stream-ordered memory pool retains freed blocks (cuMemFreeAsync returns memory to the device's default mempool, not to the OS), so mem_get_info under-reports free VRAM between requests. With Qwen/Qwen3.6-27B TP=2, the second consecutive chat completion saw ~4.5 GB of "missing" free VRAM and either OOMed or tripped cuBLAS into CUBLAS_STATUS_INTERNAL_ERROR depending on quant. Add a cuda-gated trim_device_pool helper that, after each successful clear_kv_cache, synchronizes the context and calls cuMemPoolTrimTo(pool, 0) against the device's default mempool. Failures (no async-alloc support, transient driver errors) are non-fatal and log at debug. The before/after free-VRAM delta is logged so an operator can correlate the trim with the next request's prefill VRAM. ConcatKvCache::reset() in candle-nn 0.10.2 already drops its tensors correctly; the leak was strictly at the cudarc pool layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:36:13 +03:00
rob thijssen	b4f3576d82	refactor(neuron): phase 4 — model loads move onto the device worker Final structural slice of the per-device CUDA context-ownership refactor. The four remaining spawn_blocking sites that did CUDA work on the leader are gone: - Single-GPU GGUF load (`load_arch_gguf` spawn_blocking) → `Job::LoadGguf` dispatched on the worker. - Single-GPU dense load (`load_arch_dense` spawn_blocking) → `Job::LoadDense` on the worker. - TP shard load (`WorkerPool::load_dense_shard` spawn_blocking) → `Job::TpLoadShard`. The dispatch handler reads `state.nccl.comm()` directly — no cross-thread `Arc<Comm>` transfer, no `SendComm` wrapper for this path. 
The Phase 2 / Phase 3 bridges that moved freshly-built models across the channel boundary (`Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm`) are removed. Models are now constructed on the worker thread directly; the slab gets populated by `insert_arch` / the inline `tp_models.insert` in dispatch handlers. What this phase preserves: - CPU loads still use `tokio::task::spawn_blocking` against `Arc<Mutex<ModelArch>>`. There's no CUDA context to own on CPU and channel overhead would only add latency. Four `spawn_blocking` references remain in `candle.rs` (load_arch_gguf, load_arch_dense, chat_completion, chat_completion_stream) and all are deliberate CPU-only fallback. - Public API unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical signatures. What this phase removes: - `SendComm` wrapper is no longer used in the load path (the Phase 3 bridge that justified it). It remains in `nccl_state.rs` for the Phase 1–3 era and any future cross-thread Comm move; consider deleting in a follow-up. - `Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm` and their handle convenience methods deleted. - The leader_device parameter on `load_dense_shard` is now `_` — unused since the worker has its own bound device. Removing the arg outright is a public-API change; keeping the underscore prefix preserves the signature and signals deadness without churn. Helper relocation: - `LlamaDense::from_parts` is a new pub(crate) constructor so the worker-thread loader can build a `LlamaDense` without going through the original `load_arch_dense` async function. - `check_dense_config_supported` is bumped to `pub(crate)` for the same reason. Sweep verified: `grep -rn spawn_blocking crates/neuron/src/harness/` returns only CPU-fallback hits in `candle.rs` + doc-comment references to the old design. All four leader-side CUDA `spawn_blocking` sites are gone. fmt + clippy clean; 37 lib tests + all integration tests pass. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:24:38 +03:00
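The shape this phase arrives at (models built on the worker thread, stored in a slab, and addressed by opaque handles) can be sketched with the standard library alone. Everything below is illustrative: `ToyModel`, `Handle`, the three `Job` variants and `spawn_worker` are hypothetical stand-ins for `ModelArch`, `ArchHandle` and the real job set, and no CUDA work takes place.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Opaque key into the worker's slab (stand-in for `ArchHandle`).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Handle(u64);

/// Stand-in for a loaded model; the real one would own device tensors.
struct ToyModel {
    scale: f32,
}

/// Requests sent to the worker; each carries its own reply channel.
enum Job {
    Load { scale: f32, reply: mpsc::Sender<Handle> },
    Forward { handle: Handle, input: Vec<f32>, reply: mpsc::Sender<Option<Vec<f32>>> },
    DropModel { handle: Handle, reply: mpsc::Sender<bool> },
}

/// Spawn the worker thread. The slab lives only on that thread.
fn spawn_worker() -> (mpsc::Sender<Job>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Job>();
    let join = thread::spawn(move || {
        let mut slab: HashMap<Handle, Box<ToyModel>> = HashMap::new();
        let mut next_handle = 0u64;
        // The loop ends when every sender has been dropped.
        for job in rx {
            match job {
                Job::Load { scale, reply } => {
                    // The model is constructed here, on the worker, never moved across threads.
                    let handle = Handle(next_handle);
                    next_handle += 1;
                    slab.insert(handle, Box::new(ToyModel { scale }));
                    let _ = reply.send(handle);
                }
                Job::Forward { handle, input, reply } => {
                    // Only plain host data (a Vec<f32>) travels back to the caller.
                    let out = slab
                        .get(&handle)
                        .map(|m| input.iter().map(|x| x * m.scale).collect::<Vec<f32>>());
                    let _ = reply.send(out);
                }
                Job::DropModel { handle, reply } => {
                    // Removal drops the boxed model on the worker thread.
                    let _ = reply.send(slab.remove(&handle).is_some());
                }
            }
        }
    });
    (tx, join)
}
```

A caller holds only a `Handle` and a `Sender<Job>`; the design choice this illustrates is that nothing owning device state ever crosses the channel, which is why a transfer-style bridge job is no longer needed once loading itself runs on the worker.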
rob thijssen	76ab24d98c	refactor(neuron): phase 3 — TP forward + NCCL state move onto device worker Third slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The leader's `NcclState`, every `Comm::all_reduce` issued by the TP layers, the leader-side KV cache reset, and the TP forward step itself now all run on the per-device worker thread — the same OS thread that bound the leader's `CudaContext` at startup. What this phase changes: - `Job` gains `NcclInit`, `NcclSanity`, `CloneLeaderComm` (Phase 3 bridge — Phase 4 removes), `TransferInTp`, `DropTp`, `TpClearKv`, `TpForwardLogits`. Plus a new `TpHandle(u64)` opaque key. 
- `DeviceWorkerState` gains `nccl: NcclState` and `tp_models: HashMap<TpHandle, Box<TpLeaderModel>>` (+ counter). - `WorkerPool` loses its `leader_nccl` field; gains a `leader_worker: Arc<DeviceWorkerHandle>` passed at construction. `init_nccl`, `nccl_sanity_check`, `load_dense_shard`, `generate_step`, `clear_kv_cache` all route their leader-side ops through `Job::Nccl` / `Job::Tp` instead of spawn_blocking against a Mutex-wrapped state. `generate_step` returns `Vec<f32>` instead of a device-resident `Tensor` — the worker copies logits to CPU before reply so the async caller can sample on a CPU candle tensor with zero device-context touch. - `TpLoadedModel.leader_model: Arc<Mutex<TpLeaderModel>>` → opaque `leader_handle: TpHandle`. The boxed `TpLeaderModel` lives in the worker thread's slab; both the model's CUDA tensors and the embedded `Arc<Comm>` clones release on the same thread that allocated them (the Drop semantics constraint cudarc forces). - `Job::CloneLeaderComm` is a Phase 3 bridge: the TP shard load still runs in spawn_blocking and needs the leader's `Arc<Comm>` to build the row-parallel layers' AllReduce ops. The Job clones the Comm out of the worker's NcclState and ships it back as `SendComm`. Phase 4 deletes this bridge when the load itself moves onto the worker. - `Job::NcclInit` and `Job::NcclSanity` are ungated by `cuda` so the no-cuda `NcclState` stubs (which reply with `cuda_feature_not_enabled`) still flow through the same channel uniformly; the cuda-only TP variants (CloneLeaderComm, Transfer/Drop/Clear/Forward Tp) remain gated. What this phase doesn't touch (yet): - TP shard load itself — still spawn_blocking, bridged via `CloneLeaderComm`. Phase 4 moves it to `Job::TpLoadShard` and reads `state.nccl.comm()` directly inside the worker. - Single-GPU model loads — still spawn_blocking, transferred via `Job::TransferIn`. Phase 4 moves them. 
- `device_vram_mb` / `cuda_mem_mb` / `log_construction_complete` helpers — still present, used inside spawn_blocking load closures. Phase 4 cleanup folds them into `dispatch.rs`. `tp/mod.rs::WorkerPool::spawn` gained a required `leader_worker: Arc<DeviceWorkerHandle>` argument. Three external callers were updated: `CandleHarness::load_tp` (passes the cached device worker), `main.rs::tp_smoke` (spawns a fresh worker), and the two `tp_worker_lifecycle*.rs` integration tests. Public API unchanged. fmt + clippy clean; 37 lib tests + all integration tests pass. CUDA-only TP integration smoke deferred to the next deploy on beast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:16:02 +03:00
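The drop-on-the-allocating-thread constraint mentioned in the commit above can be demonstrated without CUDA. The sketch below is an assumption-laden illustration: `Pinned`, the two-variant `Job` enum, `run_worker` and `drop_on_owner` are hypothetical, and a thread-id comparison stands in for "the context that allocated the memory is bound here".

```rust
use std::sync::mpsc;
use std::thread::{self, ThreadId};

/// Stand-in for a resource that must be released on the thread that created it.
struct Pinned {
    owner: ThreadId,
    report: mpsc::Sender<bool>,
}

impl Drop for Pinned {
    fn drop(&mut self) {
        // Report whether the drop is running on the creating thread.
        let _ = self.report.send(thread::current().id() == self.owner);
    }
}

/// Requests understood by the toy worker.
enum Job {
    Create,
    DropIt,
}

/// Worker loop: creates and destroys the resource on its own thread only.
fn run_worker(jobs: mpsc::Receiver<Job>, report: mpsc::Sender<bool>) {
    let mut held: Option<Pinned> = None;
    for job in jobs {
        match job {
            Job::Create => {
                held = Some(Pinned { owner: thread::current().id(), report: report.clone() });
            }
            // Clearing the slot runs `Drop` here, on the worker thread.
            Job::DropIt => held = None,
        }
    }
}

/// Drives the worker from the calling thread; returns true if the drop ran on the owner.
fn drop_on_owner() -> bool {
    let (job_tx, job_rx) = mpsc::channel();
    let (rep_tx, rep_rx) = mpsc::channel();
    let worker = thread::spawn(move || run_worker(job_rx, rep_tx));
    job_tx.send(Job::Create).unwrap();
    job_tx.send(Job::DropIt).unwrap();
    let on_owner = rep_rx.recv().unwrap();
    drop(job_tx); // closing the channel lets the worker loop finish
    worker.join().unwrap();
    on_owner
}
```

Because the caller only ever sends a message asking for the drop, the destructor cannot run on an async runtime thread by accident, which is the property an opaque handle such as `TpHandle` is meant to preserve.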
rob thijssen	b179204fd3	refactor(neuron): phase 2 — single-GPU forward + clear_kv route through device worker Second slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The two spawn_blocking sites in `chat_completion` and `chat_completion_stream` now route through the device worker thread on CUDA loads. CPU loads keep the existing spawn_blocking + `Arc<Mutex<ModelArch>>` path; there's no context to own and the channel hop would only add latency. What this phase changes: - `Job` gains `TransferIn`, `DropArch`, `ClearKv`, `ForwardLogits`. 
The worker's dispatch state grows a `HashMap<ArchHandle, Box<ModelArch>>` slab and a `next_handle` counter for minting opaque handles. - `LoadedModel.arch: Arc<Mutex<ModelArch>>` → `Option<Arc<Mutex<>>>`, plus a new `arch_handle: Option<ArchHandle>` field. The two are mutually exclusive: CUDA loads set `arch_handle = Some(_)` after transferring the boxed arch into the worker's slab; CPU loads keep `arch = Some(_)` for the legacy spawn_blocking path. - New `run_inference_via_worker` and `stream_inference_via_worker` drive the prefill + decode loop by sending `Job::ForwardLogits` per step; the worker copies the resulting `[vocab]` logits to a CPU-side `Vec<f32>` before reply, so the async caller never holds a device-resident tensor. `apply_repeat_penalty` and `LogitsProcessor::sample` run on a CPU candle tensor; no context binding side-effects on tokio worker threads. - `logits_health_slice(&[f32])` complements the existing `logits_health(&Tensor)` so the new worker paths can compute health stats directly from the CPU vec. - `unload_model` for the single-GPU CUDA path now sends `Job::DropArch { handle }` to the worker so the `Box<ModelArch>` drops on the thread that allocated its CUDA tensors. The `Drop` runs with the bound context, freeing memory on the right context. What this phase doesn't touch (yet): - TP forward, TP load, NCCL bring-up — still on spawn_blocking. Phase 3. - Single-GPU model load — still spawn_blocking, followed by a `Job::TransferIn` to move the freshly-built `ModelArch` into the worker slab. Phase 4 moves the load itself onto the worker thread and eliminates the bootstrap TransferIn. - The `device_vram_mb` / `cuda_mem_mb` helpers — still present and used by the construction-time logs running inside spawn_blocking loads. Phase 4 cleanup folds them into `dispatch.rs`. Public API unchanged. fmt + clippy clean; 37 lib tests + all integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:55:08 +03:00
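Once logits arrive as a CPU-side `Vec<f32>`, health checks and sampling need no device access at all. The following is a hedged sketch of that idea: `LogitsHealth`, `health_of`, `repeat_penalty_cpu` and `argmax` are hypothetical helpers, not the `logits_health_slice` or `apply_repeat_penalty` implementations the commit refers to, whose exact fields and behaviour are not shown in the log. The penalty rule used here (divide non-negative logits, multiply negative ones, once per distinct token) is a common convention and is assumed.

```rust
use std::collections::HashSet;

/// Hypothetical health summary over a logits slice.
#[derive(Debug, PartialEq)]
struct LogitsHealth {
    nan: usize,
    inf: usize,
    min: f32, // smallest finite value (+inf if none)
    max: f32, // largest finite value (-inf if none)
}

/// Count non-finite entries and track the finite range, reading only host memory.
fn health_of(logits: &[f32]) -> LogitsHealth {
    let mut h = LogitsHealth { nan: 0, inf: 0, min: f32::INFINITY, max: f32::NEG_INFINITY };
    for &x in logits {
        if x.is_nan() {
            h.nan += 1;
        } else if x.is_infinite() {
            h.inf += 1;
        } else {
            h.min = h.min.min(x);
            h.max = h.max.max(x);
        }
    }
    h
}

/// Penalise tokens already present in `context`, each distinct token once.
fn repeat_penalty_cpu(logits: &mut [f32], penalty: f32, context: &[u32]) {
    let mut seen = HashSet::new();
    for &tok in context {
        if !seen.insert(tok) {
            continue; // already penalised
        }
        if let Some(l) = logits.get_mut(tok as usize) {
            if *l >= 0.0 {
                *l /= penalty;
            } else {
                *l *= penalty;
            }
        }
    }
}

/// Greedy pick: index of the largest non-NaN logit, if any.
fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .filter(|(_, x)| !x.is_nan())
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

A decode step under these assumptions would receive the vector from the worker, call `health_of` for logging, apply `repeat_penalty_cpu` over the tokens generated so far, and then choose the next token, all on the async caller's thread.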


129 Commits