Commit Graph

113 Commits

Author SHA1 Message Date
df0abfe4d4 feat(helexa-acp): image input for vision-capable models
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m33s
CI / Test (push) Successful in 5m4s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m2s
build-prerelease / Build neuron-ampere (push) Successful in 7m49s
build-prerelease / Build neuron-ada (push) Successful in 5m27s
build-prerelease / Build cortex binary (push) Successful in 4m16s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m10s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
Stage 5. Zed clipboard/DnD images get forwarded as OpenAI
content-array messages on user turns.

- New MessageContent::MultiPart variant + MessagePart (Text|Image)
  + ImageData struct (mime_type, base64 data, optional uri).
- flatten_prompt now produces structured content: collapses to
  Text when every block is text (some upstreams treat array-form
  as vision-only and refuse on text-only models), otherwise
  produces MultiPart preserving block order.
- OpenAI encoder emits `[{type:"text",text:…}, {type:"image_url",
  image_url:{url:"data:{mime};base64,{data}"}}]` for MultiPart user
  messages. Data URIs are used over remote `uri` because they
  round-trip through every upstream we care about.
- prompt_capabilities.image = true at initialize so Zed actually
  sends image blocks.
- compaction estimates ~512 tokens per image (the middle of the
  Qwen3-VL / OpenAI detail range) so the budget tracker doesn't
  pretend images are free.
- session/load replays image-bearing user turns by surfacing the
  text parts verbatim and rendering each image as a "[image: {mime}
  ({n} bytes)]" placeholder chunk — Zed can show the prior text
  context even though re-uploading the bytes through ACP isn't
  meaningful for resume.
- 4 new tests: flatten produces MultiPart in block order, image-only
  prompts still flatten to MultiPart, encoder emits the correct
  array shape, text-only encoding stays as the string form.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 09:43:00 +03:00
b9016571f6 feat(helexa-acp): expand ~ / $HOME and fall back to local fs on ACP read errors
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 44s
CI / Format (push) Successful in 50s
CI / Clippy (push) Successful in 2m34s
build-prerelease / Build cortex binary (push) Successful in 4m29s
CI / Test (push) Successful in 5m13s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
build-prerelease / Build neuron-ampere (push) Successful in 8m15s
build-prerelease / Build neuron-ada (push) Successful in 5m23s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Two related polish fixes for daily use:

- New `path_util` module expands `~`, `~/…`, `$HOME`, and `$HOME/…`
  prefixes in every tool that takes a path (read_file, write_file,
  edit_file, list_dir, bash cwd). The expansion is also applied to
  the plan-mode write gate so `~/.local/share/helexa-acp/plans/…`
  comparisons behave correctly regardless of which form the model
  emits.
- `read_file` now falls back to `std::fs::read_to_string` when ACP's
  `fs/read_text_file` errors out. Zed's workspace-scoped read was
  the source of "model can't see ~/git/architecture/generic.md"
  when the session cwd is a different project; the fallback lets
  the agent pull in shared material that lives outside the active
  workspace, the same way `list_dir` already does via local
  `std::fs::read_dir`. Local fallback honours line/limit args.

The fallback also produces a combined error message when both ACP
and local-fs reads fail, so the model sees what actually broke
rather than just the ACP-side error.

14 new unit tests cover path_util's prefix matrix, fallback
success/failure paths, and the line/limit slicing in fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 09:28:58 +03:00
adbc52bfcd feat(helexa-acp): model picker + session/set_model handler
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 41s
CI / Clippy (push) Successful in 2m32s
build-prerelease / Build cortex binary (push) Successful in 4m45s
CI / Test (push) Successful in 5m52s
build-prerelease / Build neuron-blackwell (push) Successful in 5m59s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 7m21s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ada (push) Successful in 4m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Stage 4. Zed's model dropdown now lists every model from every
configured endpoint, and switching it routes the next prompt to a
new endpoint+model.

- Enable `unstable_session_model` on the agent-client-protocol dep
  so SessionModelState / SetSessionModelRequest / ModelInfo are
  available.
- Agent::new becomes async and calls Provider::list_models on every
  provider at startup; per-endpoint failures warn-and-skip instead
  of aborting the agent.
- With a single endpoint configured, model ids appear bare; with
  multiple endpoints every id carries the `endpoint:` prefix so the
  picker is unambiguous and parse_model_selector routes correctly.
- NewSessionResponse and LoadSessionResponse attach SessionModelState
  with the session's current model id + the aggregated catalogue.
- session/set_model: validates the requested model id against
  resolve_provider, mutates session.model_id, and persists so the
  on-disk transcript reflects the new model.

Three new aggregate_models tests cover the prefixing rule (bare vs
multi-endpoint) and warn-and-skip on a failing endpoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 09:10:16 +03:00
537a0fe7f2 feat(helexa-acp): context compaction for small-context local models
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 26s
CI / Format (push) Successful in 29s
CI / Clippy (push) Successful in 2m26s
build-prerelease / Build cortex binary (push) Successful in 5m17s
build-prerelease / Build neuron-blackwell (push) Successful in 5m51s
CI / Test (push) Successful in 5m53s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 7m58s
build-prerelease / Build neuron-ada (push) Successful in 5m30s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m7s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
A new src/compaction.rs module projects rolling conversation history
into a token budget before each completion. Older tool results and
assistant prose get elided to one-line markers; system prompts, user
turns, and the last KEEP_TAIL=4 messages stay verbatim. tool_call_id
pairing is preserved so OpenAI strict-schema providers keep working.

Driven by a new per-endpoint `context_window` config field (also
HELEXA_ACP_CONTEXT_WINDOW for the env-only single-endpoint case).
When set, prompt budget = context_window - max_tokens - 512_safety;
when unset, behaviour is unchanged.

Without this, a 32 K Qwen3 dies with `prompt_too_long` after the
first few read_file results pile up in history — the symptom seen
in plan-mode dogfooding on beat.

10 new unit tests cover the compaction strategy and the prompt
budget arithmetic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 08:22:01 +03:00
cbadfcf112 feat(helexa-acp): plan mode — third session mode for read-and-plan-only flows
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m44s
CI / Test (push) Successful in 5m3s
build-prerelease / Build cortex binary (push) Successful in 4m36s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-blackwell (push) Successful in 6m37s
build-prerelease / Build neuron-ampere (push) Successful in 8m12s
build-prerelease / Build neuron-ada (push) Successful in 5m32s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Plan mode is the most restrictive of the three session modes: bash is
disabled outright, writes are confined to a per-project plan directory
under $XDG_DATA_HOME/helexa-acp/plans/<basename>-<8hex>/, and reads /
list_dir are unrestricted. The system prompt is rebuilt at the top of
every round so a mid-turn switch into (or out of) plan mode takes
effect on the next streaming round, and plan mode appends a 3-option
menu instructing the model to stop and let the user pick how to
proceed once the plan is complete.

The project id is basename + FNV-1a-32 of the cwd so it stays stable
across runs (SipHash's DefaultHasher reseeds per process), while still
disambiguating multiple checkouts that share a final path component.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 08:06:25 +03:00
3ecbb21ece fix(helexa-acp): persist per round, cancel previous prompt, log loop
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m32s
CI / Test (push) Successful in 5m8s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
build-prerelease / Build neuron-ampere (push) Successful in 8m13s
build-prerelease / Build neuron-ada (push) Successful in 5m18s
build-prerelease / Build cortex binary (push) Successful in 16m12s
build-prerelease / Package cortex RPM (push) Successful in 1m15s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m39s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Three changes addressing "session stops mid-turn and disk store
doesn't update":

1. Per-round persistence. drive_prompt previously called
   store::save() once at the very end of the turn. If the loop
   stalled in a later round (long-running bash, upstream SSE that
   never finished, wedged ACP roundtrip), earlier successful
   rounds lived only in the spawned task's `new_turns` and never
   reached disk. Move the extend-history + save into a helper
   (extend_and_persist) and call it at the end of every loop
   iteration. The post-loop save catches whatever the break paths
   leave behind. Failure is logged not propagated.

2. Cancel previous in-flight prompt on new session/prompt. The
   handler used to overwrite SessionState.cancel with a fresh
   token *without firing the old one*. A wedged prior prompt would
   then live forever, holding session-state references and never
   persisting. Now we fire the existing cancel under the lock
   before installing the new token — the old task observes
   is_cancelled() on its next .await and unwinds.

3. Per-round and per-tool log lines. drive_prompt now emits:
   - INFO  prompt round: streaming { round, of, history_turns }
   - INFO  dispatch tool { tool, tool_call_id }
   - INFO  dispatch tool complete { tool_call_id, is_error }
   - INFO  prompt round complete; persisting { round, turns }
   - INFO  prompt complete { stop_reason }
   so the next hang shows up by line number in /tmp/helexa-acp.log
   instead of as silence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 16:29:22 +03:00
0d841a4981 feat(helexa-acp): replay session history on session/load
Some checks failed
CI / Format (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 48s
CI / Test (push) Failing after 1m19s
CI / Clippy (push) Successful in 2m56s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m17s
build-prerelease / Package cortex RPM (push) Successful in 1m26s
build-prerelease / Build neuron-blackwell (push) Successful in 5m52s
build-prerelease / Build neuron-ampere (push) Successful in 7m49s
build-prerelease / Build neuron-ada (push) Successful in 5m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
session/list and session/load were both implemented but clicking
a session in Zed's thread picker still left the agent panel
empty. Zed (and ACP clients in general) doesn't cache the
transcript for custom agent_servers entries — it only owns
conversation state for first-party agents. For custom agents the
expectation is that session/load returns successfully and the
agent then re-emits the conversation as a stream of session/update
notifications so the client can rebuild its view.

Implement that replay path:

- handle_load_session now returns (LoadSessionResponse, Vec<Message>)
  so the caller has the history available after the in-memory
  hydration finishes.
- The session/load closure responds to the request *first*, then
  spawns a task that calls replay_history off the dispatch loop.
- replay_history walks the persisted history and emits one
  session/update per turn:
    Role::User           → UserMessageChunk(text)
    Role::Assistant text → AgentMessageChunk(text)
    Role::Assistant tool → AgentMessageChunk for any accompanying
                           text + one ToolCall card per call (with
                           kind/title/raw_input rendered the same
                           way as the live dispatch path)
    Role::Tool result    → ToolCallUpdate matching the assistant's
                           call id, status: Completed, content set
                           to the result text
    Role::System         → skipped (system prompts aren't shown)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 16:02:00 +03:00
0bbb9b752d feat(helexa-acp): session/list so Zed can discover sessions to resume
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 28s
CI / Format (push) Successful in 28s
CI / Clippy (push) Successful in 2m45s
build-prerelease / Build cortex binary (push) Successful in 4m41s
CI / Test (push) Successful in 4m58s
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 7m36s
build-prerelease / Build neuron-ada (push) Successful in 5m40s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m3s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Stage 3b only implemented the trailing half of resume: write
sessions to disk + handle session/load. But Zed (and any ACP
client) needs `session/list` to discover *which* session belongs
to the workspace it's reopening — without it, the client only
knows how to mint new sessions and resume never fires even
though the JSON sits ready on disk.

Add the missing pieces:

- store::list / list_in_dir — enumerate {id}.json under
  sessions_dir(), optionally filter by cwd, sort recent-first.
  Skips unparseable files with a warn rather than aborting.
- store::unix_to_iso8601 — RFC 3339 formatter for
  SessionInfo.updated_at; pulls chrono in directly (already in
  the dep tree transitively).
- agent::handle_list_sessions — wires the request to the store,
  builds SessionInfo entries with derived titles (first user
  turn, truncated to 60 chars).
- agent::initialize_response — advertise
  session_capabilities.list = {} alongside the existing
  load_session: true.

Verified end-to-end against the user's real hxa-1.json
(60-turn beat conversation): `session/list` returns the entry
with cwd, derived title, and ISO 8601 timestamp.

4 new store unit tests for list filtering, missing-dir
handling, unparseable-file skipping, and ISO 8601 formatting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 14:34:41 +03:00
5aac1ffc59 feat(helexa-acp): session resume via session/load
All checks were successful
CI / Format (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Clippy (push) Successful in 2m37s
CI / Test (push) Successful in 4m59s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m35s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
build-prerelease / Build neuron-ampere (push) Successful in 7m45s
build-prerelease / Build neuron-ada (push) Successful in 5m31s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Zed restarts (frequent during helexa-acp dogfooding) used to lose
every conversation because we'd ignore the load_session capability
and treat every project-reopen as a fresh session/new. Persist
sessions to disk and honour session/load so the agent panel comes
back where it left off.

Storage layout:
  $XDG_DATA_HOME/helexa-acp/sessions/{session_id}.json

Each file holds session_id, cwd, model_id, mode_id, full Message
history, plus created/updated timestamps. Atomic save via
tempfile+rename so a crash mid-write can't corrupt the store.

Touch points:

- src/store.rs (new) — sessions_dir() resolution, save/load via
  default and explicit-dir entry points (so unit tests don't have
  to race on XDG_DATA_HOME). 5 unit tests cover round-trip,
  not-found errors, atomic overwrite, tool-call/result preservation,
  and the filename sanitiser's path-traversal handling.
- src/provider/mod.rs — Serialize/Deserialize on Role, Message,
  MessageContent, ToolCall. MessageContent::Text turned into a
  struct variant ({text: ...}) so internally-tagged JSON works.
- src/agent.rs — initialize_response advertises load_session: true;
  handle_load_session reads the file, snapshots in-memory state,
  returns LoadSessionResponse with the persisted mode preselected;
  drive_prompt persists at the end of every prompt round under the
  session lock with the I/O outside the lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 13:34:42 +03:00
ec2b6450b2 feat(helexa-acp): infer tool name from arg shape when model omits it
Some checks are pending
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m33s
build-prerelease / Build cortex binary (push) Successful in 4m20s
CI / Test (push) Successful in 5m4s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m40s
build-prerelease / Build neuron-ampere (push) Successful in 7m53s
build-prerelease / Build neuron-ada (push) Successful in 5m33s
build-prerelease / Package cortex RPM (push) Successful in 8m20s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s
Qwen3.6-27B occasionally emits a <tool_call> body with the right
arguments but no top-level `name` field — observed in the field as
mkdir-style bash calls like
  {"arguments":{"command":"mkdir -p .../doc/plan/{01-discovery,...}"}}
with no `name`. The agent had no tool to dispatch and surfaced a
Failed card; the model would then hang or retry the same shape.

Add a shape-based inference layer:

- tools::infer_tool_name(arguments) — given an `arguments` object
  alone, return Some(name) when the key set uniquely identifies one
  tool: `{command}` or `{command,cwd}` → bash, `{path,content}` →
  write_file, `{path,old_text,new_text}` → edit_file. Ambiguous
  shapes (`{path}` alone — could be read_file or list_dir) return
  None so the agent still emits a Failed card rather than guessing.
- agent::try_repair_missing_name(raw) — parses a malformed body,
  applies infer_tool_name, returns (name, args_json) on success.
- drive_prompt sweeps malformed_calls through this repair before
  the Failed-card path. Recovered calls go into tool_buckets at
  the next free index and dispatch through the normal tool loop.

10 new unit tests in tools::tests cover the inference table plus
the verbatim mkdir failure from the field log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 13:14:50 +03:00
a494c8d43c feat(helexa-acp): repair malformed tool calls and render failures as cards
Some checks failed
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 28s
CI / Format (push) Successful in 4m7s
CI / Test (push) Failing after 1m2s
build-prerelease / Build neuron-blackwell (push) Successful in 6m10s
CI / Clippy (push) Successful in 2m37s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m24s
build-prerelease / Build neuron-ampere (push) Successful in 8m18s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Build neuron-ada (push) Successful in 5m23s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Two related fixes for cases where Qwen3 sometimes emits slightly-off
JSON inside <tool_call> blocks:

1. JSON repair pass in qwen3::parse_tool_call_body — strip up to
   three trailing extra `}` characters (model overshoots its closing
   braces), and hoist `name` out of `arguments` when it lands
   nested instead of as a sibling. Both observed in the field; both
   trivially repairable; both now dispatch as normal tool calls
   instead of falling back to the malformed path.

2. New CompletionEvent::MalformedToolCall variant for the cases
   repair can't fix. decode_stream now emits it instead of wrapping
   the raw body in a TextDelta, and agent.rs surfaces each one as
   a Failed SessionUpdate::ToolCall card (so Zed renders it as a
   structured failure UI element rather than dumping the body
   inline) plus a synthetic tool-call/tool-result history pair so
   the model gets clear feedback for self-correction on the next
   round.

Empty <tool_call></tool_call> blocks are now a no-op too (no
Malformed event), matching the existing empty-<think> behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:58:51 +03:00
abbedf8d8a chore(neuron): bump default max_tokens from 512 to 8192
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 44s
CI / Format (push) Successful in 45s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build neuron-blackwell (push) Successful in 5m35s
build-prerelease / Build cortex binary (push) Successful in 4m32s
CI / Test (push) Successful in 5m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 8m6s
build-prerelease / Build neuron-ada (push) Successful in 5m19s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
512 is too low for any modern coding model — clients that don't
explicitly set max_tokens get clipped responses with no diagnostic.
Bump the fallback at all four inference call sites (single-GPU
streaming + non-streaming, TP leader + non-leader) to 8192, which
fits comfortably within Qwen3-class context windows after a
typical agent prompt and lines up with what helexa-acp / a0 / curl
clients reasonably expect.

Clients that explicitly set max_tokens (now including helexa-acp
via HELEXA_ACP_MAX_TOKENS / per-endpoint TOML) override this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:38:28 +03:00
6cc14e925c feat(helexa-acp): per-endpoint max_tokens config
Some checks failed
CI / Format (push) Successful in 34s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Clippy (push) Failing after 1m3s
CI / Test (push) Failing after 1m4s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
The agent was sending max_tokens: None, letting cortex/neuron pick
its own default — which trips Zed's "Output Limit Reached" on long
turns. Add a per-endpoint max_tokens option in EndpointConfig
(TOML key and HELEXA_ACP_MAX_TOKENS env var for the single-endpoint
fallback) that the agent threads into every CompletionRequest by
endpoint name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:34:23 +03:00
1c16732668 feat(helexa-acp): route Qwen3 inline <think> blocks to reasoning
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
CI / Format (push) Successful in 26s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Clippy (push) Successful in 2m40s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
Qwen3 emits chain-of-thought as literal <think>...</think> tags
inside delta.content rather than via the separate reasoning_content
field — so without parsing the markers, the thinking shows up in
the message pane as ordinary text. Add a small ThinkParser in
qwen3.rs (same chunk-boundary discipline as ToolCallParser) and
stage it after the tool-call parser in decode_stream: text events
from the tool-call parser are fed in and split into TextDelta /
ReasoningDelta. Zed now renders thinking in its dedicated thought
UI; visible answer text stays in the message pane.

The parking-lot entry from the plan is now closed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:30:25 +03:00
5a0861d639 fix(helexa-acp): forward Dispatch::Response to its awaiting router
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 41s
CI / Clippy (push) Successful in 2m31s
build-prerelease / Build cortex binary (push) Successful in 4m36s
CI / Test (push) Successful in 5m31s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m51s
build-prerelease / Package cortex RPM (push) Successful in 1m29s
build-prerelease / Build neuron-ampere (push) Successful in 7m18s
build-prerelease / Build neuron-ada (push) Successful in 5m6s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
The catch-all on_receive_dispatch handler was applying
respond_with_error to *every* Dispatch variant, including Response.
For Response variants, that call routes the error to the
ResponseRouter for the *outgoing* request — silently overwriting
the real reply from Zed with "Internal error: not implemented yet".

Every ACP roundtrip we issue (fs/read_text_file, fs/write_text_file,
session/request_permission, terminal/*) was therefore returning an
error to the tool runner regardless of what Zed actually responded.
The model saw uniformly-failing tools, gave up, and confabulated
plausible explanations.

Fix: pattern-match the Dispatch. Response → forward to its router
via respond_with_result. Request / Notification → keep the
"not implemented yet" error response as before.

Found via debug logs showing
  WARN helexa_acp::agent: unhandled ACP message method="fs/read_text_file"
right before every tool failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:16:21 +03:00
33652ac651 feat(helexa-acp): HELEXA_ACP_LOG_FILE env for editor-host logging
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m44s
CI / Test (push) Successful in 5m3s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m36s
build-prerelease / Build neuron-blackwell (push) Successful in 6m1s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Build neuron-ampere (push) Successful in 8m23s
build-prerelease / Build neuron-ada (push) Successful in 5m26s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 6m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
Editors that launch ACP agents (Zed today) don't reliably surface
the child's stderr — and `args` in an `agent_servers` config is
exec-args, not shell, so the usual `&>>` redirect trick doesn't
work. Add a HELEXA_ACP_LOG_FILE env var that, when set to an
absolute path, routes the tracing subscriber to append-write that
file (ANSI off) instead of stderr. RUST_LOG still controls levels.
Unopenable paths fall back to stderr with a warning so a typo
doesn't silence the agent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:47:28 +03:00
c297a54074 chore(helexa-acp): log raw bash output and tool result snippets
All checks were successful
CI / Format (push) Successful in 36s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m38s
build-prerelease / Build neuron-blackwell (push) Successful in 4m34s
build-prerelease / Build cortex binary (push) Successful in 4m49s
CI / Test (push) Successful in 5m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ampere (push) Successful in 7m46s
build-prerelease / Build neuron-ada (push) Successful in 7m38s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Diagnostic for "the tool ran but the model thinks it failed" cases.
Logs at debug level:

- exec_bash: terminal/create command + cwd, terminal/exit code/signal,
  terminal/output bytes + truncated flag + 200-char snippet.
- dispatch_tool_call: 200-char snippet of every successful result
  before it's folded back into history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:15:26 +03:00
0121a1930f feat(helexa-acp): inject and parse Qwen3 Hermes tool format
Some checks failed
CI / Format (push) Successful in 38s
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Clippy (push) Successful in 2m33s
CI / Test (push) Successful in 5m45s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 5m13s
build-prerelease / Build neuron-blackwell (push) Successful in 6m0s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 7m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
The OpenAI `tools` API field isn't load-bearing in this stack —
neuron's chat template renders only message.content, so tool
definitions sent that way never reach the model. Move both sides
of the tool conversation into the Qwen3 Hermes wire format the
model is actually trained on:

- Append a `# Tools` block to the system prompt describing every
  available function (qwen3::render_tool_block).
- Parse `<tool_call>{json}</tool_call>` markers out of the streamed
  content via a chunk-boundary-safe state machine (qwen3::ToolCallParser),
  surfacing them as the existing CompletionEvent::ToolCall* events
  so the agent loop doesn't change.
- Re-serialise assistant turns that called tools with inline
  `<tool_call>` blocks and tool results as user turns wrapped in
  `<tool_response>` (qwen3::render_assistant_with_tool_calls,
  render_tool_response).

Verified against cortex+Qwen3.6-27B: the model produces a
well-formed `<tool_call>{"name":"list_dir","arguments":{"path":"/tmp"}}</tool_call>`
in response to a Hermes-formatted prompt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:06:38 +03:00
13f4c36aeb chore(helexa-acp): log outgoing chat-completion body at debug level
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 47s
CI / Clippy (push) Failing after 56s
CI / Test (push) Successful in 5m43s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m22s
build-prerelease / Build cortex binary (push) Successful in 6m51s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 7m14s
build-prerelease / Build neuron-ada (push) Successful in 5m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
Useful for diagnosing "the model isn't using tools" — confirming
that helexa-acp is in fact sending the `tools` array (and what
messages, system prompt, etc. accompany it) without having to
attach a packet capture upstream of cortex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 10:38:10 +03:00
4a51a54554 fix(helexa-acp): describe Stage 3 tools in the default system prompt
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 42s
CI / Clippy (push) Successful in 2m39s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
The Stage 2 prompt told the model it had no tools, which models
trained for caution then dutifully repeat back ("Stage 2 build: no
tools available — I can't read files…"). Stage 3 ships tools in the
CompletionRequest.tools array, but the system message was still
overriding that. Update the default prompt to list the five tools
and instruct the model to use them rather than asking the user to
paste contents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 10:33:17 +03:00
0609f1ac5d feat(helexa-acp): add tools, session modes, and permission gating
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Format (push) Successful in 39s
CI / Clippy (push) Successful in 2m38s
CI / Test (push) Successful in 5m9s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m54s
build-prerelease / Build neuron-ampere (push) Successful in 7m54s
build-prerelease / Build neuron-ada (push) Successful in 4m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m14s
build-prerelease / Build cortex binary (push) Successful in 4m9s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 6m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 3m54s
Stage 3 introduces five tools (read_file, write_file, edit_file,
list_dir, bash) backed by ACP fs/* and terminal/* calls, a
ClientOps trait so the runner is mock-testable, two session modes
(default + bypassPermissions) with session/set_mode honouring them,
and a tool-call loop in the agent that streams the model, dispatches
each call, feeds results back into history, and re-enters until the
model finishes or MAX_TOOL_ROUNDS is hit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 10:01:32 +03:00
96fc379893 feat(helexa-acp): wire ACP agent loop for text-only conversations
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 41s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m35s
build-prerelease / Build cortex binary (push) Successful in 5m26s
CI / Test (push) Successful in 5m43s
build-prerelease / Build neuron-blackwell (push) Successful in 5m47s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 8m13s
build-prerelease / Build neuron-ada (push) Successful in 5m28s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Stage 2 lands the agent loop on top of the Stage 1 scaffold: session
state with per-session cancellation, a system-prompt builder honouring
HELEXA_ACP_SYSTEM_PROMPT_PATH / system_prompt_path TOML, and handlers
for initialize / session/new / session/prompt / session/cancel that
stream provider output back as session/update notifications. Verified
end-to-end against cortex from Zed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 09:46:22 +03:00
e267f583e1 chore(neuron): rustfmt drift in is_device_fault test
Some checks failed
CI / Format (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 58s
CI / Clippy (push) Failing after 3m43s
CI / Test (push) Successful in 5m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m48s
build-prerelease / Build neuron-blackwell (push) Successful in 6m10s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
build-prerelease / Build neuron-ampere (push) Successful in 7m41s
build-prerelease / Build neuron-ada (push) Successful in 5m17s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 9m18s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
One assert! call grew past the line limit after the previous commits;
cargo fmt --all picked it up. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:13:55 +03:00
e23d5011d0 feat(helexa-acp): scaffold ACP bridge with provider trait + OpenAI chat
Adds a new workspace crate `helexa-acp` (binary, Apache-2.0) — the
start of "the missing ACP binary" for multi-endpoint LLM setups
mixing public APIs, private LAN deployments, and various wire
formats. Today it speaks OpenAI /v1/chat/completions; the
Provider trait is the seam that lets OpenAI Responses, Anthropic
/v1/messages, and other wire formats slot in later without touching
the agent loop.

The crate is intentionally self-contained — no dependencies on the
other workspace crates (cortex-core, cortex-gateway, neuron) — so a
future migration to a dedicated GitHub repo is a Cargo.toml-only
change. All deps come from crates.io.

This commit lands:

  * `config.rs` — TOML config at $XDG_CONFIG_HOME/helexa-acp/config.toml
    with multi-endpoint support (each `[[endpoints]]` declares its
    name, base_url, wire_api, default_model, optional API key /
    api_key_env). Falls back to env-only single-endpoint config when
    no TOML exists (HELEXA_ACP_BASE_URL, HELEXA_ACP_MODEL, etc.). The
    `endpoint:model` selector syntax is validated and tested.

  * `provider/mod.rs` — `Provider` trait + provider-agnostic types
    (`CompletionRequest`, `CompletionEvent`, `Message`, `ToolCall`,
    `ToolSpec`, `Role`, `UsageStats`). Agent loop consumes these
    without knowing the wire format on the other side.

  * `provider/openai_chat.rs` — `OpenAIChatProvider` impl. Compatible
    with cortex, LM Studio, Ollama (compat mode), OpenRouter, OpenAI
    itself. Streams via reqwest + eventsource-stream + async-stream.
    Surfaces text deltas, reasoning deltas (for models that emit
    `reasoning_content`), tool-call lifecycle (start, args-delta,
    completion), usage, finish reason. Cancellation-token aware.

  * `main.rs` — tokio + stderr-only tracing-subscriber + Stdio
    transport. Builds a provider per configured endpoint at startup,
    surfacing config mistakes before the editor even initializes.
    Currently responds to `initialize`; everything else stubs to
    `not implemented yet` until the agent loop lands in the next
    commit.

12 unit tests pass — encoder shape, decoder shape (text-only,
tool-call progressive, cancellation, malformed-chunk recovery),
config parsing (multi-endpoint TOML, env fallback, validation).

The `#![allow(dead_code)]` on `provider/mod.rs` is temporary — the
agent loop in the next commit reads every field. It's noted in the
module-level docstring so the next reader knows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:13:47 +03:00
249b2e5c98 fix(neuron): only poison the model on actual device faults
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Clippy (push) Successful in 2m22s
CI / Test (push) Successful in 4m55s
build-prerelease / Build cortex binary (push) Successful in 4m24s
build-prerelease / Build neuron-blackwell (push) Successful in 5m49s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 8m7s
build-prerelease / Build neuron-ada (push) Successful in 5m0s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
CI / Format (push) Failing after 33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Previously every inference Err — shape mismatch, NaN logits, tokenizer
error, missing handle — marked the model poisoned and rejected every
subsequent request until an operator unload+reloaded. The benjy
incident on 2026-05-27 showed how this misfires: a concurrency bug
produced a `broadcast_add: shape mismatch` error that had nothing to
do with CUDA, but the model was taken down anyway.

Add `is_device_fault(err_chain: &str)` — a conservative classifier
that returns false only for errors we know are pre-kernel / CPU-side
(shape mismatches, NaN logits, tokenize/detokenize, missing handle,
DecodeStream, empty prompt). Everything else defaults to true so a
genuine driver fault still poisons.

Applied at all six poisoning sites:
  - chat_completion CUDA worker path
  - chat_completion CPU spawn_blocking path
  - chat_completion_stream CUDA worker path
  - chat_completion_stream CPU spawn_blocking path
  - chat_completion_tp non-streaming wrapper
  - chat_completion_tp_stream spawned task

Each site now logs either "model marked poisoned" (device fault) or
"model NOT marked poisoned" (non-device) so the journal makes the
classification visible. Tests cover the known non-device patterns and
a couple of real CUDA driver messages.

Pairs with the inference_lock commit (c59da83): together they
eliminate both the cause of the spurious-poisoning we just observed
(the shape mismatch) AND the over-reaction to it (the unconditional
poison). Each fix is independently useful but the combination is
what makes the system actually robust to concurrent agent workloads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:57:48 +03:00
c59da83636 fix(neuron): serialise single-GPU inference per loaded model
Two concurrent chat_completion requests against the same single-GPU
model could interleave their `clear_kv_cache → forward(chunk0) →
forward(chunk1) → ...` sequences. The device-worker channel serialises
individual jobs but not the sequence boundary, so the cache could end
up holding tokens from one request while another's mask was sized for
its own prompt — producing a shape mismatch mid-prefill.

Observed on benjy 2026-05-27 18:41:05: agent-zero's `memorize memories`
and `memorize solutions` extensions fired 4ms apart against
Qwen/Qwen3-8B (a0's utility model). Both prefilled into the same KV
cache, and request a08b4a's chunk 0 forward produced scores of shape
[1, 32, 512, 1024] against a mask of [1, 1, 512, 512] — broadcast_add
failed, both requests bubbled the error up, both flipped the model to
poisoned.

Add `LoadedModel.inference_lock: tokio::sync::Mutex<()>`, mirroring
the TpLoadedModel.pool lock that the TP path already held. Acquire
it at the start of `chat_completion` and inside the spawned task of
`chat_completion_stream` (so the role chunk goes out immediately and
only the inference work queues behind the lock).

The CPU branch uses `blocking_lock` from inside spawn_blocking; the
CUDA branch uses async `.lock().await` inside tokio::spawn.

Throughput impact: zero. The GPU was already serialised at the
device-worker channel — multiple requests just produced corrupt KV
cache state instead of clean serial throughput. The lock makes the
existing serialisation honest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:54:04 +03:00
f05882369d fix(neuron): don't poison the model on tokio JoinError panics
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 34s
CI / Clippy (push) Successful in 2m18s
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Package cortex RPM (push) Successful in 1m28s
build-prerelease / Build neuron-ampere (push) Successful in 8m25s
build-prerelease / Build neuron-ada (push) Successful in 8m54s
CI / Test (push) Successful in 4m43s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
CUDA driver failures propagate as Err through `?` and become
`Ok(Err(InferenceError::Other(_)))` from the spawned task — those are
real device faults and still poison the model. Tokio JoinError is
different: it fires on Rust-level panic (tokenizer bug, sampler bug,
serialisation, the UTF-8 slice that landed in commit bd04d7f before
the fix) or task cancellation. Those don't touch the device context,
so failing the one request without tearing down the model is correct.

Two sites changed:

  - chat_completion's CPU spawn_blocking handler — JoinError no longer
    sets loaded.poisoned.
  - chat_completion_tp's tokio::spawn wrapper — JoinError no longer
    sets tp_for_marker.poisoned. The inner-Err case still does.

Each path logs the cause (panicked / was cancelled / ended abnormally)
explicitly so the journal makes the new behaviour obvious — search for
"model NOT marked poisoned" to find these events.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:02:52 +03:00
bd04d7f580 fix(neuron): stream tokens via DecodeStream to avoid UTF-8 panic
When BPE byte-fallback splits a multi-byte UTF-8 char (e.g. an emoji)
across multiple tokens, the previous "decode the cumulative token list,
byte-slice the delta against a stored prefix" pattern would panic with
'start byte index N is not a char boundary; it is inside <emoji>'.

The race: at step N the tokenizer renders the partial bytes as U+FFFD
(3 bytes); at step N+1 it can decode the complete codepoint (e.g. 4
bytes for 🌫). `decoded_prefix.len()` from step N then lands inside the
codepoint in step N+1's `full` string, and `&str[start..]` panics.

Replace with tokenizers' `DecodeStream::step(id)` which maintains an
internal byte buffer across token boundaries and only emits when a
clean codepoint completes. Applied at all three SSE emission sites:

  - stream_inference_via_worker (single-GPU CUDA stream)
  - chat_completion_tp_stream's spawned task (TP stream)
  - run_inference_streaming (CPU stream)

The shared emit helper splits into emit_delta (async, mpsc::send) and
emit_delta_blocking (sync, mpsc::blocking_send) so each path keeps its
existing send semantics. The old emit_chunk helper that did the
unsafe full-decode-and-slice is removed entirely.

Observed on beast 2026-05-27 17:49:55 — model emitted 🌫 in a tool-call
response after a long agent-zero session; the spawned TP stream task
panicked at candle.rs:2648. The model itself stayed healthy (no CUDA
fault), only the one streaming request died.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:01:24 +03:00
1e13889392 feat(neuron): chunked prefill + VRAM/prompt-length pre-flight checks
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m15s
CI / Test (push) Successful in 5m9s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 5m1s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-blackwell (push) Successful in 11m7s
build-prerelease / Build neuron-ampere (push) Successful in 12m16s
build-prerelease / Build neuron-ada (push) Successful in 12m30s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Prevents the OOM-during-prefill → poisoned-context → 5-minute-reload
cycle observed on beast under agent-zero workloads. Three changes,
all keyed off env-driven knobs so an operator can tune without a
rebuild:

1. Chunked prefill (NEURON_PREFILL_CHUNK_TOKENS, default 512). The
   initial forward is split into N-token windows, each with a
   monotonically growing offset. KV cache accumulates across chunks
   exactly as it would under one big prefill; only the final chunk's
   logits are kept for sampling. Activation memory now scales with
   chunk size instead of prompt length, so a 13 k-token prompt stops
   holding tens of GB of intermediate activations live at once.

   Wired into all six prefill call sites:
   - run_inference / run_inference_streaming (CPU path)
   - run_inference_via_worker / stream_inference_via_worker (CUDA
     single-GPU through device worker)
   - chat_completion_tp_inner / chat_completion_tp_stream (TP via
     WorkerPool)

   Three helpers — chunked_prefill_local, chunked_prefill_via_worker,
   chunked_prefill_tp — own the loop shape so the chunking semantics
   stay identical across paths. Per-chunk debug log shows progress.

2. Max prompt length (NEURON_MAX_PROMPT_TOKENS, default 16384).
   Requests above the cap return a structured 400 with
   `code: prompt_too_long` rather than going through the prefill and
   discovering the limit by OOMing partway through. New
   InferenceError::PromptTooLong variant.

3. Minimum free VRAM gate (NEURON_MIN_FREE_VRAM_MB, default 1500).
   If `vram_free_mb` is below the threshold at request start (e.g.
   another concurrent request is mid-prefill), reject with a clean
   503 + `code: insufficient_vram` rather than starting work that
   will OOM. New InferenceError::InsufficientVram variant. CPU loads
   (vram=0 sentinel) skip this check.

All three gates fire BEFORE any device work, so a rejected request
costs ~one tokenisation pass and never touches the worker thread —
poison cascades from rejected work are now impossible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:46:54 +03:00
35876954cd chore(neuron): default tracing filter to info (was info,neuron=debug)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m17s
CI / Test (push) Successful in 4m43s
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Build neuron-blackwell (push) Successful in 3m43s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 5m12s
build-prerelease / Build neuron-ada (push) Successful in 5m25s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m55s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m14s
Production deployments that want neuron-internal debug detail (e.g.
trim_device_pool's per-clear-kv line, slab inserts/drops) override
RUST_LOG explicitly via systemd. Defaulting to debug for the whole
neuron target produced a lot of journal volume that wasn't useful in
the common case.

beast already sets RUST_LOG=debug in
/etc/systemd/system/neuron.service.d/local.conf, so beast's verbosity
is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 12:47:30 +03:00
cdf0f4e66d fix(neuron): trim cudarc mempool after clear_kv_cache to release VRAM
cudarc's stream-ordered memory pool retains freed blocks (cuMemFreeAsync
returns memory to the device's default mempool, not to the OS), so
mem_get_info under-reports free VRAM between requests. With
Qwen/Qwen3.6-27B TP=2, the second consecutive chat completion saw
~4.5 GB of "missing" free VRAM and either OOMed or tripped cuBLAS into
CUBLAS_STATUS_INTERNAL_ERROR depending on quant.

Add a cuda-gated trim_device_pool helper that, after each successful
clear_kv_cache, synchronizes the context and calls cuMemPoolTrimTo(pool,
0) against the device's default mempool. Failures (no async-alloc
support, transient driver errors) are non-fatal and log at debug. The
before/after free-VRAM delta is logged so an operator can correlate the
trim with the next request's prefill VRAM.

ConcatKvCache::reset() in candle-nn 0.10.2 already drops its tensors
correctly; the leak was strictly at the cudarc pool layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 12:36:13 +03:00
b4f3576d82 refactor(neuron): phase 4 — model loads move onto the device worker
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m25s
CI / Test (push) Successful in 4m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 5m7s
build-prerelease / Build neuron-ada (push) Successful in 5m19s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Final structural slice of the per-device CUDA context-ownership
refactor. The four remaining spawn_blocking sites that did CUDA work
on the leader are gone:

- Single-GPU GGUF load (`load_arch_gguf` spawn_blocking) →
  `Job::LoadGguf` dispatched on the worker.
- Single-GPU dense load (`load_arch_dense` spawn_blocking) →
  `Job::LoadDense` on the worker.
- TP shard load (`WorkerPool::load_dense_shard` spawn_blocking) →
  `Job::TpLoadShard`. The dispatch handler reads `state.nccl.comm()`
  directly — no cross-thread `Arc<Comm>` transfer, no `SendComm`
  wrapper for this path.

The Phase 2 / Phase 3 bridges that moved freshly-built models across
the channel boundary (`Job::TransferIn`, `Job::TransferInTp`,
`Job::CloneLeaderComm`) are removed. Models are now constructed on
the worker thread directly; the slab gets populated by `insert_arch` /
the inline `tp_models.insert` in dispatch handlers.

What this phase preserves:

- CPU loads still use `tokio::task::spawn_blocking` against
  `Arc<Mutex<ModelArch>>`. There's no CUDA context to own on CPU and
  channel overhead would only add latency. Four `spawn_blocking`
  references remain in `candle.rs` (load_arch_gguf, load_arch_dense,
  chat_completion, chat_completion_stream) and all are deliberate
  CPU-only fallback.
- Public API unchanged. `Harness::load_model`, `chat_completion`,
  HTTP routes all keep identical signatures.

What this phase removes:

- `SendComm` wrapper is no longer used in the load path (the Phase 3
  bridge that justified it). It remains in `nccl_state.rs` for the
  Phase 1–3 era and any future cross-thread Comm move; consider
  deleting in a follow-up.
- `Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm` and
  their handle convenience methods deleted.
- The leader_device parameter on `load_dense_shard` is now `_` —
  unused since the worker has its own bound device. Removing the
  arg outright is a public-API change; keeping the underscore prefix
  preserves the signature and signals deadness without churn.

Helper relocation:

- `LlamaDense::from_parts` is a new pub(crate) constructor so the
  worker-thread loader can build a `LlamaDense` without going through
  the original `load_arch_dense` async function.
- `check_dense_config_supported` is bumped to `pub(crate)` for the
  same reason.

Sweep verified: `grep -rn spawn_blocking crates/neuron/src/harness/`
returns only CPU-fallback hits in `candle.rs` + doc-comment references
to the old design. All four leader-side CUDA `spawn_blocking` sites
are gone.

fmt + clippy clean; 37 lib tests + all integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:24:38 +03:00
76ab24d98c refactor(neuron): phase 3 — TP forward + NCCL state move onto device worker
Some checks failed
CI / Format (push) Successful in 29s
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Test (push) Failing after 58s
CI / Clippy (push) Successful in 2m31s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m13s
build-prerelease / Build neuron-blackwell (push) Successful in 4m1s
build-prerelease / Package cortex RPM (push) Successful in 1m30s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Third slice of the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The
leader's `NcclState`, every `Comm::all_reduce` issued by the TP layers,
the leader-side KV cache reset, and the TP forward step itself now all
run on the per-device worker thread — the same OS thread that bound
the leader's `CudaContext` at startup.

What this phase changes:

- `Job` gains `NcclInit`, `NcclSanity`, `CloneLeaderComm` (Phase 3
  bridge — Phase 4 removes), `TransferInTp`, `DropTp`, `TpClearKv`,
  `TpForwardLogits`. Plus a new `TpHandle(u64)` opaque key.
- `DeviceWorkerState` gains `nccl: NcclState` and
  `tp_models: HashMap<TpHandle, Box<TpLeaderModel>>` (+ counter).
- `WorkerPool` loses its `leader_nccl` field; gains a
  `leader_worker: Arc<DeviceWorkerHandle>` passed at construction.
  `init_nccl`, `nccl_sanity_check`, `load_dense_shard`,
  `generate_step`, `clear_kv_cache` all route their leader-side ops
  through `Job::Nccl*` / `Job::Tp*` instead of spawn_blocking against
  a Mutex-wrapped state. `generate_step` returns `Vec<f32>` instead
  of a device-resident `Tensor` — the worker copies logits to CPU
  before reply so the async caller can sample on a CPU candle
  tensor with zero device-context touch.
- `TpLoadedModel.leader_model: Arc<Mutex<TpLeaderModel>>` → opaque
  `leader_handle: TpHandle`. The boxed `TpLeaderModel` lives in the
  worker thread's slab; both the model's CUDA tensors and the
  embedded `Arc<Comm>` clones release on the same thread that
  allocated them (the Drop semantics constraint cudarc forces).
- `Job::CloneLeaderComm` is a Phase 3 bridge: the TP shard load still
  runs in spawn_blocking and needs the leader's `Arc<Comm>` to build
  the row-parallel layers' AllReduce ops. The Job clones the Comm
  out of the worker's NcclState and ships it back as `SendComm`.
  Phase 4 deletes this bridge when the load itself moves onto the
  worker.
- `Job::NcclInit` and `Job::NcclSanity` are ungated by `cuda` so the
  no-cuda `NcclState` stubs (which reply with `cuda_feature_not_enabled`)
  still flow through the same channel uniformly; the cuda-only
  TP variants (CloneLeaderComm, Transfer/Drop/Clear/Forward Tp)
  remain gated.

What this phase doesn't touch (yet):

- TP shard load itself — still spawn_blocking, bridged via
  `CloneLeaderComm`. Phase 4 moves it to `Job::TpLoadShard` and
  reads `state.nccl.comm()` directly inside the worker.
- Single-GPU model loads — still spawn_blocking, transferred via
  `Job::TransferIn`. Phase 4 moves them.
- `device_vram_mb` / `cuda_mem_mb` / `log_construction_complete`
  helpers — still present, used inside spawn_blocking load closures.
  Phase 4 cleanup folds them into `dispatch.rs`.

`tp/mod.rs::WorkerPool::spawn` gained a required
`leader_worker: Arc<DeviceWorkerHandle>` argument. Three external
callers were updated: `CandleHarness::load_tp` (passes the cached
device worker), `main.rs::tp_smoke` (spawns a fresh worker), and
the two `tp_worker_lifecycle*.rs` integration tests.

Public API unchanged. fmt + clippy clean; 37 lib tests + all
integration tests pass. CUDA-only TP integration smoke deferred to
the next deploy on beast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:16:02 +03:00
b179204fd3 refactor(neuron): phase 2 — single-GPU forward + clear_kv route through device worker
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
CI / Format (push) Successful in 34s
CI / Clippy (push) Successful in 2m12s
build-prerelease / Resolve version stamps (push) Successful in 3m41s
CI / Test (push) Successful in 5m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m32s
build-prerelease / Build neuron-ampere (push) Successful in 5m20s
build-prerelease / Build cortex binary (push) Successful in 12m20s
build-prerelease / Build neuron-ada (push) Successful in 5m17s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Second slice of the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The two
spawn_blocking sites in `chat_completion` and `chat_completion_stream`
now route through the device worker thread on CUDA loads. CPU loads
keep the existing spawn_blocking + `Arc<Mutex<ModelArch>>` path; there's
no context to own and the channel hop would only add latency.

What this phase changes:

- `Job` gains `TransferIn`, `DropArch`, `ClearKv`, `ForwardLogits`. The
  worker's dispatch state grows a `HashMap<ArchHandle, Box<ModelArch>>`
  slab and a `next_handle` counter for minting opaque handles.
- `LoadedModel.arch: Arc<Mutex<ModelArch>>` → `Option<Arc<Mutex<>>>`,
  plus a new `arch_handle: Option<ArchHandle>` field. The two are
  mutually exclusive: CUDA loads set `arch_handle = Some(_)` after
  transferring the boxed arch into the worker's slab; CPU loads keep
  `arch = Some(_)` for the legacy spawn_blocking path.
- New `run_inference_via_worker` and `stream_inference_via_worker`
  drive the prefill + decode loop by sending `Job::ForwardLogits` per
  step; the worker copies the resulting `[vocab]` logits to a
  CPU-side `Vec<f32>` before reply, so the async caller never holds a
  device-resident tensor. `apply_repeat_penalty` and
  `LogitsProcessor::sample` run on a CPU candle tensor; no context
  binding side-effects on tokio worker threads.
- `logits_health_slice(&[f32])` complements the existing
  `logits_health(&Tensor)` so the new worker paths can compute
  health stats directly from the CPU vec.
- `unload_model` for the single-GPU CUDA path now sends
  `Job::DropArch { handle }` to the worker so the `Box<ModelArch>`
  drops on the thread that allocated its CUDA tensors. The `Drop` runs
  with the bound context, freeing memory on the right context.

What this phase doesn't touch (yet):

- TP forward, TP load, NCCL bring-up — still on spawn_blocking. Phase 3.
- Single-GPU model load — still spawn_blocking, followed by a
  `Job::TransferIn` to move the freshly-built `ModelArch` into the
  worker slab. Phase 4 moves the load itself onto the worker thread
  and eliminates the bootstrap TransferIn.
- The `device_vram_mb` / `cuda_mem_mb` helpers — still present and
  used by the construction-time logs running inside spawn_blocking
  loads. Phase 4 cleanup folds them into `dispatch.rs`.

Public API unchanged. fmt + clippy clean; 37 lib tests + all
integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 09:55:08 +03:00
081b532387 refactor(neuron): phase 1 — per-device worker thread, VRAM queries route through it
Some checks failed
CI / Format (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Clippy (push) Failing after 59s
build-prerelease / Build neuron-blackwell (push) Successful in 3m30s
CI / Test (push) Successful in 4m47s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m17s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
build-prerelease / Build neuron-ampere (push) Successful in 5m16s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
First slice of the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. Adds the
infrastructure for a dedicated OS thread per CUDA device that owns the
device's `CudaContext` for the daemon's lifetime, and routes the 8
async-context `device_vram_mb()` call sites in candle.rs through it.

What this phase changes:

- New module `harness/device_worker/` (mod.rs, jobs.rs, dispatch.rs).
  `DeviceWorkerHandle::spawn(idx)` creates a named OS thread
  (`cuda-dev-N`), binds `CudaContext::new(idx)` once at startup, and
  enters a dispatch loop reading `Job`s off a `std::sync::mpsc` channel.
  Replies cross back via `tokio::sync::oneshot::Sender` so async callers
  await without parking a tokio worker.
- Two Job variants: `QueryVram` and `Shutdown`. Phases 2–4 add Forward,
  ClearKv, NCCL init/sanity, and load variants.
- `LoadedModel` and `TpLoadedModel` gain a `worker` field populated at
  load time by a new `CandleHarness::ensure_device_worker(idx)` method
  that lazily spawns + caches one worker per device index.
- Per-model `query_vram()` convenience method on both struct types so
  the 8 call sites in chat_completion / chat_completion_stream /
  chat_completion_tp_inner / chat_completion_tp_stream become
  `loaded.query_vram().await` (or `tp.query_vram().await`) — same field
  values logged, just sourced from the owner thread instead of the
  caller thread.

What this phase doesn't touch (yet):

- Forward, kv-cache clear, model load, NCCL — still on `spawn_blocking`.
  Phase 2 moves the single-GPU forward + clear; Phase 3 moves the TP
  forward + NCCL bring-up; Phase 4 moves the loads and deletes the now-
  unused `device_vram_mb` / `cuda_mem_mb` helpers.
- Public API — unchanged. `Harness::load_model`, `chat_completion`,
  HTTP routes all keep identical shapes.

Tests:

- 5 new unit tests in `device_worker/mod.rs::tests` cover spawn → query
  → shutdown round-trip, thread naming, post-shutdown submit returns
  `Gone`, poisoned flag fast-rejects, and concurrent jobs drain across
  a Shutdown. CPU build (the only one CI runs) is enough to exercise
  channel mechanics.
- All 37 lib tests + all integration tests pass; fmt + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 09:40:34 +03:00
7c19da9361 feat(neuron): construction-complete vram/config dump + logits health + per-step vram
All checks were successful
CI / Format (push) Successful in 40s
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Clippy (push) Successful in 2m27s
build-prerelease / Build cortex binary (push) Successful in 4m24s
build-prerelease / Build neuron-blackwell (push) Successful in 4m0s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-ampere (push) Successful in 5m10s
build-prerelease / Build neuron-ada (push) Successful in 4m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
CI / Test (push) Successful in 4m24s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Three additive diagnostics that turn the 2026-05-27 q5k Qwen3.6-27B
incident from "guess at KV cache / quant sizes" into "read the
journal":

1. Construction-complete summary in TpQwen3_5ForCausalLM::load and
   TpQwen3ForCausalLM::load. After the last "after layer N" log fires,
   each rank emits a single info line with: free_mb/total_mb (the
   number that drops by ~9 GB between per-layer and first-request on
   beast, with no inference traffic), every resolved config knob
   (vocab_size, hidden_size, num_layers, head_dim, num_kv_heads,
   max_position_embeddings), and a per-token KV-cache byte estimate.
   For Qwen3-Next also includes the linear/full-attention layer split
   so the hybrid architecture's cache cost is unambiguous.

2. Logits health snapshot on sample failure. Today the failure logs
   "A weight is negative, too large or not a valid number" with no
   context — was it a NaN cascade, an Inf, a negative weight?
   `logits_health(&logits)` computes nan/pos_inf/neg_inf/neg counts
   plus finite_min/max/mean on the failure path (zero cost on the
   success path) and emits a warn line just before the wrapper's
   terminal "failed, model marked poisoned" log. Wired into both the
   prefill and decode sample sites of the non-streaming AND streaming
   TP chat paths.

3. VRAM snapshot at prefill complete + every decode step. The
   "prefill complete" info line now carries vram_free_mb so the
   activations + KV growth from the prefill itself is visible. The
   per-step trace line gets vram_free_mb too, so an operator running
   with RUST_LOG=trace can watch headroom shrink token by token.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 09:04:55 +03:00
24e20dcb5c feat(catalogue,gateway): model aliases (helexa/small, helexa/balanced, helexa/large)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 40s
CI / Clippy (push) Successful in 2m21s
CI / Test (push) Successful in 4m40s
build-prerelease / Build neuron-blackwell (push) Successful in 3m38s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 5m20s
build-prerelease / Build neuron-ada (push) Successful in 4m45s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m10s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 9m40s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Operators can now define tier aliases in models.toml:

  [aliases]
  "helexa/small" = "Qwen/Qwen3-1.7B"
  "helexa/balanced" = "Qwen/Qwen3-8B"
  "helexa/large" = "Qwen/Qwen3.6-27B"

A client request for `model: "helexa/small"` is resolved to the concrete
model id at routing time. The gateway also rewrites the proxied body's
`model` field to the concrete id so neuron sees a name that matches its
loaded handle (otherwise the harness rejects the request).

Motivated by the finger-in-the-wind benchmark: same "what's the capital
of Georgia" probe runs in 2.5s on the 1.7B vs 6.7s on the 27B with
identical correctness. Aliases let clients pick a latency tier without
hardcoding model ids, and let operators swap targets without changing
client code.

Changes:
  * cortex-core: `ModelCatalogue` gains `aliases: HashMap<String, String>`
    + `resolve_alias(&str) -> &str`. Unit tests cover the basic
    resolution + TOML round-trip.
  * cortex-gateway:
    * `RouteDecision` gains `resolved_model_id: String`. `router::resolve`
      consumes aliases at entry and threads the concrete id through.
    * Handlers (chat_completions, completions, anthropic_messages
      streaming + non-streaming) rewrite the body's `model` field with
      `rewrite_model_in_body` before proxying, using the resolved id
      for metrics labels, LRU touch, and the body itself.
    * `/v1/models` (Pass 4) emits each alias as its own entry mirroring
      the target's `loaded` flag, feasible_on, and locations — clients
      browsing the endpoint see both names and can pick either.
  * `models.toml` declares the three tier aliases; `models.example.toml`
    documents the section as opt-in.
  * Integration tests verify: end-to-end alias→concrete request flow,
    alias surfacing in /v1/models, and no-op fall-through for
    non-alias model ids.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 16:10:41 +03:00
b9e7a76a7a feat(gateway): surface mid-prewarm models as Loading on /v1/models
The poller now fetches /health alongside /models on each neuron and
stashes the activation snapshot on NodeState. The /v1/models handler
gains a Pass 3 that synthesises Loading locations from each neuron's
activation.in_progress and activation.pending lists, so a catalogued
model that's mid-prewarm surfaces as `status: "loading"` rather than
appearing absent (loaded=false, locations=[]).

Without this, a client polling /v1/models during a beast restart sees
Qwen3.6-27B disappear for the ~5 minutes the q5k load takes, then
reappear. Now it stays visible the whole time with a clear status.

Adds ModelStatus::Loading to cortex-core. The router's per-node priority
loop gets an explicit (no-op) arm: Loading models aren't routable yet,
and falling through to the catalogue cold-load path is the existing
race — no worse than before, but tagged as a known follow-up needing
neuron-side in-flight tracking on /models/load.

New test_poller_captures_activation_from_health exercises the full
round-trip: mock neuron with empty /models but a pre_warming /health
→ poller writes node.activation. Common test helpers gain
spawn_mock_neuron_with_models_and_health and default_health_response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 15:26:12 +03:00
800498f530 feat(neuron): bind listener before pre-warm, surface activation in /health
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 41s
CI / Clippy (push) Successful in 2m26s
build-prerelease / Build neuron-blackwell (push) Successful in 3m34s
CI / Test (push) Successful in 4m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m29s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Two coupled changes addressing the 2026-05-26 validate-neuron failure
where a fresh deploy of beast had /health unreachable for ~5 minutes
while Qwen3.6-27B q5k materialised, even though systemd reported the
unit as active.

1. main.rs no longer awaits load_default_models before binding axum.
   The listener binds first; pre-warm runs in a spawned background
   task that holds a read lock on the harness registry for the
   duration of its sequential load loop. Concurrent on-demand
   /models/load and /v1/chat/completions traffic still flow.

2. /health gains an `activation` field carrying:
     state         pre_warming | ready
     pending       model ids queued but not started
     in_progress   model id currently loading (Option)
     completed     model ids loaded successfully this activation
     failed        [{model_id, error}] for failed entries
   The field is `#[serde(default)]` so a pre-change cortex polling a
   new neuron — or vice versa — keeps working.

`ActivationTracker` (new module `neuron::activation`) owns the
RwLock-wrapped state; load_default_models takes a tracker reference
and updates it per-model. NeuronState holds an Arc clone for the
/health handler.

Tests updated to construct trackers and assert state transitions
(empty noop, two failures → ready with both in `failed`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 15:18:04 +03:00
2740e61a23 fix(neuron,candle): name lifetime on acquire_pool_lock
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 46s
CI / Format (push) Successful in 46s
CI / Clippy (push) Successful in 2m15s
CI / Test (push) Successful in 5m8s
build-prerelease / Build cortex binary (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ampere (push) Successful in 5m25s
build-prerelease / Build neuron-ada (push) Successful in 5m3s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 7m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Lifetime elision fails when a function has two reference parameters
and returns a borrow: rustc can't infer whether the MutexGuard's
lifetime ties to `pool` or `model_id`. The non-CUDA build skipped
this code path (cfg-gated), so the error only surfaced on the GPU
build at https://git.lair.cafe/helexa/cortex/actions/runs/162.

The guard borrows the pool, so name the lifetime on `pool` and the
return type. `model_id` keeps its independent (elided) lifetime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:37:32 +03:00
67f79c868f fix(neuron,shutdown): time-bound unloads, fast-exit past tokio drain
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Format (push) Successful in 43s
CI / Clippy (push) Successful in 2m46s
build-prerelease / Build neuron-blackwell (push) Failing after 3m32s
CI / Test (push) Successful in 4m25s
build-prerelease / Build cortex binary (push) Successful in 4m20s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Two failure modes from the 2026-05-26 beast incident:

1. `unload_all_models` looped through models calling `unload_model`,
   logging individual failures at warn. The cumulative effect was a
   single warn line for the failed unload then "shutdown complete" —
   no signal that the model was actually still loaded. Now each unload
   is bounded by a 20s timeout, failures escalate to error, and a
   summary "leaving N model(s) loaded" line fires when anything is
   stuck so the operator knows the OS will reclaim VRAM after exit.

2. Returning `Ok(())` from `main` after the unload sweep dropped the
   tokio runtime, which then waited indefinitely on a CUDA-stuck
   spawn_blocking thread (the journal's "Stack trace of thread
   2951308" — spinning on `cuCtxGetCurrent`). systemd's TimeoutStopSec
   fired 2 minutes later, SIGABRT, core dump. Replacing the return
   with `std::process::exit(0)` skips the runtime drain and hands the
   OS a clean exit code; stuck threads get reaped with the process.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:30:06 +03:00
fc6ef0ee0f feat(neuron,candle): detect CUDA context poisoning and refuse follow-ups
Once a CUDA driver error has hit a forward or kv-cache call, the
device's context is unrecoverable in-process — subsequent kernels can
hang (the failure mode seen on beast on 2026-05-26), return garbage,
or trip another illegal-address. The harness now marks the model
poisoned on any forward / spawn_blocking / TP-task failure, refuses
further inference against it with a clear "unload and reload" error,
and surfaces `status: "poisoned"` on `/models` so an operator running
`curl beast:13131/models` (or cortex polling) can see the bad state.

Without this, a single OOM on a too-large prefill quietly turned every
subsequent request into a stuck wait on the pool lock; with it, the
first request fails fast with the driver error in the journal and the
client gets a usable 5xx instead of a hung connection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:28:42 +03:00
1385979e3d feat(neuron,candle): log per-device VRAM at chat_completion start
Every "starting" log line now carries vram_free_mb / vram_total_mb for
the request's serving device (the leader device on TP). On the 2026-05-26
incident this would have made the 14k-token prefill OOM diagnosable from
the first log line: with ~412 MB free, that prompt was never going to
fit, and the operator could have caught the imbalance before the CUDA
context got poisoned.

`device_vram_mb` mirrors the existing helper in tp_qwen3_5.rs and is
kept separate to avoid coupling the inference path to the TP module.
TpLoadedModel gains a `leader_device: Device` clone so the request
path reads the device without locking the leader model (which would
contend with an in-flight forward).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:26:23 +03:00
0a1cfcd4d0 feat(neuron,candle): req_id spans, terminal failure logs, pool-lock warnings
Every chat completion path (single-GPU + TP, streaming + non-streaming)
now opens an `info_span!("chat", req_id=…, model=…)`. The fmt subscriber
prefixes every event with that span so `grep req_id=…` over journalctl
reconstructs one request even when dozens overlap.

Every path also emits a terminal log line on both success ("done", with
prompt_tokens/completion_tokens/finish_reason/total_ms) and failure
("failed", with full anyhow chain + total_ms). Failures used to vanish
silently — a request that hit a CUDA OOM left "starting" in the journal
and no further trace.

New `acquire_pool_lock` helper replaces the bare `tp.pool.lock().await`
in both TP paths. It warns at 2s ("still waiting on pool lock") and
re-warns every 2s thereafter, so queued requests stuck behind a
deadlocked holder are visible immediately instead of looking like idle
silence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:25:11 +03:00
ea0e0f7911 fix(neuron,tp): log leader forward errors with full context
Worker rank failures were already surfaced at WARN, but the leader's
own forward Result::Err was silently coerced to a `leader_ok=false`
bool. When the leader and a worker both fail together — the typical
shape of a CUDA OOM cascading into an illegal-address — the journal
showed only the worker side and an operator had to guess what hit
rank 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:22:30 +03:00
aa88d37509 fix(gateway): full observability + stop leaking upstream bodies
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 42s
CI / Clippy (push) Successful in 2m27s
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
CI / Test (push) Successful in 4m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m31s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 4m53s
build-prerelease / Build neuron-ada (push) Successful in 5m7s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m3s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Comprehensive sweep across cortex-gateway's request handling. Every
failure path now emits exactly one structured warn (or error) event
on the cortex side with the wire-level detail an operator needs;
the API response carries only a generic message plus, where useful,
the upstream status code.

proxy.rs::forward_request:
- warn on network failure (network error, target URL).
- warn on upstream non-2xx (status, target URL). Streaming body still
  passes through to the client; we just can't snippet without
  breaking the stream.
- warn on response-build failure.
- ProxyError::into_response no longer interpolates the inner error
  into the API body — generic "upstream request failed" / "failed to
  build response" instead.

handlers.rs::chat_completions, handlers.rs::completions:
- warn on missing model field, with handler= label.
- warn on route resolve failure with model + error chain. The
  user-facing 404 keeps the RouteError Display string (which is
  short, informative, and contains no internal detail beyond the
  model id and config'd node names).

handlers.rs::anthropic_messages:
- warn on invalid Anthropic body, on translated-OpenAI serialise
  failure (which is internal), on route resolve, on upstream network
  error, on upstream non-2xx (with 512-char body snippet for parse
  errors), on upstream body read, on response parse.
- All warns share consistent field shape: handler, model, node, url,
  status / error / body as applicable.
- API response messages are now uniformly generic.
- Adds an info-level "proxying request" log on the non-streaming
  path so successful proxies are also visible.

handlers.rs::proxy_with_metrics:
- still calls e.into_response() but proxy::forward_request already
  warn'd at the wire layer, so no double-log here.

Tests:
- All 32 existing unit tests + 22 gateway integration tests + 4
  new router tests pass.
- Tests that asserted on the "no healthy nodes" / "not found"
  strings still match because RouteError messages are preserved
  in the 404 user-facing path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:17:26 +03:00
0f00f72b47 fix(router,handlers): strip trailing slash from rewritten URL + log upstream failures
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m20s
CI / Test (push) Successful in 4m41s
build-prerelease / Build neuron-blackwell (push) Successful in 3m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m31s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Two coupled bugs surfaced after 9b0ed0b:

1. url::Url::parse("http://host:port").to_string() normalises the
   empty path to "/", so rewrite_loopback_host was returning
   "http://beast:13131/". Downstream callers then did
   format!("{endpoint}/v1/chat/completions") and produced a
   double-slash path that neuron's axum router 404'd with an empty
   body. Strip the trailing slash in the rewriter so the endpoint is
   a clean base string for concatenation.

2. The anthropic_messages handler returned the upstream's empty body
   to the API caller as `"upstream error: "` with no journal log on
   the cortex side. Operators had no way to see what happened. Add
   warn-level tracing on both upstream failure paths (network error
   and non-2xx) with model, node, target URL, status, and a 512-char
   body snippet. The API response now carries just `"upstream
   returned <status>"` — the implementation detail lives in the log.

Updates the two existing rewrite tests for the no-trailing-slash
output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:10:39 +03:00
9b0ed0b57f fix(router): rewrite loopback inference URLs to use neuron's host
Some checks failed
CI / Format (push) Successful in 30s
build-prerelease / Resolve version stamps (push) Successful in 41s
build-prerelease / Build neuron-blackwell (push) Successful in 3m34s
CI / Clippy (push) Successful in 7m25s
build-prerelease / Build neuron-ampere (push) Successful in 4m57s
build-prerelease / Build cortex binary (push) Successful in 4m15s
build-prerelease / Build neuron-ada (push) Successful in 5m14s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m6s
CI / Test (push) Failing after 4m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Neuron hardcodes its bind_url as `http://localhost:13131` (it can't
reliably know its own externally-resolvable name). When cortex runs
on a different host than the neuron it's routing to, blindly
proxying to that URL hits localhost on the cortex box instead of the
neuron.

Cortex already knows each neuron's reachable host from cortex.toml.
After fetching the inference URL from `/models/{id}/endpoint`, if
the host is a loopback name (localhost / 127.0.0.1 / 0.0.0.0 / ::1),
swap it for the configured neuron host. Preserve the port and path
from neuron's URL so a future harness serving inference on a
different port than the management API still works.

Adds `url` (already a transitive dep via reqwest) as a direct
dep for the URL parsing.

Tests cover: localhost rewrite, distinct inference port preservation,
non-loopback passthrough, malformed input.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:23:47 +03:00
e71181499e feat(stage-8e-3): quantize lm_head in TP Qwen3-Next
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 42s
build-prerelease / Build neuron-blackwell (push) Successful in 3m43s
build-prerelease / Build cortex binary (push) Successful in 4m25s
build-prerelease / Package cortex RPM (push) Successful in 1m26s
build-prerelease / Build neuron-ampere (push) Successful in 5m23s
build-prerelease / Build neuron-ada (push) Successful in 4m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m42s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
CI / Format (push) Successful in 30s
CI / Clippy (push) Successful in 2m19s
CI / Test (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
TpQwen3_5ForCausalLM::lm_head is now a MaybeQuantLinear. When the
load spec has quant set and tie_word_embeddings is false, lm_head's
(vocab_size, hidden_size) weight is quantized in-situ at load time
along with all the per-layer linears. The non-tied case on
Qwen3.6-27B saves ~1.7 GB per rank vs bf16 (248320 x 5120 x 2
bytes = 2.42 GB -> ~700 MB at Q5K) and shaves a small amount of
decode latency from the per-token logits matmul.

Tied case (tie_word_embeddings=true) keeps the lm_head plain even
when quant is set — quantizing the shared tensor would corrupt the
embedding lookup, and the tied case already gets the memory win
from only holding one copy.

This is the last MaybeQuantLinear hookup in the Qwen3-Next TP path.
The dense Qwen3 path (tp_qwen3.rs) is unchanged — defer until it's
the bottleneck for a model that actually needs TP at consumer scale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 21:53:14 +03:00
ee663e5e99 fix(stage-8e-2e): bump quant prefill threshold to M > 64
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
CI / Format (push) Successful in 34s
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Clippy (push) Successful in 2m20s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
The M > 8 threshold from 8e-2d activated forward_via_f16 on the test
case (M=30) and slightly regressed prefill (143 -> 133 T/s). The
dequant cost (~30 MB f16 per linear * ~480 calls per prefill = ~200 ms)
eats the cuBLAS GEMM speedup at small M.

Move the crossover to M > 64 so short prefills (typical for the
validate probe) stay on the GGUF GEMV kernel where per-call cost is
comparable but the dequant tax is zero. Long prefills still get the
dequant-then-cuBLAS-GEMM path where the GEMM scaling amortises the
fixed dequant cost.

Doesn't close the gap to mistralrs's 423 T/s on Q5K prefill — that
needs either a dequant cache (gives back the ISQ memory win) or a
fused dequant+gemm kernel. Both larger projects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 21:50:45 +03:00