cortex

Author	SHA1	Message	Date
rob thijssen	d311c8ca7a	feat(neuron): operator pixel-budget env override + doc cleanup (#14 C5) - PreprocessProfile::qwen3_6() reads NEURON_VISION_MIN_PIXELS / NEURON_VISION_MAX_PIXELS (clamped to factor² ≤ min ≤ max), matching the NEURON_VISION_LEGACY_* / NEURON_MROPE knob convention. Defaults remain 256²…1024² (64…1024 LM tokens/image). - Test: a max-resolution source caps within the token budget (can't blow NEURON_MAX_PROMPT_TOKENS). - Strip stale fixed-resolution / "MRoPE gap (#15)" / 14×14 language from the preprocess, mod, and rope doc-comments now that resolution is dynamic and M-RoPE is implemented. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 22:50:03 +03:00
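A minimal sketch of the clamping rule named in the commit above (factor² ≤ min ≤ max), written as a pure function so it can be exercised without touching the process environment. The names `parse_or` and `resolve_pixel_budget`, the fallback-on-parse-error behaviour, and the order in which the two bounds are clamped are assumptions for illustration, not the project's actual code.

```rust
/// Default pixel budget quoted in the commit message: 256² … 1024².
const DEFAULT_MIN_PIXELS: u64 = 256 * 256;
const DEFAULT_MAX_PIXELS: u64 = 1024 * 1024;

/// Parse an optional raw override. Missing or unparsable values fall back
/// to `default` (assumed behaviour; the real daemon may reject them instead).
fn parse_or(raw: Option<&str>, default: u64) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok()).unwrap_or(default)
}

/// Resolve `(min_pixels, max_pixels)` so that `factor² <= min <= max`.
/// Hypothetical helper: `min` is raised to at least one factor×factor cell,
/// then `max` is raised to at least `min`.
fn resolve_pixel_budget(min_raw: Option<&str>, max_raw: Option<&str>, factor: u64) -> (u64, u64) {
    let floor = factor * factor;
    let min = parse_or(min_raw, DEFAULT_MIN_PIXELS).max(floor);
    let max = parse_or(max_raw, DEFAULT_MAX_PIXELS).max(min);
    (min, max)
}
```

In a daemon the two raw values would typically come from `std::env::var("NEURON_VISION_MIN_PIXELS").ok()` and `std::env::var("NEURON_VISION_MAX_PIXELS").ok()`, passed in via `as_deref()`.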
rob thijssen	c97a8654f5	feat(neuron): dynamic-resolution images via Qwen smart_resize (#14 ) Replace the fixed 448×448-square preprocess with native-aspect `smart_resize`, and thread the resulting per-image grid through the LM so spatial structure survives non-square images (documents, screenshots, charts, panoramas, OCR) instead of being squished into a square. - preprocess.rs: port Qwen `smart_resize` (factor = patch×merge = 32; pixel budget [min,max], default 256²–1024² → 64–1024 LM tokens). `PreprocessProfile` drops the fixed target dims for `factor`/`min_pixels`/ `max_pixels`; `preprocess`/`preprocess_data_uri` now return the resized `(h, w)`; add `resized_dims_for_uri` (decode + resize, no normalize) for the TP leader's token count. - rope.rs: `compute_mrope_index`/`get_rope_index` take per-image `grids: &[(lm_gh, lm_gw)]` instead of assuming a square `isqrt(run)`. Walk image runs in order, validate `run == gh*gw`, emit row-major positions, resume the shared counter at `base + max(gh,gw)`. Correct for multiple images of differing grids interleaved with text. - candle.rs: `VisionMeta`/`LoadedModel`/`TpLoadedModel` carry the `image_grid_factor` (patch×merge) instead of the constant 196; all four prompt-build sites compute per-image counts from each image's resized grid (single-GPU from the extracted `ImageInput.h/w`, TP from `resized_dims_for_uri`). `ModelArch` gains `vision_grid_factor`.
- single-GPU (`mod.rs`, `dispatch.rs`) and TP (`tp_qwen3_5.rs::prefill_with_images_chunked`, `dispatch.rs`, `tp/worker.rs`) thread the grids into `get_rope_index`. Each TP rank recomputes grids from its own deterministic preprocess — no rpc.rs change, single source of truth. The vision tower itself was already grid-general (recent pos-embed interpolation + 2D rotary fix). No patch-count cap: pos-embed is interpolated to any grid; `max_pixels` bounds cost (O(patches²) ViT attention + prefill) instead. Tests: smart_resize (aspect/cap/floor/reject), `compute_mrope_index` non-square + two-image + mismatch cases, square-grid regression guard. Non-cuda build + clippy + full workspace tests green; TP load/dispatch paths are cuda-gated → Gitea CUDA type-check. Operator pixel-budget config + remaining doc cleanup follow in C5. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 22:47:27 +03:00
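The resize rule in the commit above can be sketched as follows, modelled on the publicly known Qwen `smart_resize` reference: round each side to a multiple of `factor`, then rescale so the area falls inside the pixel budget. This is a sketch under stated assumptions, not the code in `preprocess.rs`; the 200:1 aspect-ratio rejection is taken from the reference implementation, and `lm_tokens` is a hypothetical helper showing how 256² and 1024² map to 64 and 1024 LM tokens at factor 32.

```rust
/// Qwen-style smart resize: returns `(height, width)`, both multiples of
/// `factor`, with area steered into `[min_pixels, max_pixels]` while roughly
/// preserving the aspect ratio.
fn smart_resize(
    height: u32,
    width: u32,
    factor: u32,
    min_pixels: u64,
    max_pixels: u64,
) -> Result<(u32, u32), String> {
    if height == 0 || width == 0 {
        return Err("zero-sized image".to_string());
    }
    let (h, w, f) = (height as f64, width as f64, factor as f64);
    // Reject extreme aspect ratios (limit assumed from the reference code).
    if h.max(w) / h.min(w) > 200.0 {
        return Err("aspect ratio exceeds 200:1".to_string());
    }
    // Round each side to the nearest multiple of `factor`, at least one cell.
    // Note: f64::round breaks ties away from zero, unlike Python's round().
    let mut h_bar = ((h / f).round() * f).max(f);
    let mut w_bar = ((w / f).round() * f).max(f);
    let area = h_bar * w_bar;
    if area > max_pixels as f64 {
        // Too large: shrink both sides by a common scale and round down.
        let beta = (h * w / max_pixels as f64).sqrt();
        h_bar = ((h / beta / f).floor() * f).max(f);
        w_bar = ((w / beta / f).floor() * f).max(f);
    } else if area < min_pixels as f64 {
        // Too small: grow both sides by a common scale and round up.
        let beta = (min_pixels as f64 / (h * w)).sqrt();
        h_bar = (h * beta / f).ceil() * f;
        w_bar = (w * beta / f).ceil() * f;
    }
    Ok((h_bar as u32, w_bar as u32))
}

/// LM tokens for a resized image: one token per factor×factor cell.
fn lm_tokens(h: u32, w: u32, factor: u32) -> u32 {
    (h / factor) * (w / factor)
}
```

Because the upper bound is enforced by rounding down, a large source always lands at or under `max_pixels`, which is what lets the pixel budget double as a token budget.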
rob thijssen	dc048ffcc9	fix(neuron): vision-tower 2D positions + M-RoPE default on Two fixes to the spatial handling of images, validated against the HF transformers 4.57.1 qwen3_vl reference on beast. Vision tower (the real cause of poor spatial vision). The Stage-A tower encoded position two ways wrong, so the model saw image content but not layout (a row of 5 people read as "a line of 23", sky inverted), regardless of the LM-side rope: - Learned pos-embed was a naive sequential lookup of the first `n_patches` rows of the 48×48 (`num_position_embeddings=2304`) grid — wrong stride for a 28×28 patch grid.
Now bilinearly interpolates the grid to `gh×gw` (port of HF `fast_pos_embed_interpolate`), row-major. - The 2D vision rotary was absent entirely. Added `VisionRotaryEmbedding` (θ=10000, dim=head_dim/2) applying per-patch `(row, col)` rotary to q/k in every ViT block via rope_slow, matching HF `apply_rotary_pos_emb_vision`. Both default on; `NEURON_VISION_LEGACY_POS=1` / `NEURON_VISION_LEGACY_ROPE=1` revert each for A/B (no rebuild). New unit tests: interpolation reduces to the sequential lookup at the native grid; rotary row/col structure. M-RoPE default on. The interleaved M-RoPE matches HF apply_interleaved_mrope / get_rope_index exactly and A/B'd strictly ≥ plain. `NEURON_MROPE` is now a kill switch (`=0` for plain), not opt-in — defaults should encode the model's trained behaviour, not freeze the broken state. Vision tower is plain candle (CPU-testable): built, clippy-clean, full workspace tests green locally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 20:53:07 +03:00
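The position-embedding fix above can be illustrated with a plain bilinear resample of a square table onto a `gh×gw` grid. This is a minimal sketch over flat `f32` slices, assuming endpoint-aligned sampling (a linspace from 0 to side-1); `interpolate_pos_embed` is a hypothetical name, and any merge-block reordering the real tower may apply afterwards is omitted.

```rust
/// Bilinearly resample a learned `side × side` position-embedding table
/// (row-major, `dim` floats per cell) onto a `gh × gw` grid.
/// Returns `gh * gw` rows of `dim` floats, row-major.
fn interpolate_pos_embed(table: &[f32], side: usize, dim: usize, gh: usize, gw: usize) -> Vec<f32> {
    assert!(side >= 1);
    assert_eq!(table.len(), side * side * dim);
    // Map output index i in [0, n) to a fractional source coordinate in
    // [0, side - 1], with both endpoints aligned.
    let coord = |i: usize, n: usize| -> f64 {
        if n <= 1 { 0.0 } else { i as f64 * (side - 1) as f64 / (n - 1) as f64 }
    };
    let mut out = Vec::with_capacity(gh * gw * dim);
    for r in 0..gh {
        let y = coord(r, gh);
        let y0 = y.floor() as usize;
        let y1 = (y0 + 1).min(side - 1);
        let dy = (y - y.floor()) as f32;
        for c in 0..gw {
            let x = coord(c, gw);
            let x0 = x.floor() as usize;
            let x1 = (x0 + 1).min(side - 1);
            let dx = (x - x.floor()) as f32;
            for d in 0..dim {
                let at = |yy: usize, xx: usize| table[(yy * side + xx) * dim + d];
                // Blend along x on the two bracketing rows, then along y.
                let top = at(y0, x0) * (1.0 - dx) + at(y0, x1) * dx;
                let bot = at(y1, x0) * (1.0 - dx) + at(y1, x1) * dx;
                out.push(top * (1.0 - dy) + bot * dy);
            }
        }
    }
    out
}
```

When `gh == gw == side` every fractional part is zero, so the output equals the table itself, which is the "reduces to the sequential lookup at the native grid" property the commit's unit test checks.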
rob thijssen	7ebcfba5ca	fix(neuron): gate M-RoPE behind NEURON_MROPE (default off) On beast the interleaved M-RoPE degraded image understanding rather than fixing it: the model misread spatial layout (a horizontal row of people described as a "diagonal receding line"), got attributes wrong, and rambled — a "how many people" follow-up generated 4459 tokens over 3.5 minutes, past agent-0's HTTP timeout (the "fails to respond without an error"). The interleave is evidently not numerically correct, and it can't be validated remotely without a transformers reference.
Gate it: `get_rope_index` now returns plain sequential identity positions unless NEURON_MROPE is truthy, so mrope_cos_sin reduces to plain RoPE and image tokens behave exactly as pre-M-RoPE (content recognition works; spatial layout approximate; no rambling). The real computation moves to `compute_mrope_index` (still unit-tested). Default off restores the working vision and unblocks agent-0; the M-RoPE code stays in place to debug + validate before flipping the default on. Pure non-cuda change (rope.rs); both single-GPU and TP forwards call the gated get_rope_index unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 19:32:59 +03:00
rob thijssen	825bf4e905	feat(neuron): M-RoPE Stage 4 — wire interleaved M-RoPE into the TP path Mirror Stage 3 into the tensor-parallel Qwen3.6 model: - TpQwen3_5Attention / DecoderLayer take (cos, sin) instead of a scalar offset and apply via apply_cos_sin. - TpQwen3_5Model gains the replicated rotary + rope_delta (reset in clear_kv_cache, settable). forward_inner builds the cos/sin once — interleaved M-RoPE from explicit position_ids (vision) or plain at offset+rope_delta (text/decode). forward() and forward_with_positions() delegate; the old single-shot forward_with_vision is gone.
- prefill_with_images_chunked now computes get_rope_index over the whole prompt once, stores rope_delta on the base model, and slices the (3, prompt_len) position tensor per chunk — so every rank assigns image tokens their 14×14 grid coordinates and steps in lockstep (every chunk, text or image, carries the M-RoPE slice because the image shifts the surrounding text positions). Also build the position-id tensor as f32 directly (positions are small integers, exact in f32) to avoid an i64→f32 cast on the GPU. The TP forward is cuda-gated — CI CUDA type-check is the compile gate. Non-cuda build + clippy + full workspace tests green; rope math + the plain-RoPE-reduction invariant covered by unit tests. Completes the interleaved-M-RoPE work for the vision spatial misread. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 18:46:27 +03:00
rob thijssen	4c12c7e2f0	feat(neuron): M-RoPE Stage 3 — wire interleaved M-RoPE into single-GPU Qwen3_5Model now builds the rotary cos/sin once per forward and threads (cos, sin) through the decoder → full-attention → rope, replacing the scalar offset that reached RotaryEmbedding: - vision forward computes get_rope_index over the (single-shot) prompt, sets rope_delta, and builds interleaved-M-RoPE cos/sin so image tokens carry their 14×14 grid (height/width) positions; - text / decode take plain_cos_sin at offset + rope_delta — with rope_delta == 0 (no image) this is bit-for-bit the old plain RoPE, and the device→host id copy is skipped on the text decode hot path. rope_delta is stored on the model and reset in clear_kv_cache, so decode after a vision prefill resumes text positions from the image-compressed counter. decoder.rs / full_attn.rs take (cos, sin) instead of offset; linear-attention layers are unchanged (no RoPE). The TP path still uses the retained apply(offset) — wired in Stage 4. Full workspace tests green; the load-bearing invariant (M-RoPE == plain for equal axes) keeps text unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 18:39:52 +03:00
rob thijssen	ba1b5ba408	feat(neuron): M-RoPE Stage 2 — get_rope_index position-id helper Pure function computing the interleaved-M-RoPE 3D position ids for a prompt with image-placeholder runs, plus the decode rope_delta: text tokens advance a single counter (all axes equal); each image run gets [base+t, base+h, base+w] row-major over a square grid_t=1, grid_h=grid_w=isqrt(run) (196 → 14×14); the counter resumes from base + max(grid). rope_delta = final_counter - seq_len lets decode resume text positions after the position-compressed image blocks. Plus mrope_position_tensor to build the (3, seq) tensor. Unit tests: text-only is sequential (delta 0); text+image+text matches hand-computed grid ids + resume + delta; 196 → 14×14; non-square run rejected; end-to-end through mrope_cos_sin tracks the height axis. #[allow(dead_code)] until Stage 3/4 wire it into the forward. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 18:34:28 +03:00
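The position-id rule described in the Stage 2 commit above can be sketched as a pure function over token ids. This is an illustrative reconstruction, not the project's `get_rope_index`: the name `rope_index_square` is hypothetical, and it assumes each maximal run of the image placeholder id is exactly one square image (`grid_t = 1`).

```rust
/// Integer square root for small run lengths.
fn isqrt(n: usize) -> usize {
    let mut r = (n as f64).sqrt() as usize;
    while r * r > n {
        r -= 1;
    }
    while (r + 1) * (r + 1) <= n {
        r += 1;
    }
    r
}

/// Returns the three position rows `[t, h, w]` plus `rope_delta`.
/// Text tokens advance one shared counter on all axes; an image run of
/// g*g placeholders gets row-major grid positions offset from `base`,
/// and the counter resumes at `base + g`.
fn rope_index_square(tokens: &[u32], image_token_id: u32) -> Result<([Vec<i64>; 3], i64), String> {
    let mut pos: [Vec<i64>; 3] = [Vec::new(), Vec::new(), Vec::new()];
    let mut counter: i64 = 0;
    let mut i = 0;
    while i < tokens.len() {
        if tokens[i] != image_token_id {
            for axis in pos.iter_mut() {
                axis.push(counter);
            }
            counter += 1;
            i += 1;
            continue;
        }
        // Consume the maximal run of image placeholders.
        let start = i;
        while i < tokens.len() && tokens[i] == image_token_id {
            i += 1;
        }
        let run = i - start;
        let g = isqrt(run);
        if g * g != run {
            return Err(format!("image run of {run} tokens is not a square grid"));
        }
        let base = counter;
        for row in 0..g {
            for col in 0..g {
                pos[0].push(base); // grid_t = 1, so the temporal offset is 0
                pos[1].push(base + row as i64);
                pos[2].push(base + col as i64);
            }
        }
        counter = base + g as i64; // resume from base + max(grid)
    }
    // Decode continues at `seq_len + rope_delta`, i.e. at the final counter.
    let rope_delta = counter - tokens.len() as i64;
    Ok((pos, rope_delta))
}
```

For a prompt of two text tokens, a 2×2 image and one more text token, the image occupies four sequence slots but only advances the counter by two, so `rope_delta` comes out as -2.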
rob thijssen	5731f4c318	feat(neuron): M-RoPE Stage 1 — interleaved rope machinery + config Parse + store mrope_section / mrope_interleaved in RopeParameters (previously accepted-but-ignored). RotaryEmbedding gains: - inv_freq + per-axis column masks (mask_t/h/w) built from mrope_section; - plain_cos_sin(pos, seq_len): narrow the precomputed tables (text/decode); - mrope_cos_sin(position_ids (3,seq)): per-axis freqs blended at the interleave columns (vision); - apply_cos_sin(q,k,cos,sin): the rope_slow application, factored out. The existing apply(q,k,offset) is retained (delegates to plain_cos_sin + apply_cos_sin) so current callers are unchanged; Stages 3–4 move cos/sin construction into the model forward and thread the 3D position ids for image tokens. Tests: masks partition the half-dim; interleave drives the right axis per column; and the load-bearing invariant — mrope_cos_sin reduces bit-for-bit to plain_cos_sin when the three axes are equal (so text inference is unchanged). Refs the MRoPE-gap diagnosis (vision spatial misread). Pure non-cuda; no behaviour change until wired. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 18:31:15 +03:00
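To make the per-axis column masks in the Stage 1 commit above concrete, here is a small sketch of an interleaved axis assignment, modelled on the layout used by HF's `apply_interleaved_mrope`: columns 1, 4, 7, … below `3 * section[1]` follow the height axis, columns 2, 5, 8, … below `3 * section[2]` follow the width axis, and everything else follows the temporal axis. The function names and the scalar (non-tensor) form are assumptions, as is the example section `[24, 20, 20]` used when trying it out.

```rust
/// Which axis (0 = t, 1 = h, 2 = w) drives each rotary frequency column.
/// The half-dim is the sum of the three section sizes.
fn interleave_axes(mrope_section: [usize; 3]) -> Vec<usize> {
    let half_dim: usize = mrope_section.iter().sum();
    let mut axes = vec![0usize; half_dim];
    for axis in 1..=2 {
        let end = (mrope_section[axis] * 3).min(half_dim);
        let mut col = axis;
        while col < end {
            axes[col] = axis;
            col += 3;
        }
    }
    axes
}

/// Rotary angle per column: column `c` uses the position of axis `axes[c]`.
/// With all three positions equal this is exactly `pos * inv_freq`,
/// i.e. plain RoPE.
fn mrope_angles(pos: [f64; 3], inv_freq: &[f64], axes: &[usize]) -> Vec<f64> {
    inv_freq.iter().zip(axes).map(|(f, &a)| pos[a] * f).collect()
}
```

The plain-RoPE reduction follows directly: if `pos[0] == pos[1] == pos[2]`, the axis choice per column no longer matters, which is why text tokens are unaffected.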
rob thijssen	fa013505d1	fix(neuron): chunked TP-vision prefill + pre-flight VRAM guard agent-0 sent a ~13k-token prompt + image; the TP vision prefill was single-shot, so it tried to materialise activations for all 12,960 positions at once and OOM'd rank 1 mid-forward. Rank 1 died before issuing its row-parallel AllReduce, stranding rank 0 on the collective (it hung holding the pool lock). The text path survives the same size because it chunks the prefill.
Chunk the vision prefill the same way: - TpQwen3_5ForCausalLM::prefill_with_images_chunked encodes the image(s) once, then walks the pre-expanded prompt in prefill_chunk_tokens() windows, splicing the patch-embedding rows into whichever chunk(s) carry <|image_pad|> positions (pure-text chunks take the plain forward). Activation is bounded by the chunk, not the prompt. - Every rank runs the identical chunk sequence (chunk_size threaded through GenerateStepWithImages / TpForwardLogitsWithImages / generate_step_with_images), so the per-chunk AllReduces stay paired across ranks with no extra sync — the KV cache accumulates via the growing offset, only the last chunk's logits are kept. Pre-flight guard (validate_vision_prefill): even chunked, a long prompt's KV cache can exhaust VRAM mid-forward, and on TP that hangs the collective. Reject up front with a clean InsufficientVram when the estimated footprint exceeds free VRAM, so a doomed request fails fast instead of hanging the daemon. Heuristic + tunable (NEURON_VISION_PREFILL_MB_PER_1K_TOKENS / _BASE_MB); default permissive so the now-working 12,960-token case still passes. Applied to every vision path (single-GPU + TP); single-GPU vision stays single-shot for now, so the guard is its protection until it's chunked too. Tests: pre-flight guard behaviour; RPC round-trip carries chunk_size. The chunked forward is cuda-gated — CI CUDA type-check validates it. Refs #16 / TP-vision. Operational note: a TP rank OOM still hangs the daemon (needs restart); making a worker failure abort the leader's collective is separate, broader TP hardening. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 17:21:36 +03:00
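The chunking and the pre-flight guard from the commit above can be sketched as three small pure functions. Everything here is illustrative: `chunk_windows`, `splice_plan`, and the linear "base + MB per 1k tokens" estimate are assumptions about the shape of the logic, and the numeric values used when calling them are made up, since the commit does not state the defaults of the two env knobs.

```rust
use std::ops::Range;

/// Split a prompt of `prompt_len` tokens into consecutive windows of at most
/// `chunk_size` tokens. Every rank derives the same windows from the same
/// inputs, so per-chunk collectives stay paired.
fn chunk_windows(prompt_len: usize, chunk_size: usize) -> Vec<Range<usize>> {
    assert!(chunk_size > 0);
    (0..prompt_len)
        .step_by(chunk_size)
        .map(|start| start..(start + chunk_size).min(prompt_len))
        .collect()
}

/// For one window, list `(position within chunk, image-embedding row)` pairs
/// to splice. An empty plan means the chunk is pure text.
fn splice_plan(tokens: &[u32], image_token_id: u32, window: Range<usize>) -> Vec<(usize, usize)> {
    // Embedding rows already consumed by earlier chunks.
    let mut next_row = tokens[..window.start]
        .iter()
        .filter(|&&t| t == image_token_id)
        .count();
    let mut plan = Vec::new();
    for (local, &tok) in tokens[window].iter().enumerate() {
        if tok == image_token_id {
            plan.push((local, next_row));
            next_row += 1;
        }
    }
    plan
}

/// Reject a request up front when a (hypothetical) linear footprint estimate
/// exceeds free VRAM, so it fails fast instead of dying mid-forward.
fn validate_vision_prefill(
    prompt_tokens: usize,
    free_mb: u64,
    mb_per_1k_tokens: u64,
    base_mb: u64,
) -> Result<(), String> {
    let estimate_mb = base_mb + (prompt_tokens as u64 * mb_per_1k_tokens + 999) / 1000;
    if estimate_mb > free_mb {
        return Err(format!("insufficient VRAM: need ~{estimate_mb} MB, {free_mb} MB free"));
    }
    Ok(())
}
```

Keeping the window computation free of any per-rank state is the design point: identical inputs give identical chunk sequences, so no extra synchronisation is needed between ranks.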
rob thijssen	c8bcaabc38	fix(neuron): render HF chat templates via minijinja pycompat The Qwen3.6 chat_template.jinja (now loaded after the precedence fix) failed to render in minijinja: it uses Python str methods (content.startswith/endswith/split/rstrip/lstrip) and the raise_exception global that HF transformers patches into its Jinja env but minijinja doesn't provide. The render error tripped the text-only fallback, so image requests still produced zero <|image_pad|> tokens.
Wire the standard bridge into render_chat_template: - minijinja-contrib `pycompat::unknown_method_callback` supplies the Python string/list/dict methods; - a `raise_exception` global maps to a render error (so malformed inputs — e.g. an image in a system message — surface cleanly). Add the real Qwen3.6-27B chat_template.jinja (verbatim from beast's HF cache) as a test fixture and assert it renders one <|image_pad|> for a text+image turn — the end-to-end check that would have caught this before deploy. Refs #16 / TP-vision. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 16:32:23 +03:00
rob thijssen	7ad56c6a86	fix(neuron): load chat_template.jinja (transformers precedence) The chat-template loader only read the `chat_template` field from tokenizer_config.json. Qwen3.6-27B ships its vision-aware template only in a standalone `chat_template.jinja` (and has no tokenizer_config.json at all), so the loader returned None and image requests fell back to the text-only format_qwen3_prompt — rendering zero `<|image_pad|>` tokens and tripping "expand_image_pad_tokens: prompt has 0 image_token_id occurrences". load_chat_template_alongside now follows HF transformers precedence: standalone chat_template.jinja → chat_template.json → the chat_template field in tokenizer_config.json. Tests cover the precedence, the text-only fallback, and that an OpenAI image_url content part renders `<|image_pad|>` through the real template condition (`'image_url' in item`). Refs #16 / TP-vision. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 16:25:30 +03:00
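The precedence order in the commit above reduces to a first-match selection over three optional sources. The sketch below assumes the three candidates have already been read and parsed into `Option<String>`; `TemplateSource` and `pick_chat_template` are hypothetical names, not the project's `load_chat_template_alongside`.

```rust
/// Where a chat template was found, in precedence order.
#[derive(Debug, PartialEq)]
enum TemplateSource {
    StandaloneJinja,  // chat_template.jinja
    ChatTemplateJson, // chat_template.json
    TokenizerConfig,  // `chat_template` field in tokenizer_config.json
}

/// Pick the first available template following the stated precedence.
/// `None` means the caller falls back to the text-only prompt format.
fn pick_chat_template(
    standalone_jinja: Option<String>,
    chat_template_json: Option<String>,
    tokenizer_config: Option<String>,
) -> Option<(TemplateSource, String)> {
    standalone_jinja
        .map(|t| (TemplateSource::StandaloneJinja, t))
        .or_else(|| chat_template_json.map(|t| (TemplateSource::ChatTemplateJson, t)))
        .or_else(|| tokenizer_config.map(|t| (TemplateSource::TokenizerConfig, t)))
}
```

Returning the source alongside the template text makes the fallback path observable in logs and tests, which is useful when a model ships only one of the three files.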
rob thijssen	1b0e36c119	fix(neuron): cover TpForwardLogitsWithImages in drain_poisoned match The CUDA type-check caught a non-exhaustive match: drain_poisoned() must reply an error to every Job variant's reply channel, including the new cuda-gated TpForwardLogitsWithImages. The non-cuda build couldn't see it — the variant is #[cfg(feature = "cuda")], so the match is exhaustive without it on CPU. Refs TP-vision plan Stage 2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 15:26:46 +03:00
rob thijssen	ed2d09864e	feat(neuron): TP-vision Stage 3 — wire TP chat + stream vision prefill End-to-end TP-vision: an image request to a TP-loaded Qwen3.6-27B now conditions on the image across both ranks. - TpLoadedModel carries has_vision / image_token_id / lm_tokens_per_image, populated at load via the shared VisionMeta::from_config_path (same config.json the shards loaded from; Stage 1 materialises the replicated tower on every rank). - LoadedHandle::capabilities() now advertises "vision" for TP loads with a tower (cortex-gateway already unions this into /v1/models via C3). - The TP rejection guards (chat_completion_tp + inference_tp_stream) are now conditional on !has_vision — text-only TP models still 400 cleanly, vision-capable ones fall through. - chat_completion_tp_inner and the streaming orchestration task detect images (request_has_images), expand <|image_pad|> to the per-image patch count, and run a single-shot generate_step_with_images prefill (every rank encodes + splices its replicated tower) before the unchanged decode loop. Text requests keep chunked_prefill_tp. - extract_image_data_uris ships the source data URIs to every rank for identical per-rank preprocessing. prompt_tokens now reflects the patch expansion, so usage accounting and KV offsets match the single-GPU baseline. TP entry points are cuda-gated (validated by CI's CUDA type-check); capabilities() + extract_image_data_uris + VisionMeta reuse compile on the non-cuda build.
Full workspace test green. Refs TP-vision plan Stage 3. Implements #12. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 15:14:44 +03:00
rob thijssen	4994b94c84	feat(neuron): TP-vision Stage 2 — per-rank image RPC + worker plumbing Carry image content through the TP forward path so every rank encodes and splices locally (replicated tower, no embedding broadcast). - rpc.rs: new WorkerRequest::GenerateStepWithImages carrying the source image data URIs + image_token_id for the single-shot vision prefill; worker still replies GenerateStepOk. Round-trip test added. - tp_qwen3_5.rs: TpQwen3_5ForCausalLM::forward_with_images — encode each preprocessed image through the rank's replicated tower, cat, splice, forward. Shared by leader and worker so every rank runs identical work. - tp/mod.rs: TpLeaderModel::forward_with_images and WorkerPool::generate_step_with_images (mirrors generate_step: fan out GenerateStepWithImages to subprocess ranks, run the leader's image forward on its device worker thread, drain, combine). - worker.rs: WorkerModel::forward_with_images + handle_generate_step_with_images — each subprocess rank preprocesses the same data URIs via the shared deterministic preprocess_data_uri, encodes, splices, forwards. - device_worker: Job::TpForwardLogitsWithImages + tp_forward_logits_with_images dispatch handler + DeviceWorkerHandle::tp_forward_logits_with_images. Determinism: every rank runs the same preprocess on the same source URIs through the same replicated tower, so the spliced hidden state matches across ranks — preserving the replicated-hidden-state invariant the row-parallel AllReduce relies on, with no NCCL broadcast. No caller yet — Stage 3 wires the TP chat/stream entry points to invoke generate_step_with_images for image prefill. cuda-gated plumbing covered by CI's CUDA type-check; rpc/route/forward_with_images compile on the non-cuda build. Refs TP-vision plan Stage 2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 15:08:08 +03:00
rob thijssen	9a24b05866	feat(neuron): TP-vision Stage 1 — replicated vision tower on the TP model Load the full, unsharded model.visual.* vision tower on every TP rank (leader + each subprocess worker mmaps the same local safetensors) when config.vision_config is present. VisionTower::load already takes a ShardedVarBuilder whose plain .get() returns the full replicated tensor, so the tower loads identically regardless of world_size — no sharding, no NCCL broadcast. - TpQwen3_5ForCausalLM gains vision: Option<VisionTower> + image_token_id, plus has_vision/image_token_id/encode_image/forward_with_vision, mirroring the single-GPU Qwen3_5ForCausalLM wrapper. - TpQwen3_5Model::forward_with_vision mirrors the single-GPU forward_inner splice: embed locally, replace rows at image_token_id positions, run the sharded decoder stack. Because every rank encodes the same pixels through its replicated tower, the spliced input embeddings are identical across ranks — preserving the TP replicated-hidden-state invariant the row-parallel AllReduce relies on. - splice_runs is now pub(crate) and shared with the TP model. No caller yet — Stage 2 wires the RPC/worker path that invokes encode_image + forward_with_vision per rank. Most of this compiles on the non-cuda build (only the cuda load variant's tower line is gated); CI's CUDA type-check covers the rest. Refs TP-vision plan Stage 1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 15:00:05 +03:00
rob thijssen	f8c0da0ebf	fix(neuron): TP-vision Stage 0 — reject image requests on the TP path The TP inference path has no vision tower, and the TP dispatch in chat_completion / inference_stream returns before the VisionUnsupported guard runs — so an image request to a TP-loaded model (e.g. beast's tp=2 Qwen3.6-27B) was silently dropped and answered from text alone, the exact issue-#3 confident-hallucination pattern Stage C killed for single-GPU. Add the request_has_images → VisionUnsupported guard to both chat_completion_tp and inference_tp_stream, before prefill / before the SSE stream opens, so beast returns a clean 400 vision_unsupported.
The guard is unconditional for now (TP has no tower); Stage 3 makes it conditional on the TP model's has_vision once real TP-vision lands. Detection is covered by the existing request_has_images unit test; the guard itself is cuda-gated (validated by CI's CUDA type-check). Refs TP-vision plan Stage 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 14:53:56 +03:00
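A minimal sketch of the Stage 0 guard, assuming simplified stand-ins for the request and error types. `Part`, `guard_tp_request`, and the single-field `VisionUnsupported` variant are hypothetical; only the names `request_has_images` and `VisionUnsupported` come from the commit message.

```rust
#[derive(Debug, PartialEq)]
enum InferenceError {
    VisionUnsupported { model_id: String },
}

// Hypothetical, simplified view of a chat message's content parts.
enum Part {
    Text(String),
    ImageUrl(String),
}

/// Cheap detection: true as soon as any part carries an image.
fn request_has_images(parts: &[Part]) -> bool {
    parts.iter().any(|p| matches!(p, Part::ImageUrl(_)))
}

/// Unconditional guard for a path that has no vision tower: reject before
/// prefill (and before any SSE frame) rather than answering from text alone.
fn guard_tp_request(model_id: &str, parts: &[Part]) -> Result<(), InferenceError> {
    if request_has_images(parts) {
        return Err(InferenceError::VisionUnsupported { model_id: model_id.to_string() });
    }
    Ok(())
}
```

Running the guard first means the caller can map the error to a 400 response without having allocated anything for the request.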
rob thijssen	dd592d918d	test(neuron): C2 — guard Responses→chat image translation contract The Responses request translator already emits the chat `image_url` Parts array Stage B5's vision path consumes, and the non-streaming (`chat_completion`) and streaming (`responses_stream` → `inference_stream`, Stage C1) Responses paths both route image content to the vision-aware prefill — so vision works end-to-end through `/v1/responses` with no translator change required. 
Add a multi-image test asserting order preservation and that the `detail` hint is tolerated (and dropped, since chat image_url has no analogue), locking the translator's output to the exact `image_url.url` shape `extract_images_from_request` walks. Closes part of #16 (Stage C2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 13:57:43 +03:00
rob thijssen	766c20ba47	feat(neuron): C1 — streaming SSE chat completion with vision The streaming worker path now splices image embeddings on prefill, closing the silent text-only degrade for `stream=true` image requests. `inference_stream` gains the same vision-routing block as the non-streaming `chat_completion`: detect `image_url` content, reject it against text-only models with `VisionUnsupported` (before any SSE frame is sent), preprocess each image and expand its `<\|image_pad\|>` sentinel to the per-image patch count, then carry the payload through dispatch. Rather than duplicate the 75-line `route_token!` reasoning/tool-call state machine into a sibling streamer, `stream_inference_via_worker` takes an `Option<(Vec<ImageInput>, u32)>`: when `Some`, prefill is a single-shot `forward_logits_with_images` splice; when `None`, the original chunked text-only prefill. Image embeddings are prefill-only, so every decode step stays on the plain `forward_logits` path and the shared decode loop is untouched. This keeps exactly one copy of the tool-call/reasoning logic to maintain. The Responses API streaming path (`responses_stream`) inherits vision for free since it drives the same `inference_stream`. Unit test covers `request_has_images` (the shared routing gate); the real-weights SSE smoke is the manual curl on beast (cuda-integration). Closes part of #16 (Stage C1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 13:57:02 +03:00
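The `Option<(Vec<ImageInput>, u32)>` switch described in this commit can be illustrated as follows. This is a sketch only: `Prefill`, `plan_prefill`, the `chunk` parameter, and the `ImageInput` field layout are assumptions made for the example, not the real worker code.

```rust
#[derive(Clone, Debug, PartialEq)]
struct ImageInput {
    pixels: Vec<f32>,
    c: usize,
    h: usize,
    w: usize,
}

#[derive(Debug, PartialEq)]
enum Prefill {
    /// One forward pass that splices image embeddings into the prompt.
    SingleShotWithImages { images: usize, image_token_id: u32 },
    /// The original text-only path, split into fixed-size chunks.
    ChunkedText { chunks: usize },
}

/// `Some` selects the single-shot vision prefill, `None` the chunked text prefill.
/// Decode steps are identical in both cases, so only prefill branches.
fn plan_prefill(
    prompt_len: usize,
    chunk: usize,
    vision: Option<&(Vec<ImageInput>, u32)>,
) -> Prefill {
    match vision {
        Some((images, image_token_id)) => Prefill::SingleShotWithImages {
            images: images.len(),
            image_token_id: *image_token_id,
        },
        None => Prefill::ChunkedText { chunks: prompt_len.div_ceil(chunk.max(1)) },
    }
}
```

Keeping the branch confined to prefill is what lets a single decode loop, and a single copy of the tool-call and reasoning state machine, serve both kinds of request.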
rob thijssen	4972c7d1e7	feat(cortex-gateway): C3 — propagate vision capabilities through /v1/models ModelEntry and CortexModelEntry gain a `capabilities: Vec<String>` field (serde-default for back-compat). The poller copies it verbatim from each neuron's ModelInfo.capabilities; list_models computes the union across every node where a model is loaded so a checkpoint loaded text-only on one neuron and text+vision on another reports both to the fleet. Catalogue-only and mid-prewarm entries default to empty until the catalogue gains a capabilities declaration. Aliases inherit their target's capability union. New gateway test mocks two nodes with differing capability arrays and asserts the unioned /v1/models response. Closes part of #16 (Stage C3). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 13:49:54 +03:00
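The capability union that `list_models` computes can be sketched as below, assuming each node reports a list of `(model id, capabilities)` pairs. `union_capabilities` and the input shape are hypothetical; the real `ModelEntry` types are not shown in the log.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Merge per-node capability lists into one sorted, de-duplicated list per model.
fn union_capabilities(nodes: &[Vec<(&str, Vec<&str>)>]) -> BTreeMap<String, Vec<String>> {
    let mut merged: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for node in nodes {
        for (model, caps) in node {
            let entry = merged.entry(model.to_string()).or_default();
            entry.extend(caps.iter().map(|c| c.to_string()));
        }
    }
    merged
        .into_iter()
        .map(|(model, caps)| (model, caps.into_iter().collect()))
        .collect()
}
```

A `BTreeSet` is used here so the output order is stable across polls; whether the gateway sorts its output is not stated in the commit.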
rob thijssen	577781de8d	fix(neuron): derive Clone on ImageInput for the CUDA vision dispatch CUDA type-check in CI failed on commit `24968e9` with E0308: error[E0308]: mismatched types --> crates/neuron/src/harness/candle.rs:1707:33 1707 | images.clone(), | ^^^^^^^^^^^^^^ expected `Vec<ImageInput>`, found `&Vec<ImageInput>` In Stage B5 the cuda branch of `chat_completion` matches `&vision_route` to keep the `vision_route: Option<...>` alive for both arms, which makes `images` bind as `&Vec<ImageInput>`. 
The subsequent `images.clone()` call doesn't deep-clone because `ImageInput` doesn't derive `Clone` — rustc falls back to cloning the `&Vec` reference, which has the wrong type for the worker job. The CPU build (non-cuda) compiled fine because that branch is behind `#[cfg(feature = "cuda")]`; the cuda-check job is what catches the regression. Fix: derive `Clone` on `ImageInput`. The clone cost is one pixel-buffer memcpy per image (~2.3 MiB of f32 at fixed 448×448), which is fine on the chat-completion hot path — vision requests are rare relative to text-only decode steps. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 15:51:57 +03:00
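The E0308 in this commit follows from how method resolution treats `.clone()` on a reference. A minimal reproduction of the fixed shape, with a hypothetical `submit` standing in for the worker job and a simplified `ImageInput`:

```rust
// With `Clone` derived, `Vec<ImageInput>: Clone`, so `.clone()` on a
// `&Vec<ImageInput>` deep-clones and yields an owned `Vec<ImageInput>`.
// Without the derive, the only applicable impl is `Clone for &T`, so the
// same call copies the reference and yields `&Vec<ImageInput>` (E0308).
#[derive(Clone, Debug, PartialEq)]
struct ImageInput {
    pixels: Vec<f32>,
    h: usize,
    w: usize,
}

/// Hypothetical stand-in for handing an owned payload to the worker.
fn submit(images: Vec<ImageInput>) -> usize {
    images.len()
}

fn dispatch(route: &Option<(Vec<ImageInput>, u32)>) -> usize {
    // Matching on a reference binds `images` as `&Vec<ImageInput>`.
    match route {
        Some((images, _image_token_id)) => submit(images.clone()),
        None => 0,
    }
}
```

Removing the `Clone` derive from `ImageInput` in this sketch should reproduce the same "expected `Vec<ImageInput>`, found `&Vec<ImageInput>`" diagnostic at the `submit` call.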
rob thijssen	24968e9233	feat(neuron): Stage B — end-to-end text+image chat for Qwen3.6 Stage B of the vision plan (doc/vision-qwen3_6-spec.md). Wires the vision tower from Stage A through to a complete non-streaming chat completion: extract images from the request, preprocess, encode on the worker thread, splice embeddings into the LM input at `<|image_pad|>` positions, return coherent text response with `prompt_tokens` reflecting patch tokens. Closes the silent-drop class of failures from issue #3 — vision requests against Qwen3.6 now condition the model on the image instead of producing confident text-only hallucinations. Streaming for vision is Stage C. 
Deferred items tracked under #12 (TP-vision), #13 (27B production), #14 (dynamic resolution), #15 (numerical validation). What landed: - B1 — `Qwen3_5Model::forward_with_vision`: text-only `forward` unchanged; new method takes `(input_ids, offset, image_embeds, image_token_id)`, embeds tokens, locates `image_token_id` positions, splices via the new `splice_runs` helper. MRoPE applies text-positions to image tokens for Stage B (spatial MRoPE is the issue #15 numerical-validation follow-up). 2 unit tests for `splice_runs` covering contiguous + non-contiguous runs. - B2 — `ModelArch::forward_with_vision` dispatch: routes Qwen3_5Dense to the new method; other arches return an error. Defence-in-depth — the HTTP layer (B6) already rejects image content for non-vision models. - B3 — `Job::ForwardLogitsWithImages`: new worker variant carrying tokens + per-image `(pixels, c, h, w)` payloads. The dispatcher encodes each image (device-resident), concatenates the resulting embeddings, calls `arch.forward_with_vision`, and returns CPU logits. Image embeddings never copy back to CPU — the "tensors don't escape the worker" invariant from the per-device worker refactor still holds. Poisoned-worker drain path handles the new variant. - B4 — Prompt builder: - `request_has_images` detects image content cheaply. - `extract_images_from_request(request, profile)` walks `MessageContent::Parts`, decodes data URIs, runs `harness::preprocess::preprocess` per image, returns `Vec<ImageInput>` in request order. - `expand_image_pad_tokens(input_ids, image_token_id, patches_per_image)` walks the tokenized prompt and replaces each `<\|image_pad\|>` (id 248056 for Qwen3.6) with N copies matching the per-image patch count. 4 unit tests. - `VisionMeta::from_config_path` peeks `config.json` at load time for `image_token_id`, vision_config patch/merge sizes, and derives `lm_tokens_per_image` for the Stage B fixed resolution. 
- B5 — `chat_completion` vision routing: detects image content, validates the loaded model has vision, expands the prompt, and calls a new `run_inference_with_images_via_worker` helper that does single-shot prefill + standard decode loop (KV cache holds the post-splice hidden states from prefill, so decode steps don't re-splice). Stage B skips chunked prefill for vision — at 448×448 fixed resolution the budget stays well under the activation-memory threshold. Long-vision chunking is Stage D follow-up. - B6 — `InferenceError::VisionUnsupported`: structured 400 with `code=vision_unsupported, model_id, suggestion` when an image request hits a non-vision model. Closes the agent0 failure mode where vision requests degraded silently. - B7 — `ModelInfo.capabilities`: per-model array (`["text"]` vs `["text", "vision"]`) in `/v1/models` and forwarded verbatim by cortex-gateway. Lets clients (litellm, agent0) gate image_url submission on the declared capability set. Optional in the wire format; defaults to empty for older clients. CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (all 28 test groups ok, 124 lib tests). New unit-test counts: +2 splice_runs, +4 expand_image_pad. Manual verification (after RPMs deploy on beast): curl http://hanzalova.internal:31313/v1/chat/completions \ -H 'Content-Type: application/json' \ -d "{\"model\":\"Qwen/Qwen3.6-27B\", \"messages\":[{\"role\":\"user\",\"content\":[ {\"type\":\"text\",\"text\":\"What's in this image?\"}, {\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64,...\"}} ]}], \"max_tokens\":120}" \| jq Expect prompt_tokens > 196 (text + 196 patch tokens) and a response that references actual image content. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 15:33:00 +03:00
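The 196-token figure and the sentinel expansion from B4 can be worked through in a short sketch. The function names follow the commit message, but the bodies, return types, and error strings are assumptions; the real helpers may differ.

```rust
/// LM tokens produced by one square image: patches per side, then the
/// spatial merge collapses each `merge x merge` group into one token.
fn lm_tokens_per_image(side: usize, patch: usize, merge: usize) -> usize {
    let per_side = side / patch; // 448 / 16 = 28 patches per side
    (per_side / merge) * (per_side / merge) // 14 * 14 = 196 after the 2x2 merge
}

/// Replace the i-th image sentinel with `patches_per_image[i]` copies of itself,
/// so the token count matches the number of embedding rows to splice in.
fn expand_image_pad_tokens(
    input_ids: &[u32],
    image_token_id: u32,
    patches_per_image: &[usize],
) -> Result<Vec<u32>, String> {
    let mut counts = patches_per_image.iter();
    let mut out = Vec::with_capacity(input_ids.len());
    for &tok in input_ids {
        if tok == image_token_id {
            let n = *counts.next().ok_or("more sentinels than images")?;
            out.extend(std::iter::repeat(image_token_id).take(n));
        } else {
            out.push(tok);
        }
    }
    if counts.next().is_some() {
        return Err("more images than sentinels".into());
    }
    Ok(out)
}
```

With one 448×448 image, a prompt of T text tokens plus one sentinel grows to T + 196 tokens, which is why the manual check expects `prompt_tokens` above 196.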
rob thijssen	7df84fed8f	feat(neuron): Stage A — vision tower load + preprocessor for Qwen3.6 Stage A of the vision implementation plan (doc/vision-qwen3_6-spec.md). Builds the vision tower scaffolding that today's silent-drop failure mode (issue #3) needs — the Qwen3.6 ViT loads from `model.visual.`, runs forward producing post-merger LM-side image embeddings, and routes through the device worker via a new `Job::EncodeImage`. No LM splice yet — that's Stage B. Refs #3 (umbrella). Deferred sub-stages tracked as #12 (TP-vision), #13 (27B production deploy), #14 (dynamic resolution), #15 (numerical validation). 
What landed: - A0 — investigation: pulled config.json, preprocessor_config.json, chat_template.jinja, and safetensors index from beast's local Qwen3.6-27B cache. Documented in doc/vision-qwen3_6-spec.md with exact tensor shapes for every `model.visual.` weight. Confirms 27-block ViT with `hidden_size=1152`, `patch_size=16`, `spatial_merge_size=2`, `out_hidden_size=5120`. Vision tower lives in 2 of the 15 safetensors shards. - A1 — deps + scaffolding: added `image = "0.25"` (default-features off, PNG/JPEG/WebP/BMP/GIF) and `base64 = "0.22"` to crates/neuron/Cargo.toml. Created `harness::preprocess` and `harness::arch::qwen3_5::vision` modules. - A2 — preprocess.rs: `decode_data_uri` strips `data:image/...;base64,...` → image bytes → `image::DynamicImage` (rejecting `http(s)://` URLs to avoid SSRF/recursion); `preprocess` resizes to a fixed `PreprocessProfile::qwen3_6()` (448×448), normalises to `[-1, 1]` per the model's mean/std=0.5, emits row-major `(3, H, W)` f32. 9 unit tests covering data URI parse, decode failure paths, grayscale-to-RGB promotion, and the exact-value normalisation contract. - A3 — vision.rs: `VisionTower` struct with `patch_embed: Conv2d`, learned `pos_embed: Embedding`, 27 `VisionBlock`s (pre-LN + multi-head self-attention with fused QKV + GELU-tanh MLP + residuals), and `VisionMerger` (LayerNorm → 2×2 spatial concat → linear_fc1 → GELU-tanh → linear_fc2 to LM hidden_size). Includes the Conv3d→Conv2d fold trick documented at the top of the file — the published patch_embed.proj.weight is 5D `(1152, 3, 2, 16, 16)` but candle 0.10 has no Conv3d; for static images we sum-collapse the temporal axis. Video would need real Conv3d. 5 unit tests including the exact `gelu_pytorch_tanh` reference values from PyTorch. 
- A4 — wire vision into Qwen3_5ForCausalLM: extended `Config` with optional `vision_config: Option<VisionConfig>` and `image_token_id`; `Qwen3_5ForCausalLM::new` now loads the vision tower when present, exposes `has_vision()` and `vision()` so the HTTP layer can advertise capability and so the encode path can reach it. - A5 — device worker `Job::EncodeImage`: new job variant carrying CPU-side `(C, H, W)` pixels. Dispatch handler reconstructs the tensor on the worker's device, calls `arch.encode_image(image)`, copies the result back to CPU as flat `Vec<f32>`. Keeps the "tensors don't escape the worker" invariant. Poisoned-worker drain path handles the new variant. - A6 — dispatch round-trip test: `encode_image_routes_to_dispatch_and_errors_on_unknown_handle` proves the channel/dispatch wiring works end-to-end via the CPU device worker (errors on unknown ArchHandle, which is the expected behaviour without a loaded model — real-weights validation happens in Stage B when the LM splice path exists). CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (all 28 test groups ok, zero failures). New test counts: +9 in preprocess, +5 in vision, +1 in device_worker. Out of scope (deferred): - LM-side splice of image embeddings at `<|image_pad|>` positions → Stage B. - Streaming SSE for vision-bearing chat completions → Stage C. - Reject `image_url` with HTTP 400 for non-vision models / advertise `capabilities` in /v1/models → Stage C. - TP-vision (#12), 27B production deploy (#13), dynamic resolution (#14), numerical validation (#15). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 11:40:47 +03:00
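The A2 normalisation contract (mean = std = 0.5, output in `[-1, 1]`, planar `(3, H, W)` f32) can be sketched without the `image` crate. The sketch assumes interleaved RGB8 input that has already been resized; `normalise_rgb8` is a hypothetical name, not the real `preprocess` function.

```rust
/// Convert interleaved RGB8 (`h * w * 3` bytes) to planar `(3, H, W)` f32,
/// applying `(v / 255 - mean) / std` with mean = std = 0.5.
fn normalise_rgb8(rgb: &[u8], h: usize, w: usize) -> Vec<f32> {
    assert_eq!(rgb.len(), h * w * 3, "expected interleaved RGB8");
    let mut out = vec![0.0f32; 3 * h * w];
    for (i, px) in rgb.chunks_exact(3).enumerate() {
        for (c, &v) in px.iter().enumerate() {
            // Channel-major layout: plane `c`, then the row-major pixel index.
            out[c * h * w + i] = (v as f32 / 255.0 - 0.5) / 0.5;
        }
    }
    out
}
```

Under this mapping a byte of 0 becomes -1.0 and 255 becomes 1.0, which gives an exact-value check of the kind the commit's unit tests describe.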
rob thijssen	d0292ed377	feat(cortex): catalogue source field + scheme-qualified /models/load Phase 3 of plan-source-aware-loader-preflight. Adds an optional `source` field to `ModelProfile` and threads it through the router's cold-load path so a profile pointing at the helexa registry forwards `helexa:<id>` to neuron's `/models/load` instead of leaving neuron to substitute its `default_source` (typically `huggingface`). Without this, an operator who declares `source = "helexa"` in models.toml would still see neuron fetch from HuggingFace — the catalogue → ModelSpec translation in `profile_to_spec` was dropping the scheme on the floor. 
What lands: - `cortex-core::catalogue::ModelProfile.source: Option<String>`. None is the default and preserves pre-Phase-3 behaviour. - `cortex-gateway::router::qualified_model_id(profile)` — small pure helper, extracted from `profile_to_spec` so it can be unit-tested. Empty-string `source` is treated as None so operators who blank out a previously-set value don't trip a scheme-with-no-scheme failure mode in neuron. - `models.example.toml` documents the new field with a commented-out helexa-scheme example pointing back at neuron.example.toml's matching sources block. Tests: - 2 new unit tests in `cortex-core::catalogue`: source-absent round-trip and source-present round-trip through TOML. - 3 new unit tests in `cortex-gateway::router`: pass-through when None, prefix when Some, pass-through on empty-string source. - ModelProfile literal in catalogue's existing test updated to carry `source: None`. CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (24 test groups ok, zero failures). Completes Phase 3. With Phases 1+2+3 landed: - neuron parses `scheme:org/name`, routes per-source hf-hub Api with disambiguated cache. - preflight returns structured errors before any device allocation. - cortex catalogue declares per-model source jurisdiction and forwards it to neuron. The registry itself (registry.helexa.ai service, MinIO, nginx, mirror fabric) is the next moving piece — landing under a separate project per the design discussion. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 14:53:58 +03:00
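The three behaviours tested for `qualified_model_id` (pass-through on `None`, prefix on `Some`, pass-through on an empty string) fit in a few lines. This is a sketch under assumptions: the `id` field name and the `String` return type are hypothetical, and the real `ModelProfile` has more fields.

```rust
struct ModelProfile {
    id: String,             // hypothetical field name for the bare `org/name` id
    source: Option<String>, // e.g. Some("helexa"); None keeps the old behaviour
}

/// Prefix the id with `scheme:` only when a non-empty source is declared.
fn qualified_model_id(profile: &ModelProfile) -> String {
    match profile.source.as_deref() {
        Some(scheme) if !scheme.is_empty() => format!("{scheme}:{}", profile.id),
        // None and Some("") both pass the bare id through unchanged.
        _ => profile.id.clone(),
    }
}
```

Treating `Some("")` like `None` avoids emitting an id such as `:org/name`, which a scheme-aware parser would have to reject.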
rob thijssen	d4e1b05956	feat(neuron,cortex-core): source-aware loader (scheme:org/name) Phase 1 of plan-source-aware-loader-preflight. Makes neuron's loader treat `huggingface:org/name` and `helexa:org/name` as first-class distinct sources with per-source endpoint + cache, while staying backwards-compatible with bare `org/name` ids. Zero behavior change for existing operator configs. Motivation: helexa is adding an EU-hosted registry (`registry.helexa.ai`) alongside HF. Both speak HF-compatible wire format, but the bytes, jurisdiction, trust root, and cache namespace are distinct. 
The loader needs to disambiguate which registry serves a given model id, and to keep their caches from colliding on disk when both happen to host the same `org/name`. What lands: - `cortex-core::source` — new module. `ModelSourceId { scheme, org, name }` with `FromStr` accepting both `scheme:org/name` and bare `org/name`. `Display` round-trips. `repo_path()` emits the `org/name` half for the hf-hub `Api::model(...)` call regardless of which scheme/endpoint we're hitting. Rejects malformed input with typed `ParseError` variants (empty scheme, missing slash, scheme with `/`, name with `:`, etc.). - `neuron::config::CandleHarnessConfig` gains `default_source: Option<String>` and `sources: HashMap<String, SourceConfig>`. `SourceConfig` mirrors what `hf_hub::ApiBuilder` consumes: endpoint URL, optional `auth_env` (env var name read at startup so secrets stay out of TOML), and optional cache_dir. Defaults synthesise a `huggingface` entry pointing at `https://huggingface.co` with the legacy `hf_cache` field as its cache_dir — so existing configs that only set `hf_cache` keep working unchanged. - `CandleHarness::new(bind_url, &CandleHarnessConfig)` replaces `CandleHarness::new(bind_url, hf_cache)`. Resolves every configured source's auth env var and cache dir up front so `hf_api_for(scheme)` is a pure HashMap lookup on the hot load path. Only the `huggingface` scheme gets the legacy `HF_HUB_CACHE`/`HF_HOME` env-var fallback chain; other schemes resolve to whatever the operator typed. - `hf_api()` -> `hf_api_for(scheme)`. Builds an `hf_hub::Api` with the source's endpoint, cache_dir, and auth token. Errors with a useful message naming the configured schemes when an unknown scheme is requested. - `CandleHarness::load_model` parses `spec.model_id` into a `ModelSourceId`, substitutes `default_source` for bare ids, and threads the parsed source through `preflight`, `resolve_files`, `resolve_dense_files`, `load_arch_gguf`, `load_arch_dense`, and `load_tp`. 
The hf-hub `Api::model()` call now uses `source_id.repo_path()` so registry calls hit the right URL shape regardless of scheme. - `preflight()` signature gains a `&ModelSourceId` parameter (it's the canonical id for log lines and error display); `RepoFetchFailed.model_id` etc. now carry the scheme-qualified form so operator-visible errors echo exactly what was configured. - `neuron.example.toml` documents the new `[harness.candle.sources.*]` table with commented-out examples for `huggingface` (explicit override) and `helexa`. Tests: - 13 new unit tests in `cortex-core::source` covering parse / display round-trip, default-scheme substitution semantics, and every `ParseError` variant. - 6 new unit tests in `neuron::config` covering the `effective_sources` synth (legacy `hf_cache` carry-through, explicit override preservation, helexa-alongside-huggingface) and `effective_default_source` fallback. - 2 new unit tests in `harness::candle::tests` covering multi-scheme `hf_api_for` routing, including the "unknown scheme" error path naming configured schemes. - Preflight integration tests updated to construct `ModelSourceId` and assert against the scheme-qualified error form. CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (all 24 test groups ok, zero failures). Out of scope (Phase 3): - Cortex catalogue `source` field — independent of Phase 1+2, ships when the registry comes online. - `helexa` source endpoint itself — separate project; this PR adds the client-side rails only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 13:42:11 +03:00
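A sketch of the `scheme:org/name` parsing described for `cortex-core::source`. Only the type name, the `repo_path()` method, and the kinds of rejected input come from the commit message; modelling `scheme` as `Option<String>`, the exact `ParseError` variant names, and the order of the checks are assumptions.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, PartialEq)]
struct ModelSourceId {
    scheme: Option<String>, // None for a bare `org/name`
    org: String,
    name: String,
}

#[derive(Debug, PartialEq)]
enum ParseError {
    EmptyScheme,
    SchemeHasSlash,
    MissingSlash,
    EmptyPart,
    NameHasColon,
}

impl FromStr for ModelSourceId {
    type Err = ParseError;

    fn from_str(s: &str) -> Result<Self, ParseError> {
        let (scheme, rest) = match s.split_once(':') {
            Some(("", _)) => return Err(ParseError::EmptyScheme),
            Some((sch, _)) if sch.contains('/') => return Err(ParseError::SchemeHasSlash),
            Some((sch, rest)) => (Some(sch.to_string()), rest),
            None => (None, s),
        };
        let (org, name) = rest.split_once('/').ok_or(ParseError::MissingSlash)?;
        if org.is_empty() || name.is_empty() {
            return Err(ParseError::EmptyPart);
        }
        if name.contains(':') {
            return Err(ParseError::NameHasColon);
        }
        Ok(Self { scheme, org: org.to_string(), name: name.to_string() })
    }
}

impl ModelSourceId {
    /// The `org/name` half, independent of which scheme serves it.
    fn repo_path(&self) -> String {
        format!("{}/{}", self.org, self.name)
    }
}

impl fmt::Display for ModelSourceId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if let Some(scheme) = &self.scheme {
            write!(f, "{scheme}:")?;
        }
        write!(f, "{}/{}", self.org, self.name)
    }
}
```

In this sketch, substituting `default_source` for a bare id is left to the caller, who would fill in `scheme` when it is `None`.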
rob thijssen	61adff347a	feat(neuron): preflight placement check with structured errors Phase 2 of plan-source-aware-loader-preflight. Adds a one-RTT placement feasibility check that runs before any device allocation, NCCL handshake, or weight fetch. Replaces today's opaque "fetch config.json … 404" failure mode (when an operator points `tensor_parallel = 2` at a GGUF-only repo) with a structured error that names the failure class and points at the fix. What lands: - `crates/neuron/src/harness/preflight.rs` — new module. 
Classifies a repo's siblings listing into `SourceFormat` (Gguf \| DenseSafetensors \| Mixed \| Empty), applies the tp/quant feasibility table, returns a `PlacementPlan` on success or a typed `PreflightError` on rejection. `PreflightError` is `serde::Serialize` so the HTTP layer can emit the structured shape verbatim; it's `thiserror::Error` so log lines get a single-line Display when downcasting from anyhow. Includes best-effort Levenshtein-nearest suggestion for malformed quant names (the second sharp edge the HauhauCS scenario surfaced — operator writes `q6k` against filenames containing `Q6_K_P`, and today's matcher just says "no GGUF file matching quant"). - `CandleHarness::load_model` — calls `preflight(...)` first thing after the "already loaded" guard, before any `ensure_device_worker` or `resolve_*`. Failure wraps the typed error in `anyhow::Error` so the existing trait surface is unchanged; the HTTP handler and the startup logger downcast to recover the structured form. - `crates/neuron/src/api.rs::load_model` handler — maps `PreflightError` to 422 Unprocessable Entity with `{"error": {"kind": "...", "model_id": "...", "suggestion": "..." }}`. Other failures keep the existing 400 + free-form `format!("{e:#}")` shape. - `crates/neuron/src/startup.rs::load_default_models` — when the failure is a preflight rejection, log as `reason=<kind> detail=<msg>` instead of the opaque `error=<chain>`, so journalctl on beast will now show `reason=tp_requires_safetensors detail="repo is GGUF-only (8 .gguf files); TP requires dense safetensors..."` instead of `error=fetch config.json from HauhauCS/...: 404 Not Found`. Tests: - 18 unit tests in `harness/preflight.rs` covering classifier, quant matching, Levenshtein, error serialization, and the full feasibility table (gguf+tp rejected, gguf+bad-quant suggests nearest, gguf+good-quant ok, dense+tp ok, empty rejected, mixed prefers safetensors). 
- 7 integration tests in `tests/preflight.rs` exercising the network path through an axum mock that serves hf-hub-compatible `/api/models/{org}/{name}/revision/main` payloads. Adds `tempfile` as a dev-dependency for per-test cache dirs. Out of scope (deferred to subsequent phases): - Phase 1 (source-aware loader plumbing — `scheme:org/name` parsing, per-scheme `SourceConfig`, cache disambiguation). Preflight runs against the single configured HuggingFace source today; the scheme threading lands cleanly when Phase 1 ships. - Phase 3 (cortex catalogue source field). - GGUF tensor-parallel loading. Preflight rejects this combination with `TpRequiresSafetensors`; the underlying loader gap is the separate `Helexa` curated-registry / heretic-rs conversation. Refs #4-#9 architectural follow-up; no specific issue closed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 13:24:30 +03:00
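The "Levenshtein-nearest suggestion" for a mistyped quant name can be sketched as below. The edit-distance routine is the standard two-row dynamic programme; comparing case-insensitively and returning the first minimum on ties are assumptions, and `nearest_quant` is a hypothetical name.

```rust
/// Classic Levenshtein distance using two rolling rows.
fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != *cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Best-effort suggestion: the available quant label closest to what was typed.
fn nearest_quant<'a>(wanted: &str, available: &[&'a str]) -> Option<&'a str> {
    let wanted = wanted.to_ascii_lowercase();
    available
        .iter()
        .copied()
        .min_by_key(|cand| levenshtein(&wanted, &cand.to_ascii_lowercase()))
}
```

For the scenario in the commit message, `q6k` is three insertions away from `Q6_K_P` once case is ignored, so that label is what this sketch would suggest among typical K-quant names.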
rob thijssen	435fd10902	fix(neuron): macro-ify CUDA single-GPU route_token so DecodeStream type stays inferred Prerelease build (run 270) failed on commit `cb30383` with: error[E0107]: struct takes 5 generic arguments but 0 generic arguments were supplied --> crates/neuron/src/harness/candle.rs:3554:41 \| 3554 \| decode_stream: &mut tokenizers::DecodeStream<'_>, \| ^^^^^^^^^^^^ The Step-2-era refactor for #6's tool-call extraction added a nested `async fn route_token` inside `stream_inference_via_worker` that named `tokenizers::DecodeStream<'_>` as a parameter type. 
`DecodeStream` actually takes a lifetime plus five generic type parameters (`'tok, M, N, PT, PP, D`), which makes naming it explicitly painful; the working approach the CPU path uses is a macro, where the body expands inline at the call site and the decoder type stays inferred. This commit replicates the CPU-side macro for the CUDA worker path. Same shape, just with `.await` calls inside (macros tolerate that since they expand inline into the enclosing async context). Control flow uses a labelled-block + `consumer_alive` flag rather than `return` so the macro stays generic over the surrounding return type. The CPU build (default-feature workspace, what `clippy` and `test` jobs exercise) doesn't compile this `#[cfg(feature = "cuda")]` branch, which is why local CI green-lit it. The cuda-check job should catch this category of breakage now that #cb30383+CI-fix landed; this commit just resolves the actual breakage on the prerelease workflow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 08:59:56 +03:00
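The macro-plus-labelled-block shape described in the commit above can be illustrated without `tokenizers` or async. In this sketch a closure stands in for the hard-to-name decoder type, and the `route_token!` arguments, the `drain` function, and the capacity check are hypothetical stand-ins for the real worker logic.

```rust
/// Expands inline at the call site, so the decoder's concrete type is never
/// written out. A labelled block plus a flag replaces `return`, keeping the
/// macro independent of the enclosing function's return type.
macro_rules! route_token {
    ($decoder:expr, $out:expr, $capacity:expr, $token:expr, $consumer_alive:ident) => {
        'route: {
            // Decoder produced nothing for this token: nothing to emit.
            let Some(text) = ($decoder)($token) else {
                break 'route;
            };
            // Stand-in for "the receiving side went away".
            if $out.len() >= $capacity {
                $consumer_alive = false;
                break 'route;
            }
            $out.push(text);
        }
    };
}

/// Hypothetical driver: routes tokens until the consumer stops accepting them.
fn drain(tokens: &[u32], capacity: usize) -> (Vec<String>, bool) {
    // A closure has an unnameable type, much like a deeply generic struct.
    let decoder = |t: u32| if t == 0 { None } else { Some(format!("tok{t}")) };
    let mut out = Vec::new();
    let mut consumer_alive = true;
    for &t in tokens {
        route_token!(decoder, out, capacity, t, consumer_alive);
        if !consumer_alive {
            break;
        }
    }
    (out, consumer_alive)
}
```

Because the body is pasted into the caller, the same macro would work unchanged inside a function returning `Result<_, _>` or inside an `async` block, which is the property the commit relies on.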
rob thijssen	cb303832bc	feat(neuron): render the model's chat_template with chat_template_kwargs Closes #9. Replaces the hardcoded `format_qwen3_prompt` ChatML glue with `minijinja`-driven rendering of the model's own `chat_template` from `tokenizer_config.json`. The request's `chat_template_kwargs` flow into the Jinja context so model-specific levers (Qwen3's `enable_thinking: false`, etc.) actually take effect. ## Implementation - New `harness::chat_template` module with three entry points: - `load_chat_template_alongside(tokenizer_json_path)` — probes `tokenizer_config.json` in the same hf-hub snapshot directory. 
Supports both the canonical string-form `chat_template` and the array-form some tokenizers ship (multi-template models). - `render_chat_template(template, messages, tools, kwargs)` — renders via `minijinja`. Messages flatten into the `[{role, content}]` shape HF templates iterate, with per-message extras (`tool_calls`, `tool_call_id`) preserved. `tools` and `kwargs` add into the Jinja context so templates that reference them work without us interpreting their shape. - `chat_templates_enabled()` reads `NEURON_USE_CHAT_TEMPLATE` (default true). Falsy values force the fallback path everywhere — a kill switch for emergency rollback without a rebuild. - `LoadedModel.chat_template: Option<String>` and the TP equivalent are populated once at load time. `None` (no tokenizer_config.json, parse error, missing field) routes the fallback path silently; logs go through `tracing::debug`/`warn` per condition. - New `build_prompt_for_request(chat_template, request)` wraps the decision: when both the template is present AND the kill switch is off, render with kwargs from `request.extra` (looks up `chat_template_kwargs` and `tools` lazily). On render error → warn + fallback to `format_qwen3_prompt`. Wired into all four current prompt-build sites (single-GPU stream + non-stream, TP stream + non-stream). ## Dependency `minijinja = "2"` with the `builtins`, `json`, and `serde` features. Pure-Rust Jinja2 implementation, ~80KB compiled. Used internally by HF's `tokenizers-rs` for its own chat templating; the API surface we touch (`Environment::add_template` + `Template::render(serde_value)`) is stable. ## Validation strategy I can't byte-compare the new path's output against `format_qwen3_prompt` for live models without GPU (CI doesn't have one). The fallback path and kill switch are the mitigations — a deploy can flip `NEURON_USE_CHAT_TEMPLATE=false` in the neuron service env if the chat template renders surprisingly on Qwen3-8B in production. 
The legacy formatter stays the fail-closed default. ## Scope cuts (documented in module header) - Tool-definition lifting from helexa-acp's system-prompt injection into the chat_template's native tools block is deferred. Today the request's `tools` array threads into the Jinja context, but helexa-acp continues to inject Hermes-format tool descriptions into the system prompt for backwards-compat with non-cortex endpoints. ## Tests 9 unit tests in `chat_template`: kill-switch matrix (truthy / falsy / unset), template loading (string form, array form, missing file, unparseable JSON, missing field), rendering (basic conversation threading, kwargs forwarding, message-extras threading for tool_calls). 215 workspace tests pass; clippy + fmt clean across all workspace features (default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 23:43:11 +03:00
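The kill switch from the chat-template commit above amounts to an environment flag that defaults to on. A minimal sketch follows; the commit does not list which strings count as falsy, so the set used here (`0`, `false`, `no`, `off`) and the `flag_enabled` helper are assumptions.

```rust
/// Hypothetical parser: unset means enabled; a small set of falsy strings
/// (assumed, not taken from the source) disables the feature.
fn flag_enabled(raw: Option<&str>) -> bool {
    match raw.map(|s| s.trim().to_ascii_lowercase()) {
        None => true,
        Some(v) => !matches!(v.as_str(), "0" | "false" | "no" | "off"),
    }
}

/// Reads the env var named in the commit message and applies the parser.
fn chat_templates_enabled() -> bool {
    flag_enabled(std::env::var("NEURON_USE_CHAT_TEMPLATE").ok().as_deref())
}
```

Keeping the parsing in a pure function makes the truthy / falsy / unset matrix testable without mutating process environment in tests.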
rob thijssen	44008358c5	feat(neuron): emit response.in_progress between created and output_item.added Refs #7. OpenAI's Responses API spec emits `response.in_progress` between `response.created` and the first output-item event to mark "request validated, model is generating". Some Responses-API clients distinguish loading-spinner vs streaming-spinner UI based on which event arrived last; emitting both keeps the wire shape matched. Carries the same shell as `response.created` (status=in_progress, empty output, no usage yet) — both events are payload-light bookkeeping, distinguished only by the event name. 
The hosted-tool event families remaining in #7 (web_search_call, code_interpreter_call, file_search_call, image_generation_call) stay deferred until the underlying tools exist in neuron. Updated `full_stream_emits_expected_event_sequence` to assert the new event lands in position 1; downstream indexing shifted by one across the existing test assertions. CI green, fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 23:30:34 +03:00
rob thijssen	fc9a8c42a3	feat(neuron): extract `<tool_call>` blocks to structured tool_calls deltas Closes #6. Same model-agnostic seam as #8 but for tool-call markers (`<tool_call>` / `</tool_call>` on Qwen3-Coder, Hermes-format, DeepSeek-Coder, gpt-oss, …). Lets Zed's tool-use feature and any other vanilla OpenAI chat client get structured `tool_calls` deltas out of cortex without having to parse markers themselves. ## Implementation 1. Tokenizer probe at load time (`detect_tool_call_token_pair` in `wire::event`) — same shape as the reasoning-marker probe from #8. Both open AND close must resolve to single token ids; non-tool-use models get `None` and pass through unchanged. 
Stored on `LoadedModel.tool_call_tokens` and the TP analogue. 2. New `InferenceEvent::ToolCall` variant — carries `index` (call slot, per-turn counter), generated `id` (`call_<hex>_<idx>`), `name`, and the complete `arguments` JSON string. One event per parsed call. 3. Token-level state machine in all three streaming paths (CPU `run_inference_streaming`, CUDA single-GPU `stream_inference_via_worker`, CUDA TP `chat_completion_tp_stream`) layered on top of #8's reasoning routing: - `<tool_call>` token → enter buffering state, clear buffer. - Tokens while buffering → accumulate into `tool_call_buf` via the decoder (so multi-byte UTF-8 still buffers correctly) without emitting anything visible. - `</tool_call>` token → take the buffer, parse with `parse_tool_call_body` (extract `name` + `arguments`), emit a structured `ToolCall` event with a fresh `call_<hex>` id and the parsed fields. - On parse failure → fall back to re-emitting the original `<tool_call>{buf}</tool_call>` block as plain text content so helexa-acp's existing `ToolCallParser` repair passes still have a chance to recover the call. 4. OpenAI chat projector emits the OpenAI streaming `tool_calls` delta shape on `InferenceEvent::ToolCall` — `{tool_calls: [{index, id, type:"function", function:{name, arguments}}]}`. One chunk per call slot. 5. OpenAI Responses projector drops `ToolCall` events for now (Responses-side function_call event family routing tracked under #7); the chat path is what unblocks Zed's tool use today. ## Acceptance - Vanilla OpenAI chat clients (Zed's tool-use feature, any other OpenAI-compatible tool-call consumer) get structured tool_calls deltas against cortex+neuron without having to parse `<tool_call>` markers in content. - helexa-acp continues to work — when neuron parses cleanly, it consumes the structured deltas through its existing decoder. 
When the model emits malformed JSON, neuron falls back to text pass-through and helexa-acp's `ToolCallParser` recovers via the same path it always did. - Models without tool-call markers in their tokenizer pass through unchanged. - No hardcoded model knowledge — entirely driven by tokenizer metadata. ## Tests 2 new detection tests in `wire::event` (Qwen3-style marker detection, no-marker case). The streaming paths themselves stay covered by the existing chat-completions integration tests; full end-to-end exercise of the new path requires GPU-loaded models and lives outside the CI test surface. 215 workspace tests pass; clippy + fmt clean across the workspace. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 23:26:31 +03:00
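The token-level buffering described in step 3 of the commit above can be sketched as a small state machine. The `ToolCallRouter` and `Event` types, the token ids, and the `looks_like_json_object` check (a stand-in for real JSON parsing such as the commit's `parse_tool_call_body`) are hypothetical; the sketch carries the raw body rather than a parsed `name` and `arguments`.

```rust
#[derive(Debug, PartialEq)]
enum Event {
    Text(String),
    ToolCall { index: usize, body: String },
}

/// Stand-in validity check; a real implementation would parse JSON.
fn looks_like_json_object(body: &str) -> bool {
    let b = body.trim();
    b.starts_with('{') && b.ends_with('}')
}

struct ToolCallRouter {
    open: u32,
    close: u32,
    buf: Option<String>, // Some(..) while inside a tool-call block
    next_index: usize,   // per-turn call slot counter
}

impl ToolCallRouter {
    fn new(open: u32, close: u32) -> Self {
        Self { open, close, buf: None, next_index: 0 }
    }

    /// Feed one sampled token id together with its decoded text.
    fn push(&mut self, id: u32, text: &str) -> Option<Event> {
        if id == self.open {
            self.buf = Some(String::new()); // enter buffering, clear buffer
            return None;
        }
        if id == self.close {
            if let Some(body) = self.buf.take() {
                if looks_like_json_object(&body) {
                    let index = self.next_index;
                    self.next_index += 1;
                    return Some(Event::ToolCall { index, body });
                }
                // Parse failure: re-emit the original block as plain text.
                return Some(Event::Text(format!("<tool_call>{body}</tool_call>")));
            }
        }
        match self.buf.as_mut() {
            Some(buf) => {
                buf.push_str(text); // accumulate silently while buffering
                None
            }
            None => Some(Event::Text(text.to_string())),
        }
    }
}
```

The text fallback on a malformed body mirrors the commit's intent that a downstream repair parser still sees the original marked-up block.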
rob thijssen	7733eecba5	feat(neuron): strip reasoning from chat completions by default Closes #8. Reasoning-capable models (Qwen3, DeepSeek-R1, gpt-oss, Mistral Magistral, …) emit `<think>...</think>` blocks inline in their content stream. The chat-completions wire format has no slot for reasoning, so until this change every consumer either parsed the markers themselves (helexa-acp) or wrote the raw scratchpad content into their UI (Zed's commit-message generator — visible as the leaked reasoning block on every generated commit message against benjy's Qwen3-8B). ## Implementation, model-agnostic by design The neuron side now does token-level routing without any hardcoded model knowledge: 1. 
At load time (`detect_reasoning_token_pair` in `wire::event`), probe the tokenizer's vocabulary for a known reasoning-marker pair: `<think>` / `</think>` (Qwen3, DeepSeek-R1, gpt-oss), `[THINK]` / `[/THINK]` (Mistral Magistral), and a couple of derivatives. Each marker must resolve to a single token id; if both open and close resolve, stash on `LoadedModel.reasoning_tokens` (similarly `TpLoadedModel`). Non-reasoning models get `None` and pass through unchanged. 2. At inference time, the three streaming paths (`run_inference_streaming` CPU, `stream_inference_via_worker` CUDA single-GPU, `chat_completion_tp_stream` CUDA TP) now check each sampled token against the pair via the new `handle_reasoning_marker` helper before feeding it to the detokeniser. Open marker → set `in_reasoning = true`, drop the marker. Close marker → unset, drop. Other tokens go through `emit_delta(_blocking)` which now picks `ReasoningDelta` or `TextDelta` based on state. Markers never appear in the streamed output. 3. In `wire::openai_chat`, the projector splits into: - `project_chat_stream` (unchanged signature; default behaviour — drops `ReasoningDelta`) - `project_chat_stream_with(rx, …, ChatProjectionConfig)` — when `include_thinking: true` and `reasoning_markers: Some(_)`, re-wraps reasoning content with the literal open/close marker text and emits as content deltas. Preserves the on-the-wire shape that helexa-acp's `ThinkParser` expects. 4. HTTP handler reads `x-include-thinking: true` (case-insensitive `1`/`true`/`yes`) from the request headers and threads it into the projection config. cortex-gateway already forwards arbitrary headers verbatim, so the opt-in works end-to-end without gateway changes. 5. helexa-acp's `openai_chat` provider sets `x-include-thinking: true` on every request so its existing `ThinkParser` keeps receiving the marked content stream. `ThinkParser` itself is unchanged — needed for endpoints that aren't reasoning-aware (OpenRouter, OpenAI directly, etc.). 
## Acceptance - Zed's commit-message generator (vanilla chat-completions client, no `x-include-thinking`) gets clean commit messages with no `<think>` block. - helexa-acp sessions continue to render thinking in Zed's thought UI via the opt-in path. - Models without reasoning tokens declared in their tokenizer pass through unchanged. - Implementation contains zero references to "qwen3" or any specific model — entirely driven by tokenizer metadata. ## Tests 9 new tests in `wire::event` (token-pair detection across 4 marker conventions, edge cases) and `wire::openai_chat` (default drop, opt-in re-wrap with multi-chunk reasoning, close-marker on Finish, fallback when markers absent, off-switch with markers present). All 213 workspace tests pass; fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 17:55:04 +03:00
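Steps 2 to 4 of the reasoning commit above (marker routing, default drop of reasoning, header opt-in) can be sketched together. The `ReasoningRouter`, `Delta`, `visible_content`, and `include_thinking` names and the token ids are hypothetical; only the accepted header values `1`/`true`/`yes` come from the commit message.

```rust
#[derive(Debug, PartialEq)]
enum Delta {
    Reasoning(String),
    Text(String),
}

struct ReasoningRouter {
    open: u32,
    close: u32,
    in_reasoning: bool,
}

impl ReasoningRouter {
    /// Markers toggle state and are dropped; everything else is tagged by state.
    fn route(&mut self, id: u32, text: &str) -> Option<Delta> {
        if id == self.open {
            self.in_reasoning = true;
            return None;
        }
        if id == self.close {
            self.in_reasoning = false;
            return None;
        }
        Some(if self.in_reasoning {
            Delta::Reasoning(text.to_string())
        } else {
            Delta::Text(text.to_string())
        })
    }
}

/// Default projection: reasoning deltas never reach the content stream.
fn visible_content(deltas: &[Delta]) -> String {
    deltas
        .iter()
        .filter_map(|d| match d {
            Delta::Text(t) => Some(t.as_str()),
            Delta::Reasoning(_) => None,
        })
        .collect()
}

/// Opt-in header check: case-insensitive `1`, `true`, or `yes`.
fn include_thinking(header: Option<&str>) -> bool {
    matches!(
        header.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("1" | "true" | "yes")
    )
}
```

The opt-in re-wrap path would emit the literal marker text around `Reasoning` deltas instead of filtering them out; it is omitted here to keep the sketch short.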
rob thijssen	fdc0adb738	docs(helexa-acp): README + example config for end-user onboarding Stage 7. Walks a new user from "never heard of helexa-acp" to "chatting via Zed against helexa or a public API in 10 minutes": - crates/helexa-acp/README.md — install (from source / COPR), quick-start env-var path, multi-endpoint TOML, full Zed setup, endpoint cookbook (cortex/neuron, OpenAI, Anthropic, OpenRouter, LM Studio, multi-cortex), three session modes (Default / Bypass / Plan) with their tool tables, tool surface + path-handling rules, session resume, context compaction, troubleshooting for the five failure modes a new user is likely to hit, and architecture reference for contributors. 
- helexa-acp.example.toml — copy-paste-and-edit starter config at the repo root, mirroring the existing cortex.example.toml / neuron.example.toml pattern. No code changes. fmt + clippy clean as a sanity check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 14:25:56 +03:00
rob thijssen	8fa1d1962e	feat(helexa-acp): anthropic-messages provider Stage 6b. Third provider impl, completing the wire-format trio (openai-chat, openai-responses, anthropic-messages). Lets a helexa-acp endpoint configured with `wire_api = "anthropic-messages"` drive Claude models — either against Anthropic directly or via cortex's /v1/messages translation surface. ## Encoder (CompletionRequest → Anthropic body) - System messages flatten to the top-level `system` field (concatenated with blank lines when there are multiple). - User text → `{role:"user", content:"..."}`. 
- User MultiPart (text + images) → `content` array with Anthropic's distinct image shape: `{type:"image", source:{type:"base64", media_type, data}}` — structurally different from OpenAI's `image_url` data URI. - Assistant text → `{role:"assistant", content:"..."}`. - Assistant tool_calls → `content` array with optional `{type:"text"}` block plus one `{type:"tool_use", id, name, input:<parsed json>}` per call. The internal arguments JSON string is parsed back to a Value before encoding (Anthropic requires the parsed form); malformed JSON falls back to a String input so the request body still serialises. - Tool result → `{role:"user", content:[{type:"tool_result", tool_use_id, content}]}` per Anthropic's convention (no separate `tool` role). - `max_tokens` is required by Anthropic; defaults to 8192 when the request doesn't specify. ## Decoder (Anthropic SSE → CompletionEvent) Named SSE events: - `message_start` → captures input_tokens from `usage` for the eventual UsageStats. - `content_block_start` (type=text) → TextDelta (initial text, if any). - `content_block_start` (type=tool_use) → ToolCallStart; if a pre-buffered `input` is present, also emits a single ToolCallArgsDelta. - `content_block_start` (type=thinking, for extended-thinking models) → ReasoningDelta. - `content_block_delta` (text_delta) → TextDelta. - `content_block_delta` (input_json_delta) → ToolCallArgsDelta, correlated by block index. - `content_block_delta` (thinking_delta) → ReasoningDelta. - `message_delta` → Usage (final output_tokens) + Finish with stop_reason mapped: end_turn/stop_sequence → "stop", max_tokens → "length", tool_use → "tool_calls". - `message_stop` → stream terminates. - `ping` ignored (Anthropic's keep-alive). - `error` → yields Err and ends the stream. ## Wiring - Authentication: `x-api-key` + `anthropic-version: 2023-06-01` headers (not Bearer). Both ship when api_key is configured; servers that don't care (cortex) ignore them. 
- `WireApi::AnthropicMessages` in build_provider now constructs the provider instead of erroring "reserved for future". - `provider::mod.rs` registers the new module. 18 new unit tests: encoder (system collapse, multi-system concat, default max_tokens, multipart with image, tool_use blocks, tool results, malformed JSON arg fallback), decoder (text streaming, tool_use lifecycle, max_tokens→length mapping, empty deltas, ping events, error events, cancellation, malformed payload skip, thinking blocks). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 14:01:59 +03:00
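Two of the small mappings in the anthropic-messages commit above are easy to show directly: the `stop_reason` translation and the `max_tokens` default. The function names are hypothetical, and passing unknown stop reasons through unchanged is an assumption, since the commit does not say how they are handled.

```rust
/// Maps Anthropic `stop_reason` values to OpenAI-style finish reasons,
/// following the table in the commit message.
fn map_stop_reason(stop_reason: &str) -> &str {
    match stop_reason {
        "end_turn" | "stop_sequence" => "stop",
        "max_tokens" => "length",
        "tool_use" => "tool_calls",
        // Assumption: anything else is passed through as-is.
        other => other,
    }
}

/// Anthropic requires `max_tokens`; the commit states a default of 8192.
fn effective_max_tokens(requested: Option<u32>) -> u32 {
    requested.unwrap_or(8192)
}
```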
rob thijssen	1818dfb337	feat(helexa-acp): openai-responses provider Stage 6a. Implements the `Provider` trait for OpenAI's Responses API surface, parallel to the existing `OpenAIChatProvider`. Lets a helexa-acp endpoint configured with `wire_api = "openai-responses"` drive a `/v1/responses` server (today: neuron through cortex; later: OpenAI directly) using the same agent-loop machinery the chat provider already supports. ## Encoder (CompletionRequest → Responses body) - System messages collapse into a single top-level `instructions` string. Multiple system messages concatenate with blank lines so ordering is preserved. - User messages become `{type:"message", role:"user", content:…}` input items. 
Text content stays a bare string; MultiPart content (text + images, post-Stage 5) becomes a `[{type:"input_text"}, {type:"input_image"}]` array with images encoded as `data:{mime};base64,{data}` URIs — exactly the shape neuron's `wire::openai_responses::request_to_chat` accepts. - Assistant text turns become an `output_text` content part inside a `message` item. - Assistant tool-call turns become `function_call` input items. - Tool result turns become `function_call_output` input items. - `max_tokens` translates to `max_output_tokens`. ## Decoder (Responses SSE → CompletionEvent) Reads named events on the SSE `event:` line: - `response.output_text.delta` → `CompletionEvent::TextDelta` - `response.output_item.added` with `type:"function_call"` → `CompletionEvent::ToolCallStart` (and, when the upstream pre-buffers fully, a single `ToolCallArgsDelta`) - `response.function_call_arguments.delta` → `CompletionEvent::ToolCallArgsDelta`, correlated back to the tool-call slot by output_index. - `response.completed` → `CompletionEvent::Usage` (if present) + `CompletionEvent::Finish` with reason mapped from `status`: `"completed"` → `"stop"`, `"incomplete"` → `"length"`. - Bookkeeping events (`response.created`, `response.in_progress`, `.content_part.`, `.output_text.done`, `.output_item.done`, `.function_call_arguments.done`, reasoning_) are skipped. ## Wiring - `EndpointConfig::responses_url()` joins `{base_url}/responses`. - `WireApi::OpenAiResponses` in `build_provider` constructs the new provider (was previously a "reserved for future" error). - `provider::mod.rs` registers the new module. ## Cuts (carried over from neuron-side issues) - The decoder's `ToolCall` handling fires correctly when the upstream emits `function_call` items, but the neuron candle harness doesn't yet (Refs #6). Real tool-call testing against cortex+neuron stays on the chat path until #6 lands. 
- Reasoning events (`response.reasoning_`) are deliberately dropped today; once neuron emits `InferenceEvent::ReasoningDelta` (Refs #5) the projector on the neuron side will start firing the reasoning event family and this decoder will need a matching case to route them to `CompletionEvent::ReasoningDelta`. 13 new unit tests cover encoder (system collapse, multipart user input, assistant output_text encoding, tool-call round-trip via function_call items) and decoder (text streaming, empty deltas dropped, length finish, function_call lifecycle, inline-arguments shape, cancellation, malformed payload skip). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 11:30:25 +03:00
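The encoder's system-message collapse and the decoder's status mapping from the openai-responses commit above can be sketched as two pure functions. Names are hypothetical; returning `None` for no system messages and for unrecognised statuses is an assumption made for the example.

```rust
/// Collapses system messages into one `instructions` string, joined with
/// blank lines so the original ordering is preserved.
fn collapse_system(messages: &[&str]) -> Option<String> {
    if messages.is_empty() {
        None // assumption: omit `instructions` when there is nothing to send
    } else {
        Some(messages.join("\n\n"))
    }
}

/// Maps the `status` on `response.completed` to a finish reason.
fn finish_reason(status: &str) -> Option<&'static str> {
    match status {
        "completed" => Some("stop"),
        "incomplete" => Some("length"),
        _ => None, // assumption: other statuses carry no finish reason
    }
}
```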
rob thijssen	5ed1140c97	feat(cortex-gateway): proxy /v1/responses to neuron Step 3 of the Responses rollout: plain proxy route on the gateway, no translation. Neuron speaks the Responses API natively after Step 2 (commit `957f704`), so the gateway just needs the same routing shape it uses for /v1/chat/completions — extract `model`, resolve via router::resolve, forward verbatim. - New `POST /v1/responses` handler in handlers.rs::responses. - Mock neuron under tests/common/mod.rs gains a `/v1/responses` endpoint that mirrors the ResponsesResponse shape neuron emits. - New integration test file `tests/responses.rs` exercises: - Happy path (200, body round-trips, ResponsesUsage shape). 
- Unknown model → 404 (matches chat-completions error shape). - Missing `model` field → 400 (same extract_model helper). Streaming proxy works through the same path as chat completions — the upstream Content-Type (`text/event-stream` for stream:true, `application/json` otherwise) propagates through proxy_with_metrics unchanged. Live-stream integration tests against a streaming mock deferred until we exercise the path against a real neuron, since the chat-completions streaming test already covers the proxy's SSE forwarding mechanics. Three new tests; clippy + fmt clean across the workspace. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 11:21:43 +03:00
rob thijssen	957f704efa	feat(neuron): OpenAI Responses API + ci cuda-check runner label
Step 2 of the Responses rollout: native `/v1/responses` endpoint on neuron that consumes the same InferenceEvent stream as `/v1/chat/completions` but emits it as the Responses API's named SSE event family. No gateway-side translation. ## Surface - `cortex-core::responses` envelope types: `ResponsesRequest`, `ResponsesInput` (text | items), `ResponsesInputItem` (message | function_call | function_call_output | reasoning), `ResponsesContentPart` (input_text | input_image | output_text), `ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. 
Plus a `events::*` constant module so the projector and the wire shape stay in sync without string-typos. - `neuron::wire::openai_responses`: - `request_to_chat(req)` flattens Responses input + instructions into a `ChatCompletionRequest` the candle harness already understands. Text-only Parts collapse to a string; mixed text+image Parts go to chat's content-array shape; reasoning items drop; function_call / function_call_output round-trip via tool_calls / tool_call_id metadata so the surface is consistent for the day the harness emits tool calls. - `project_responses_stream(rx, meta)` reads InferenceEvents and emits the eight named events that compose a Responses stream: response.created → output_item.added → content_part.added → output_text.delta×N → output_text.done → content_part.done → output_item.done → response.completed. Synthesises start frames if the producer skips Start (poisoned model, early disconnect) so the stream stays coherent. - `build_response(meta, text, reason, usage)` for the non-streaming path. - `CandleHarness::inference_stream(req)` extracted from `chat_completion_stream`, returning a typed `InferenceStream` (event receiver + id/created/model_id metadata). Both `chat_completion_stream` and the new `responses_stream` are now thin wrappers that pick their wire projection. TP path got the same treatment (`chat_completion_tp_stream` → `inference_tp_stream`). - `POST /v1/responses` route on neuron. Non-streaming returns one buffered `ResponsesResponse`; streaming returns axum SSE with both event names and JSON data per frame (Responses, unlike chat completions, uses named `event:` lines). Reused `inference_error_response` helper hoisted out so the chat and responses handlers share the InferenceError → HTTP mapping. ## CI Also bundles the `cuda-check` runner-label fix from feedback on commit `1859777`: `runs-on: rpm` doesn't ship the CUDA toolkit so cudarc's nvcc-version build script blew up. 
Switched to `runs-on: cuda-13.0` per the existing labels. ## Scope cuts (documented in the modules) - `previous_response_id` rejected at translate time with 400 (`code: chained_conversation_not_supported`) — stateful chained conversations need a persistence layer we haven't built. - Reasoning items dropped (no Qwen3 `<think>` routing yet). - Single output item per response (one `"message"` carrying text); `function_call` items reserved but not synthesised. - Streaming events cover the core set; `response.in_progress` and the web_search / image_generation event families are out-of-scope. 22 new tests: 5 in cortex-core (envelope round-trips), 13 in neuron::wire (request translator + projector + non-streaming builder), 4 in neuron's tests/api.rs (route surface — 503 when no candle, 400 on previous_response_id, 404 on missing model for both stream and non-stream). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 11:13:44 +03:00
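The frame ordering described for `project_responses_stream` can be sketched in plain Rust. This is an illustration only, not neuron's code: `InferenceEvent` is a simplified stand-in, `project_event_names` is a hypothetical helper, and the frame names are copied as abbreviated in the commit message. Start frames are synthesised on the first event of any kind, which is one possible reading of "synthesises start frames if the producer skips Start".

```rust
/// Simplified stand-in for the harness event enum (variant names from the log).
#[derive(Debug, Clone, PartialEq)]
pub enum InferenceEvent {
    Start,
    TextDelta(String),
    Finish,
}

/// Hypothetical helper: map an event sequence to the ordered frame names.
/// The three opening frames are emitted once, even if `Start` never arrives.
pub fn project_event_names(events: &[InferenceEvent]) -> Vec<&'static str> {
    const OPENING: [&str; 3] = ["response.created", "output_item.added", "content_part.added"];
    const CLOSING: [&str; 4] = [
        "output_text.done",
        "content_part.done",
        "output_item.done",
        "response.completed",
    ];

    let mut out = Vec::new();
    let mut started = false;
    for ev in events {
        if !started {
            out.extend_from_slice(&OPENING);
            started = true;
        }
        match ev {
            InferenceEvent::Start => {}
            InferenceEvent::TextDelta(_) => out.push("output_text.delta"),
            InferenceEvent::Finish => {
                out.extend_from_slice(&CLOSING);
                break; // nothing may follow the terminal frame
            }
        }
    }
    out
}
```

A stream of one text delta followed by `Finish`, with no explicit `Start`, still yields the full eight-frame sequence.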
rob thijssen	6927286cab	fix(neuron): clone id/model_id before TP spawn so wire projector can use them
The Step 1 refactor moved the InferenceEvent receiver wrap to after the orchestration spawn in chat_completion_tp_stream, but the spawn moves both `id` and `model_id` into its async closure (used heavily by acquire_pool_lock, NCCL ops, and tracing). Result: borrowck error E0382 use-of-moved-value on the wire_chat::project_chat_stream call. The non-CUDA build doesn't exercise this branch (it lives behind `#[cfg(feature = "cuda")]`) which is why the workspace clippy/test gate passed locally and on the regular CI workflow. 
The RPM build workflow, which compiles with --features cuda, caught it (run 244 jobs 2/3/4 against beast / ampere / ada respectively, all the same error). Fix: snapshot `id` and `model_id` into `projector_id` / `projector_model_id` before the spawn, use those at the projector call site. The originals stay free to be moved into the closure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 09:37:10 +03:00
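The shape of this fix can be reproduced with a plain thread instead of an async task. A minimal sketch, assuming nothing about neuron beyond the names `projector_id` and `projector_model_id` from the commit message; `spawn_then_project` is a hypothetical function.

```rust
use std::thread;

/// Hypothetical stand-in for the spawn-then-project sequence described above.
pub fn spawn_then_project(id: String, model_id: String) -> (String, String) {
    // Snapshot what the projector needs *before* the closure takes ownership.
    let projector_id = id.clone();
    let projector_model_id = model_id.clone();

    // `move` transfers `id` and `model_id` into the worker. Touching either
    // after this line would be error E0382 (use of moved value).
    let worker = thread::spawn(move || format!("worker {} running {}", id, model_id));
    let worker_log = worker.join().expect("worker thread panicked");

    // The snapshots are still owned here, so the projector call site compiles.
    let projected = format!("{}/{}", projector_id, projector_model_id);
    (worker_log, projected)
}
```

Cloning two short strings per request is cheap next to inference, which is presumably why a snapshot was preferred over restructuring the closure.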
rob thijssen	302ccfb982	refactor(neuron): introduce InferenceEvent + wire projection layer
Step 1 of the OpenAI Responses API rollout. Pure refactor: no new endpoints, no behaviour change on the wire. Lays the seam for emitting Responses-shaped streaming events from the same harness output as chat completions in Step 2. - New `neuron::wire` module tree: - `wire::event::InferenceEvent`: format-agnostic enum (Start, TextDelta, ReasoningDelta, Finish) the candle harness now emits as its native streaming currency. - `wire::event::FinishReason`: typed reason that maps cleanly onto OpenAI `finish_reason`, OpenAI Responses `status`, and Anthropic `stop_reason` strings. 
- `wire::openai_chat::project_chat_stream` — async task that consumes an InferenceEvent receiver and produces a ChatCompletionChunk receiver, stamping per-request metadata (id, created, model_id) onto every chunk. Output matches the pre-refactor wire shape bit-for-bit. - candle.rs refactored to emit InferenceEvent on its internal channel through all three streaming paths (CPU run_inference_streaming, CUDA single-GPU stream_inference_via_worker, CUDA TP chat_completion_tp_stream). The streaming functions lost their id/created/model_id parameters since wire-format metadata now lives in the projector. - emit_delta + emit_delta_blocking simplified to single-purpose TextDelta emitters with no wire-format coupling. - chat_completion_stream wraps the InferenceEvent receiver in wire_chat::project_chat_stream before returning so the /v1/chat/completions HTTP handler keeps consuming ChatCompletionChunks unchanged. External signature preserved. Also fixes a pre-existing helexa-acp test race (three modules each declared their own static LOCK for HOME mutation, so cross-module parallelism flaked tests that read HOME at runtime). Consolidated onto a single crate-wide path_util::ENV_LOCK. 122 helexa-acp tests + 44 neuron tests pass (5 new wire projection tests). fmt + clippy --workspace -- -D warnings clean. Ran helexa-acp suite 3x to confirm the env race is closed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:30:17 +03:00
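A typed finish reason that projects onto three wire vocabularies might look like the sketch below. The variant set (`Stop`, `Length`) is an assumption, since the log names the type but not its variants; the strings are the publicly documented values for each API, and neuron's actual mapping may differ.

```rust
/// Hypothetical variant set; the commit names the type, not its variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinishReason {
    /// The model ended its turn on its own.
    Stop,
    /// Generation hit the token limit.
    Length,
}

impl FinishReason {
    /// OpenAI chat completions `finish_reason`.
    pub fn openai_finish_reason(self) -> &'static str {
        match self {
            FinishReason::Stop => "stop",
            FinishReason::Length => "length",
        }
    }

    /// OpenAI Responses top-level `status`.
    pub fn responses_status(self) -> &'static str {
        match self {
            FinishReason::Stop => "completed",
            FinishReason::Length => "incomplete",
        }
    }

    /// Anthropic Messages `stop_reason`.
    pub fn anthropic_stop_reason(self) -> &'static str {
        match self {
            FinishReason::Stop => "end_turn",
            FinishReason::Length => "max_tokens",
        }
    }
}
```

Keeping the enum format-agnostic means each projector owns exactly one of these methods and the harness never sees a wire string.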
rob thijssen	df0abfe4d4	feat(helexa-acp): image input for vision-capable models
Stage 5. Zed clipboard/DnD images get forwarded as OpenAI content-array messages on user turns. - New MessageContent::MultiPart variant + MessagePart (Text|Image) + ImageData struct (mime_type, base64 data, optional uri). - flatten_prompt now produces structured content: collapses to Text when every block is text (some upstreams treat array-form as vision-only and refuse on text-only models), otherwise produces MultiPart preserving block order. - OpenAI encoder emits `[{type:"text",text:…}, {type:"image_url", image_url:{url:"data:{mime};base64,{data}"}}]` for MultiPart user messages. 
Data URIs are used over remote `uri` because they round-trip through every upstream we care about. - prompt_capabilities.image = true at initialize so Zed actually sends image blocks. - compaction estimates ~512 tokens per image (the middle of the Qwen3-VL / OpenAI detail range) so the budget tracker doesn't pretend images are free. - session/load replays image-bearing user turns by surfacing the text parts verbatim and rendering each image as a "[image: {mime} ({n} bytes)]" placeholder chunk — Zed can show the prior text context even though re-uploading the bytes through ACP isn't meaningful for resume. - 4 new tests: flatten produces MultiPart in block order, image-only prompts still flatten to MultiPart, encoder emits the correct array shape, text-only encoding stays as the string form. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 09:43:00 +03:00
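The collapse rule and the data-URI shape from this commit can be sketched as follows. The types are simplified stand-ins for the ones named in the message (the real `MessageContent::Text` is described elsewhere in the log as a struct variant), and joining text blocks with a newline is an assumption.

```rust
/// Simplified stand-ins for the types named in the commit message.
#[derive(Debug, Clone, PartialEq)]
pub enum MessagePart {
    Text(String),
    /// `data` is assumed to be base64-encoded already.
    Image { mime_type: String, data: String },
}

#[derive(Debug, Clone, PartialEq)]
pub enum MessageContent {
    Text(String),
    MultiPart(Vec<MessagePart>),
}

/// Collapse to `Text` when every block is text; otherwise keep block order.
pub fn flatten_prompt(blocks: Vec<MessagePart>) -> MessageContent {
    let all_text = blocks.iter().all(|b| matches!(b, MessagePart::Text(_)));
    if !all_text {
        return MessageContent::MultiPart(blocks);
    }
    let joined = blocks
        .iter()
        .filter_map(|b| match b {
            MessagePart::Text(t) => Some(t.as_str()),
            MessagePart::Image { .. } => None,
        })
        .collect::<Vec<_>>()
        .join("\n"); // separator is an assumption
    MessageContent::Text(joined)
}

/// Build the `data:{mime};base64,{data}` URL used in `image_url.url`.
pub fn image_data_uri(mime_type: &str, base64_data: &str) -> String {
    format!("data:{};base64,{}", mime_type, base64_data)
}
```

An image-only prompt stays `MultiPart` under this rule, which matches the test list in the commit message.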
rob thijssen	b9016571f6	feat(helexa-acp): expand ~ / $HOME and fall back to local fs on ACP read errors
Two related polish fixes for daily use: - New `path_util` module expands `~`, `~/…`, `$HOME`, and `$HOME/…` prefixes in every tool that takes a path (read_file, write_file, edit_file, list_dir, bash cwd). The expansion is also applied to the plan-mode write gate so `~/.local/share/helexa-acp/plans/…` comparisons behave correctly regardless of which form the model emits. - `read_file` now falls back to `std::fs::read_to_string` when ACP's `fs/read_text_file` errors out. 
Zed's workspace-scoped read was the source of "model can't see ~/git/architecture/generic.md" when the session cwd is a different project; the fallback lets the agent pull in shared material that lives outside the active workspace, the same way `list_dir` already does via local `std::fs::read_dir`. Local fallback honours line/limit args. The fallback also produces a combined error message when both ACP and local-fs reads fail, so the model sees what actually broke rather than just the ACP-side error. 14 new unit tests cover path_util's prefix matrix, fallback success/failure paths, and the line/limit slicing in fallback. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 09:28:58 +03:00
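The prefix matrix handled by `path_util` can be illustrated with a small pure function. This is a sketch under assumptions: `expand_home` is a hypothetical name, and the home directory is passed in explicitly rather than read from the environment, which keeps the function easy to test without touching `HOME`.

```rust
use std::path::PathBuf;

/// Expand a leading `~`, `~/…`, `$HOME`, or `$HOME/…` against `home`.
/// Anything else (including `~user/…` and `$HOMEDIR`) is returned unchanged.
pub fn expand_home(input: &str, home: &str) -> PathBuf {
    for prefix in ["~", "$HOME"].iter() {
        if let Some(rest) = input.strip_prefix(*prefix) {
            if rest.is_empty() {
                // Bare `~` or `$HOME`.
                return PathBuf::from(home);
            }
            if let Some(tail) = rest.strip_prefix('/') {
                // Drop extra slashes so `join` never sees an absolute tail.
                return PathBuf::from(home).join(tail.trim_start_matches('/'));
            }
        }
    }
    PathBuf::from(input)
}
```

Requiring either end-of-string or `/` after the prefix is what keeps `~other/x` and `$HOMEDIR` from being rewritten.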
rob thijssen	adbc52bfcd	feat(helexa-acp): model picker + session/set_model handler
Stage 4. Zed's model dropdown now lists every model from every configured endpoint, and switching it routes the next prompt to a new endpoint+model. - Enable `unstable_session_model` on the agent-client-protocol dep so SessionModelState / SetSessionModelRequest / ModelInfo are available. - Agent::new becomes async and calls Provider::list_models on every provider at startup; per-endpoint failures warn-and-skip instead of aborting the agent. - With a single endpoint configured, model ids appear bare; with multiple endpoints every id carries the `endpoint:` prefix so the picker is unambiguous and parse_model_selector routes correctly. 
- NewSessionResponse and LoadSessionResponse attach SessionModelState with the session's current model id + the aggregated catalogue. - session/set_model: validates the requested model id against resolve_provider, mutates session.model_id, and persists so the on-disk transcript reflects the new model. Three new aggregate_models tests cover the prefixing rule (bare vs multi-endpoint) and warn-and-skip on a failing endpoint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 09:10:16 +03:00
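The prefixing rule (bare ids for one endpoint, `endpoint:` prefix for several) is easy to show in isolation. The sketch below is an assumption-laden illustration: `EndpointModels`, `aggregate_model_ids`, and `split_selector` are hypothetical names, and `split_selector` is only a guess at the kind of split `parse_model_selector` performs.

```rust
/// One configured endpoint and the model ids it reported (hypothetical shape).
pub struct EndpointModels {
    pub endpoint: String,
    pub models: Vec<String>,
}

/// Bare ids with a single endpoint; `endpoint:model` once there are several.
pub fn aggregate_model_ids(endpoints: &[EndpointModels]) -> Vec<String> {
    let multi = endpoints.len() > 1;
    let mut out = Vec::new();
    for ep in endpoints {
        for model in &ep.models {
            if multi {
                out.push(format!("{}:{}", ep.endpoint, model));
            } else {
                out.push(model.clone());
            }
        }
    }
    out
}

/// Split on the first `:`; ids without one have no endpoint part.
pub fn split_selector(selector: &str) -> (Option<&str>, &str) {
    match selector.split_once(':') {
        Some((endpoint, model)) => (Some(endpoint), model),
        None => (None, selector),
    }
}
```

One caveat with this naive split: a bare model id that itself contains `:` would be misread as `endpoint:model`, so a real parser would need to check the left-hand side against the configured endpoint names.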
rob thijssen	537a0fe7f2	feat(helexa-acp): context compaction for small-context local models
A new src/compaction.rs module projects rolling conversation history into a token budget before each completion. Older tool results and assistant prose get elided to one-line markers; system prompts, user turns, and the last KEEP_TAIL=4 messages stay verbatim. tool_call_id pairing is preserved so OpenAI strict-schema providers keep working. Driven by a new per-endpoint `context_window` config field (also HELEXA_ACP_CONTEXT_WINDOW for the env-only single-endpoint case). When set, prompt budget = context_window - max_tokens - 512_safety; when unset, behaviour is unchanged. 
Without this, a 32 K Qwen3 dies with `prompt_too_long` after the first few read_file results pile up in history — the symptom seen in plan-mode dogfooding on beat. 10 new unit tests cover the compaction strategy and the prompt budget arithmetic. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 08:22:01 +03:00
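The budget arithmetic from this commit, written out as a function. A minimal sketch: `prompt_budget` and `SAFETY_MARGIN` are hypothetical names, and saturating at zero when `max_tokens` exceeds the window is an assumption about edge-case handling that the log does not describe.

```rust
/// Safety margin subtracted from the window (the `512_safety` term above).
pub const SAFETY_MARGIN: u32 = 512;

/// prompt budget = context_window - max_tokens - 512.
/// `None` means no window is configured, so no compaction budget applies.
pub fn prompt_budget(context_window: Option<u32>, max_tokens: u32) -> Option<u32> {
    context_window.map(|window| {
        window
            .saturating_sub(max_tokens)
            .saturating_sub(SAFETY_MARGIN)
    })
}
```

For a 32 768-token window with `max_tokens = 4096`, the history has to fit in 32 768 - 4 096 - 512 = 28 160 tokens.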
rob thijssen	cbadfcf112	feat(helexa-acp): plan mode - third session mode for read-and-plan-only flows
Plan mode is the most restrictive of the three session modes: bash is disabled outright, writes are confined to a per-project plan directory under $XDG_DATA_HOME/helexa-acp/plans/<basename>-<8hex>/, and reads / list_dir are unrestricted. The system prompt is rebuilt at the top of every round so a mid-turn switch into (or out of) plan mode takes effect on the next streaming round, and plan mode appends a 3-option menu instructing the model to stop and let the user pick how to proceed once the plan is complete. 
The project id is basename + FNV-1a-32 of the cwd so it stays stable across runs (SipHash's DefaultHasher reseeds per process), while still disambiguating multiple checkouts that share a final path component. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 08:06:25 +03:00
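A project id of the form `<basename>-<8hex>` can be derived as sketched below. The FNV-1a constants are the standard 32-bit offset basis and prime; `project_id` and the `"root"` fallback for a path with no final component are hypothetical choices, not taken from the source.

```rust
use std::path::Path;

/// 32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
pub fn fnv1a_32(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
    for &b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}

/// `<basename>-<8 hex digits>`, hashing the full cwd string.
pub fn project_id(cwd: &str) -> String {
    let base = Path::new(cwd)
        .file_name()
        .and_then(|name| name.to_str())
        .unwrap_or("root"); // fallback is an assumption
    format!("{}-{:08x}", base, fnv1a_32(cwd.as_bytes()))
}
```

Because the hash has no per-process seed, the same cwd maps to the same directory on every run, while `/a/cortex` and `/b/cortex` get different suffixes despite sharing a basename.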
rob thijssen	3ecbb21ece	fix(helexa-acp): persist per round, cancel previous prompt, log loop
Three changes addressing "session stops mid-turn and disk store doesn't update": 1. Per-round persistence. drive_prompt previously called store::save() once at the very end of the turn. If the loop stalled in a later round (long-running bash, upstream SSE that never finished, wedged ACP roundtrip), earlier successful rounds lived only in the spawned task's `new_turns` and never reached disk. Move the extend-history + save into a helper (extend_and_persist) and call it at the end of every loop iteration. The post-loop save catches whatever the break paths leave behind. Failure is logged not propagated. 2. 
Cancel previous in-flight prompt on new session/prompt. The handler used to overwrite SessionState.cancel with a fresh token without firing the old one. A wedged prior prompt would then live forever, holding session-state references and never persisting. Now we fire the existing cancel under the lock before installing the new token — the old task observes is_cancelled() on its next .await and unwinds. 3. Per-round and per-tool log lines. drive_prompt now emits: - INFO prompt round: streaming { round, of, history_turns } - INFO dispatch tool { tool, tool_call_id } - INFO dispatch tool complete { tool_call_id, is_error } - INFO prompt round complete; persisting { round, turns } - INFO prompt complete { stop_reason } so the next hang shows up by line number in /tmp/helexa-acp.log instead of as silence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 16:29:22 +03:00
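The second change, firing the old cancel token before installing a new one, can be sketched with std primitives. This is an illustration only: `CancelToken` here is a hand-rolled flag rather than whatever token type helexa-acp uses, and `begin_prompt` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Minimal shared cancellation flag (stand-in for the real token type).
#[derive(Clone, Default)]
pub struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Default)]
pub struct SessionState {
    pub cancel: Option<CancelToken>,
}

/// Fire the previous token (if any) under the lock, then install a fresh one.
pub fn begin_prompt(state: &Mutex<SessionState>) -> CancelToken {
    let mut guard = state.lock().expect("session lock poisoned");
    if let Some(old) = guard.cancel.take() {
        old.cancel(); // the prior task sees this at its next check and unwinds
    }
    let fresh = CancelToken::default();
    guard.cancel = Some(fresh.clone());
    fresh
}
```

Doing both steps while holding the lock closes the window in which two prompts could each believe they own the session.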
rob thijssen	0d841a4981	feat(helexa-acp): replay session history on session/load
session/list and session/load were both implemented but clicking a session in Zed's thread picker still left the agent panel empty. Zed (and ACP clients in general) doesn't cache the transcript for custom agent_servers entries; it only owns conversation state for first-party agents. For custom agents the expectation is that session/load returns successfully and the agent then re-emits the conversation as a stream of session/update notifications so the client can rebuild its view. Implement that replay path: - handle_load_session now returns (LoadSessionResponse, Vec<Message>) so the caller has the history available after the in-memory hydration finishes. 
- The session/load closure responds to the request first, then spawns a task that calls replay_history off the dispatch loop. - replay_history walks the persisted history and emits one session/update per turn: Role::User → UserMessageChunk(text) Role::Assistant text → AgentMessageChunk(text) Role::Assistant tool → AgentMessageChunk for any accompanying text + one ToolCall card per call (with kind/title/raw_input rendered the same way as the live dispatch path) Role::Tool result → ToolCallUpdate matching the assistant's call id, status: Completed, content set to the result text Role::System → skipped (system prompts aren't shown) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 16:02:00 +03:00
rob thijssen	0bbb9b752d	feat(helexa-acp): session/list so Zed can discover sessions to resume
Stage 3b only implemented the trailing half of resume: write sessions to disk + handle session/load. But Zed (and any ACP client) needs `session/list` to discover which session belongs to the workspace it's reopening; without it, the client only knows how to mint new sessions and resume never fires even though the JSON sits ready on disk. Add the missing pieces: - store::list / list_in_dir: enumerate {id}.json under sessions_dir(), optionally filter by cwd, sort recent-first. Skips unparseable files with a warn rather than aborting. 
- store::unix_to_iso8601 — RFC 3339 formatter for SessionInfo.updated_at; pulls chrono in directly (already in the dep tree transitively). - agent::handle_list_sessions — wires the request to the store, builds SessionInfo entries with derived titles (first user turn, truncated to 60 chars). - agent::initialize_response — advertise session_capabilities.list = {} alongside the existing load_session: true. Verified end-to-end against the user's real hxa-1.json (60-turn beat conversation): `session/list` returns the entry with cwd, derived title, and ISO 8601 timestamp. 4 new store unit tests for list filtering, missing-dir handling, unparseable-file skipping, and ISO 8601 formatting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 14:34:41 +03:00
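For readers curious what an RFC 3339 formatter has to do, here is a std-only sketch. It is not the commit's implementation, which uses chrono; this version swaps in the well-known days-to-civil-date arithmetic and always emits UTC with a `Z` suffix. The function names are reused or invented for illustration only.

```rust
/// Convert days since 1970-01-01 to (year, month, day) in the
/// proleptic Gregorian calendar (days-to-civil arithmetic).
fn civil_from_days(days: i64) -> (i64, i64, i64) {
    let z = days + 719_468; // shift epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of March-based year
    let mp = (5 * doy + 2) / 153; // March-based month, 0..=11
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Format Unix seconds as an RFC 3339 UTC timestamp.
pub fn unix_to_iso8601(secs: i64) -> String {
    let days = secs.div_euclid(86_400);
    let rem = secs.rem_euclid(86_400);
    let (year, month, day) = civil_from_days(days);
    format!(
        "{:04}-{:02}-{:02}T{:02}:{:02}:{:02}Z",
        year,
        month,
        day,
        rem / 3_600,
        (rem % 3_600) / 60,
        rem % 60
    )
}
```

In practice a date library is the safer choice, which is presumably why the commit pulls in chrono rather than hand-rolling this.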
rob thijssen	5aac1ffc59	feat(helexa-acp): session resume via session/load
Zed restarts (frequent during helexa-acp dogfooding) used to lose every conversation because we'd ignore the load_session capability and treat every project-reopen as a fresh session/new. Persist sessions to disk and honour session/load so the agent panel comes back where it left off. Storage layout: $XDG_DATA_HOME/helexa-acp/sessions/{session_id}.json Each file holds session_id, cwd, model_id, mode_id, full Message history, plus created/updated timestamps. Atomic save via tempfile+rename so a crash mid-write can't corrupt the store. 
Touch points: - src/store.rs (new) — sessions_dir() resolution, save/load via default and explicit-dir entry points (so unit tests don't have to race on XDG_DATA_HOME). 5 unit tests cover round-trip, not-found errors, atomic overwrite, tool-call/result preservation, and the filename sanitiser's path-traversal handling. - src/provider/mod.rs — Serialize/Deserialize on Role, Message, MessageContent, ToolCall. MessageContent::Text turned into a struct variant ({text: ...}) so internally-tagged JSON works. - src/agent.rs — initialize_response advertises load_session: true; handle_load_session reads the file, snapshots in-memory state, returns LoadSessionResponse with the persisted mode preselected; drive_prompt persists at the end of every prompt round under the session lock with the I/O outside the lock. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 13:34:42 +03:00
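The "atomic save via tempfile+rename" idea from 5aac1ffc59 can be sketched roughly as follows. This is a minimal std-only illustration, not the code in src/store.rs: the function name `atomic_save` and the `.json.tmp` sibling naming are assumptions made for the example.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `path` so that readers see either the old file or the
/// complete new one, never a partial write. Hypothetical helper name.
fn atomic_save(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Assumed naming: a sibling temp file in the same directory, so the
    // rename below stays on one filesystem.
    let tmp = path.with_extension("json.tmp");
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(bytes)?;
        file.sync_all()?; // flush data to disk before publishing it
    }
    // Replaces any existing destination in a single step.
    fs::rename(&tmp, path)
}
```

Calling `atomic_save` twice on the same path simply replaces the earlier file. If the process dies during `write_all`, only the temp file is left incomplete and the previously saved session stays readable.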
rob thijssen	ec2b6450b2	feat(helexa-acp): infer tool name from arg shape when model omits it
Qwen3.6-27B occasionally emits a <tool_call> body with the right arguments but no top-level `name` field — observed in the field as mkdir-style bash calls like {"arguments":{"command":"mkdir -p .../doc/plan/{01-discovery,...}"}} with no `name`. The agent had no tool to dispatch and surfaced a Failed card; the model would then hang or retry the same shape. Add a shape-based inference layer: - tools::infer_tool_name(arguments) — given an `arguments` object alone, return Some(name) when the key set uniquely identifies one tool: `{command}` or `{command,cwd}` → bash, `{path,content}` → write_file, `{path,old_text,new_text}` → edit_file.
Ambiguous shapes (`{path}` alone — could be read_file or list_dir) return None so the agent still emits a Failed card rather than guessing. - agent::try_repair_missing_name(raw) — parses a malformed body, applies infer_tool_name, returns (name, args_json) on success. - drive_prompt sweeps malformed_calls through this repair before the Failed-card path. Recovered calls go into tool_buckets at the next free index and dispatch through the normal tool loop. 10 new unit tests in tools::tests cover the inference table plus the verbatim mkdir failure from the field log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 13:14:50 +03:00
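The inference table described in ec2b6450b2 can be illustrated with a small sketch. It is a simplification under stated assumptions: the real tools::infer_tool_name takes a JSON `arguments` object, whereas this version takes only the key names so that it needs no JSON crate. The helper `same_keys` is hypothetical.

```rust
use std::collections::BTreeSet;

/// True when `set` holds exactly the keys in `want` (hypothetical helper).
fn same_keys(set: &BTreeSet<&str>, want: &[&str]) -> bool {
    set.len() == want.len() && want.iter().all(|k| set.contains(k))
}

/// Map a set of argument keys to a tool name when the shape is unambiguous.
/// Simplified signature: key names only, not the full JSON object.
fn infer_tool_name(keys: &[&str]) -> Option<&'static str> {
    let set: BTreeSet<&str> = keys.iter().copied().collect();
    if same_keys(&set, &["command"]) || same_keys(&set, &["command", "cwd"]) {
        Some("bash")
    } else if same_keys(&set, &["path", "content"]) {
        Some("write_file")
    } else if same_keys(&set, &["path", "old_text", "new_text"]) {
        Some("edit_file")
    } else {
        // Ambiguous or unknown shapes (for example `{path}` alone) are not guessed.
        None
    }
}
```

Matching on the exact key set rather than "contains these keys" is what keeps `{path}` ambiguous: it is a subset of several shapes but equal to none of the ones listed in the commit.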
rob thijssen	a494c8d43c	feat(helexa-acp): repair malformed tool calls and render failures as cards
Two related fixes for cases where Qwen3 sometimes emits slightly-off JSON inside <tool_call> blocks: 1. JSON repair pass in qwen3::parse_tool_call_body — strip up to three trailing extra `}` characters (model overshoots its closing braces), and hoist `name` out of `arguments` when it lands nested instead of as a sibling. Both observed in the field; both trivially repairable; both now dispatch as normal tool calls instead of falling back to the malformed path. 2. New CompletionEvent::MalformedToolCall variant for the cases repair can't fix.
decode_stream now emits it instead of wrapping the raw body in a TextDelta, and agent.rs surfaces each one as a Failed SessionUpdate::ToolCall card (so Zed renders it as a structured failure UI element rather than dumping the body inline) plus a synthetic tool-call/tool-result history pair so the model gets clear feedback for self-correction on the next round. Empty <tool_call></tool_call> blocks are now a no-op too (no Malformed event), matching the existing empty-<think> behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:58:51 +03:00
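The first repair in a494c8d43c, stripping up to three surplus closing braces, might look roughly like the sketch below. How qwen3::parse_tool_call_body actually detects the surplus is not stated in the commit; counting braces outside string literals is an assumption made here, and both function names are hypothetical. Hoisting a nested `name` is not shown because it needs a JSON parser.

```rust
/// Net brace depth outside JSON string literals; a negative result means the
/// text has more `}` than `{`. Hypothetical helper.
fn brace_balance(s: &str) -> i32 {
    let (mut depth, mut in_str, mut escaped) = (0i32, false, false);
    for c in s.chars() {
        if in_str {
            if escaped {
                escaped = false; // the character after a backslash is literal
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_str = false;
            }
        } else {
            match c {
                '"' => in_str = true,
                '{' => depth += 1,
                '}' => depth -= 1,
                _ => {}
            }
        }
    }
    depth
}

/// Drop at most three trailing `}` characters while the body is over-closed.
fn strip_extra_closing_braces(body: &str) -> &str {
    let mut out = body.trim_end();
    for _ in 0..3 {
        if brace_balance(out) < 0 && out.ends_with('}') {
            out = out[..out.len() - 1].trim_end();
        } else {
            break;
        }
    }
    out
}
```

The cap of three mirrors the commit message: a body that is still over-closed after three strips is left for the malformed-call path rather than repaired further.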
rob thijssen	abbedf8d8a	chore(neuron): bump default max_tokens from 512 to 8192
512 is too low for any modern coding model — clients that don't explicitly set max_tokens get clipped responses with no diagnostic. Bump the fallback at all four inference call sites (single-GPU streaming + non-streaming, TP leader + non-leader) to 8192, which fits comfortably within Qwen3-class context windows after a typical agent prompt and lines up with what helexa-acp / a0 / curl clients reasonably expect. Clients that explicitly set max_tokens (now including helexa-acp via HELEXA_ACP_MAX_TOKENS / per-endpoint TOML) override this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:38:28 +03:00
rob thijssen	6cc14e925c	feat(helexa-acp): per-endpoint max_tokens config
The agent was sending max_tokens: None, letting cortex/neuron pick its own default — which trips Zed's "Output Limit Reached" on long turns. Add a per-endpoint max_tokens option in EndpointConfig (TOML key and HELEXA_ACP_MAX_TOKENS env var for the single-endpoint fallback) that the agent threads into every CompletionRequest by endpoint name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:34:23 +03:00


150 Commits