cortex

Files

fix(neuron): stream tokens via DecodeStream to avoid UTF-8 panic

When BPE byte-fallback splits a multi-byte UTF-8 char (e.g. an emoji)
across multiple tokens, the previous "decode the cumulative token list,
byte-slice the delta against a stored prefix" pattern would panic with
'start byte index N is not a char boundary; it is inside <emoji>'.

The race: at step N the tokenizer renders the partial bytes as U+FFFD
(3 bytes); at step N+1 it can decode the complete codepoint (e.g. 4
bytes for 🌫). `decoded_prefix.len()` from step N then lands inside the
codepoint in step N+1's `full` string, and `&str[start..]` panics.

Replace with tokenizers' `DecodeStream::step(id)` which maintains an
internal byte buffer across token boundaries and only emits when a
clean codepoint completes. Applied at all three SSE emission sites:

  - stream_inference_via_worker (single-GPU CUDA stream)
  - chat_completion_tp_stream's spawned task (TP stream)
  - run_inference_streaming (CPU stream)

The shared emit helper splits into emit_delta (async, mpsc::send) and
emit_delta_blocking (sync, mpsc::blocking_send) so each path keeps its
existing send semantics. The old emit_chunk helper that did the
unsafe full-decode-and-slice is removed entirely.

Observed on beast 2026-05-27 17:49:55 — model emitted 🌫 in a tool-call
response after a long agent-zero session; the spawned TP stream task
panicked at candle.rs:2648. The model itself stayed healthy (no CUDA
fault), only the one streaming request died.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-27 18:01:24 +03:00

src

fix(neuron): stream tokens via DecodeStream to avoid UTF-8 panic

2026-05-27 18:01:24 +03:00

tests

refactor(neuron): phase 3 — TP forward + NCCL state move onto device worker

2026-05-27 10:16:02 +03:00

build.rs

feat(stage-8d-1): import mistralrs GDN CUDA kernels — build infra only

2026-05-21 11:34:11 +03:00

Cargo.toml

feat(stage-8d-7): direct safetensors fused-region loader

2026-05-21 17:49:35 +03:00