helexa/crates/cortex-gateway
rob thijssen cb758d4706
feat(neuron): emit usage on the streaming path so clients can track context
The deeper reason opencode showed "Context: 0 tokens / 0% used" and then
ran into a 400: streaming responses carried NO `usage`. Clients track
context (and trigger compaction) from the `usage` field; the legacy candle
streaming path set `usage: None` on every chunk, so a streaming client
had no token count at all. `max_model_len` alone is a denominator with
no numerator.

InferenceEvent::Finish now carries prompt_tokens + completion_tokens
(the streaming loops already have both: prompt_tokens.len() and the
generated all_tokens.len()). The openai_chat projector emits an
OpenAI-style trailing usage chunk (empty `choices`, populated `usage`)
after the finish chunk. cortex's Anthropic stream translator already
reads chunk.usage, so this fixes context tracking on BOTH the OpenAI
(opencode) and Anthropic (Claude Code) paths.
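
A minimal sketch of the projection described above, under stated assumptions:
the types below (`Choice`, `Chunk`, `Usage`, the `Token` variant, and the
`project` function) are hypothetical stand-ins, not the repository's actual
definitions. Only the shape comes from the message: `Finish` carries both
counts, and the finish chunk is followed by a chunk with empty `choices` and
populated `usage`.

```rust
// Hypothetical, simplified types. The real InferenceEvent and chunk structs
// are not shown in the commit message; only the field names are taken from it.
#[derive(Debug, Clone, PartialEq)]
enum InferenceEvent {
    Token(String),
    Finish { prompt_tokens: usize, completion_tokens: usize },
}

#[derive(Debug, Clone, PartialEq)]
struct Usage {
    prompt_tokens: usize,
    completion_tokens: usize,
    total_tokens: usize,
}

#[derive(Debug, Clone, PartialEq)]
struct Choice {
    delta: Option<String>,
    finish_reason: Option<&'static str>,
}

#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    choices: Vec<Choice>,
    usage: Option<Usage>,
}

// Project one inference event into zero or more OpenAI-style stream chunks.
fn project(event: &InferenceEvent) -> Vec<Chunk> {
    match event {
        // Ordinary content chunk: one choice, no usage.
        InferenceEvent::Token(text) => vec![Chunk {
            choices: vec![Choice { delta: Some(text.clone()), finish_reason: None }],
            usage: None,
        }],
        // Finish: emit the finish chunk first, then the trailing usage chunk.
        InferenceEvent::Finish { prompt_tokens, completion_tokens } => vec![
            Chunk {
                // "stop" is an assumed finish reason for this sketch.
                choices: vec![Choice { delta: None, finish_reason: Some("stop") }],
                usage: None,
            },
            Chunk {
                choices: Vec::new(), // empty `choices`, populated `usage`
                usage: Some(Usage {
                    prompt_tokens: *prompt_tokens,
                    completion_tokens: *completion_tokens,
                    total_tokens: prompt_tokens + completion_tokens,
                }),
            },
        ],
    }
}
```

With this shape, a client that reads `usage` from the last chunk gets the
numerator (`total_tokens`) to set against `max_model_len`.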

Also harden a sibling of the max_model_len plumbing: cortex re-polls
/discovery while a neuron's max_prompt_tokens is still 0 (unknown). A
rolling-deploy race in which cortex caches discovery before the neuron
reports the field therefore self-heals, instead of pinning max_model_len
to None until cortex is manually restarted.
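
The re-poll rule above can be sketched as follows. This is an illustration
only: `CachedDiscovery`, `needs_repoll`, `max_model_len`, and `refresh` are
hypothetical names, and the real cortex cache and its polling schedule are not
shown in the message. The sketch assumes 0 is the only "unknown" sentinel.

```rust
// Hypothetical cache entry for one neuron's /discovery response.
#[derive(Debug, Clone, PartialEq)]
struct CachedDiscovery {
    max_prompt_tokens: u64, // 0 means "unknown" (field not reported yet)
}

// Re-poll when nothing is cached or the cached limit is still unknown.
fn needs_repoll(cached: Option<&CachedDiscovery>) -> bool {
    match cached {
        None => true,
        Some(d) => d.max_prompt_tokens == 0,
    }
}

// Value advertised to clients: None while the limit is unknown.
fn max_model_len(cached: Option<&CachedDiscovery>) -> Option<u64> {
    cached.map(|d| d.max_prompt_tokens).filter(|&n| n > 0)
}

// Refresh the cache through `fetch` (a stand-in for the /discovery call),
// but only while the cached entry is missing or unknown. A failed fetch
// leaves the existing entry in place so the next tick can retry.
fn refresh<F>(cached: &mut Option<CachedDiscovery>, mut fetch: F)
where
    F: FnMut() -> Option<CachedDiscovery>,
{
    if needs_repoll(cached.as_ref()) {
        if let Some(fresh) = fetch() {
            *cached = Some(fresh);
        }
    }
}
```

Once a non-zero limit is cached, `refresh` stops calling `fetch`, so the extra
polling in this sketch lasts only as long as the race window.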

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 19:43:59 +03:00