feat(cortex-gateway): Rust-native context compressor for prompt token reduction #10
Motivation
cortex-gateway proxies requests to neuron transparently — no manipulation of prompt content. For long conversations, large tool outputs, or RAG-stuffed prompts, every input token consumes prefill GPU compute and KV cache memory on the neuron side.
Headroom (https://github.com/chopratejas/headroom) demonstrates 60–95% prompt-token reduction across JSON tool outputs, code, and prose via type-aware compression. It's Python/TS only and built for the API-payment economy. We want the same compression behaviour but Rust-native, embedded in cortex-gateway, sized to a self-hosted neuron's actual cost structure (prefill compute + KV VRAM, not $/token).
Scope
A compression middleware in cortex-gateway. Before proxying to neuron:
retrieve(id) via a tool call.
Open questions
retrieve(): how is it declared in the OpenAI tool schema, and does it interact with helexa-acp's tool routing?
Non-goals
References
crates/cortex-gateway/src/handlers.rs (proxy_with_metrics)
crates/cortex-core/src/translate.rs
Closing as out-of-scope: counterproductive under the project's stability contract (README, 2026-06-12). cortex's value as the per-operator proxy is that it is transparent and predictable: what you send is what the model sees. A gateway that silently rewrites prompts (lossy compression, dropped code bodies, extractive summaries) is precisely the opaque mid-pipeline behavior helexa positions itself against, and it would make quality regressions undebuggable (is it the model, the quant, or the compressor?).
The legitimate cost problem named here (prefill compute + KV VRAM on repeated long contexts) is better attacked without semantic alteration: #11 (prefix KV caching) eliminates ~90% of repeated prefill for agent workloads and is now on the 7→8 milestone, alongside #23 (chunked prefill) and #25 (speculative decoding).
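For intuition on why prefix KV caching fits agent workloads, here is a minimal sketch, not code from cortex or neuron. It assumes prompts are already tokenized into `u32` ids and that the KV entries for the previous request are still resident; the function names and the 900/100 token split are hypothetical, chosen only to illustrate the arithmetic, and are not a measurement.

```rust
/// Number of leading tokens shared by two token sequences.
fn shared_prefix_len(cached: &[u32], incoming: &[u32]) -> usize {
    cached
        .iter()
        .zip(incoming)
        .take_while(|(a, b)| a == b)
        .count()
}

/// Fraction of the incoming prompt whose prefill could be skipped,
/// assuming the KV entries for `cached` are still available.
fn prefill_saved(cached: &[u32], incoming: &[u32]) -> f64 {
    if incoming.is_empty() {
        return 0.0;
    }
    shared_prefix_len(cached, incoming) as f64 / incoming.len() as f64
}

fn main() {
    // Hypothetical agent turn: 900 tokens of unchanged history,
    // followed by 100 newly appended tokens.
    let previous: Vec<u32> = (0..900).collect();
    let mut next = previous.clone();
    next.extend(10_000..10_100);

    let saved = prefill_saved(&previous, &next);
    println!("prefill skipped: {:.0}%", saved * 100.0);
}
```

Because each agent turn typically appends to an unchanged history, the shared prefix dominates the prompt, and the model sees exactly the tokens that were sent.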
If prompt compression ever returns, it belongs client-side as explicit opt-in tooling, not as gateway middleware.