feat(cortex-gateway): Rust-native context compressor for prompt token reduction #10

New Issue

grenade · 2026-06-01T05:55:45Z

grenade commented

2026-06-01 05:55:45 +00:00

Motivation

cortex-gateway proxies requests to neuron transparently — no manipulation of prompt content. For long conversations, large tool outputs, or RAG-stuffed prompts, every input token consumes prefill GPU compute and KV cache memory on the neuron side.

Headroom (https://github.com/chopratejas/headroom) demonstrates 60–95% prompt-token reduction across JSON tool outputs, code, and prose via type-aware compression. It's Python/TS only and built for the API-payment economy. We want the same compression behaviour but Rust-native, embedded in cortex-gateway, sized to a self-hosted neuron's actual cost structure (prefill compute + KV VRAM, not $/token).

Scope

A compression middleware in cortex-gateway. Before proxying to neuron:

Type detection per message: JSON tool output, source code, prose.
Per-type compressor:
- JSON: structural dedup / key compression (SmartCrusher equivalent).
- Code: AST-aware compression (preserve signatures; drop bodies of non-focal files).
- Prose: extractive summarization fallback (potentially via a small model running on neuron itself).
Reversibility: keep originals server-side (in-memory LRU on the gateway initially) so the model can retrieve(id) via a tool call.
Metrics: compression-ratio histogram per request, tagged by content type.

Open questions

Transparent middleware (every request) vs opt-in via header/config?
Originals lifetime: in-memory LRU keyed by request, or persistent store keyed by content hash?
Tool-call wiring for retrieve(): how is it declared in the OpenAI tool schema, and does it interact with helexa-acp's tool routing?
Acceptance bar: what compression ratio vs what semantic-loss tolerance is the v1 target?

Non-goals

Provider API cost reduction (neuron is self-hosted).
Cross-request prefix KV caching — tracked separately in #11.

References

Headroom: https://github.com/chopratejas/headroom
Gateway proxy path: crates/cortex-gateway/src/handlers.rs (proxy_with_metrics)
Anthropic↔OpenAI translation: crates/cortex-core/src/translate.rs

## Motivation cortex-gateway proxies requests to neuron transparently — no manipulation of prompt content. For long conversations, large tool outputs, or RAG-stuffed prompts, every input token consumes prefill GPU compute and KV cache memory on the neuron side. Headroom (https://github.com/chopratejas/headroom) demonstrates 60–95% prompt-token reduction across JSON tool outputs, code, and prose via type-aware compression. It's Python/TS only and built for the API-payment economy. We want the same compression behaviour but Rust-native, embedded in cortex-gateway, sized to a self-hosted neuron's actual cost structure (prefill compute + KV VRAM, not $/token). ## Scope A compression middleware in cortex-gateway. Before proxying to neuron: 1. **Type detection** per message: JSON tool output, source code, prose. 2. **Per-type compressor:** - JSON: structural dedup / key compression (SmartCrusher equivalent). - Code: AST-aware compression (preserve signatures; drop bodies of non-focal files). - Prose: extractive summarization fallback (potentially via a small model running on neuron itself). 3. **Reversibility:** keep originals server-side (in-memory LRU on the gateway initially) so the model can `retrieve(id)` via a tool call. 4. **Metrics:** compression-ratio histogram per request, tagged by content type. ## Open questions - Transparent middleware (every request) vs opt-in via header/config? - Originals lifetime: in-memory LRU keyed by request, or persistent store keyed by content hash? - Tool-call wiring for `retrieve()`: how is it declared in the OpenAI tool schema, and does it interact with helexa-acp's tool routing? - Acceptance bar: what compression ratio vs what semantic-loss tolerance is the v1 target? ## Non-goals - Provider API cost reduction (neuron is self-hosted). - Cross-request prefix KV caching — tracked separately in #11. ## References - Headroom: https://github.com/chopratejas/headroom - Gateway proxy path: `crates/cortex-gateway/src/handlers.rs` (`proxy_with_metrics`) - Anthropic↔OpenAI translation: `crates/cortex-core/src/translate.rs`

grenade referenced this issue

2026-06-01 05:56:26 +00:00

feat(neuron): prefix KV caching across requests #11

grenade commented

2026-06-12 08:57:47 +00:00

Closing as out-of-scope — counterproductive under the project's stability contract (README, 2026-06-12). cortex's value as the per-operator proxy is that it is transparent and predictable: what you send is what the model sees. A gateway that silently rewrites prompts — lossy compression, dropped code bodies, extractive summaries — is precisely the opaque mid-pipeline behavior helexa positions itself against, and it would make quality regressions undebuggable (is it the model, the quant, or the compressor?).

The legitimate cost problem named here (prefill compute + KV VRAM on repeated long contexts) is better attacked without semantic alteration: #11 (prefix KV caching) eliminates ~90% of repeated prefill for agent workloads and is now on the 7→8 milestone, alongside #23 (chunked prefill) and #25 (speculative decoding).

If prompt compression ever returns, it belongs client-side as explicit opt-in tooling, not as gateway middleware.

Closing as out-of-scope — counterproductive under the project's stability contract (README, 2026-06-12). cortex's value as the per-operator proxy is that it is transparent and predictable: what you send is what the model sees. A gateway that silently rewrites prompts — lossy compression, dropped code bodies, extractive summaries — is precisely the opaque mid-pipeline behavior helexa positions itself against, and it would make quality regressions undebuggable (is it the model, the quant, or the compressor?). The legitimate cost problem named here (prefill compute + KV VRAM on repeated long contexts) is better attacked without semantic alteration: #11 (prefix KV caching) eliminates ~90% of repeated prefill for agent workloads and is now on the 7→8 milestone, alongside #23 (chunked prefill) and #25 (speculative decoding). If prompt compression ever returns, it belongs client-side as explicit opt-in tooling, not as gateway middleware.

grenade added the out-of-scope label 2026-06-12 08:57:47 +00:00

grenade closed this issue

2026-06-12 08:57:47 +00:00

grenade referenced this issue

2026-06-12 08:58:14 +00:00

feat(neuron): prefix KV caching across requests #11

grenade referenced this issue

2026-06-12 09:02:07 +00:00

tracking: prioritised path to closing every open issue #27

Sign in to join this conversation.