feat(cortex-gateway): Rust-native context compressor for prompt token reduction #10

Closed
opened 2026-06-01 05:55:45 +00:00 by grenade · 1 comment
Owner

Motivation

cortex-gateway proxies requests to neuron transparently — no manipulation of prompt content. For long conversations, large tool outputs, or RAG-stuffed prompts, every input token consumes prefill GPU compute and KV cache memory on the neuron side.

Headroom (https://github.com/chopratejas/headroom) demonstrates 60–95% prompt-token reduction across JSON tool outputs, code, and prose via type-aware compression. It's Python/TS only and built for the API-payment economy. We want the same compression behaviour but Rust-native, embedded in cortex-gateway, sized to a self-hosted neuron's actual cost structure (prefill compute + KV VRAM, not $/token).

Scope

A compression middleware in cortex-gateway. Before proxying to neuron:

  1. Type detection per message: JSON tool output, source code, prose.
  2. Per-type compressor:
    • JSON: structural dedup / key compression (SmartCrusher equivalent).
    • Code: AST-aware compression (preserve signatures; drop bodies of non-focal files).
    • Prose: extractive summarization fallback (potentially via a small model running on neuron itself).
  3. Reversibility: keep originals server-side (in-memory LRU on the gateway initially) so the model can retrieve(id) via a tool call.
  4. Metrics: compression-ratio histogram per request, tagged by content type.

Open questions

  • Transparent middleware (every request) vs opt-in via header/config?
  • Originals lifetime: in-memory LRU keyed by request, or persistent store keyed by content hash?
  • Tool-call wiring for retrieve(): how is it declared in the OpenAI tool schema, and does it interact with helexa-acp's tool routing?
  • Acceptance bar: what compression ratio vs what semantic-loss tolerance is the v1 target?

Non-goals

  • Provider API cost reduction (neuron is self-hosted).
  • Cross-request prefix KV caching — tracked separately in #11.

References

  • Headroom: https://github.com/chopratejas/headroom
  • Gateway proxy path: crates/cortex-gateway/src/handlers.rs (proxy_with_metrics)
  • Anthropic↔OpenAI translation: crates/cortex-core/src/translate.rs
## Motivation cortex-gateway proxies requests to neuron transparently — no manipulation of prompt content. For long conversations, large tool outputs, or RAG-stuffed prompts, every input token consumes prefill GPU compute and KV cache memory on the neuron side. Headroom (https://github.com/chopratejas/headroom) demonstrates 60–95% prompt-token reduction across JSON tool outputs, code, and prose via type-aware compression. It's Python/TS only and built for the API-payment economy. We want the same compression behaviour but Rust-native, embedded in cortex-gateway, sized to a self-hosted neuron's actual cost structure (prefill compute + KV VRAM, not $/token). ## Scope A compression middleware in cortex-gateway. Before proxying to neuron: 1. **Type detection** per message: JSON tool output, source code, prose. 2. **Per-type compressor:** - JSON: structural dedup / key compression (SmartCrusher equivalent). - Code: AST-aware compression (preserve signatures; drop bodies of non-focal files). - Prose: extractive summarization fallback (potentially via a small model running on neuron itself). 3. **Reversibility:** keep originals server-side (in-memory LRU on the gateway initially) so the model can `retrieve(id)` via a tool call. 4. **Metrics:** compression-ratio histogram per request, tagged by content type. ## Open questions - Transparent middleware (every request) vs opt-in via header/config? - Originals lifetime: in-memory LRU keyed by request, or persistent store keyed by content hash? - Tool-call wiring for `retrieve()`: how is it declared in the OpenAI tool schema, and does it interact with helexa-acp's tool routing? - Acceptance bar: what compression ratio vs what semantic-loss tolerance is the v1 target? ## Non-goals - Provider API cost reduction (neuron is self-hosted). - Cross-request prefix KV caching — tracked separately in #11. ## References - Headroom: https://github.com/chopratejas/headroom - Gateway proxy path: `crates/cortex-gateway/src/handlers.rs` (`proxy_with_metrics`) - Anthropic↔OpenAI translation: `crates/cortex-core/src/translate.rs`
Author
Owner

Closing as out-of-scope — counterproductive under the project's stability contract (README, 2026-06-12). cortex's value as the per-operator proxy is that it is transparent and predictable: what you send is what the model sees. A gateway that silently rewrites prompts — lossy compression, dropped code bodies, extractive summaries — is precisely the opaque mid-pipeline behavior helexa positions itself against, and it would make quality regressions undebuggable (is it the model, the quant, or the compressor?).

The legitimate cost problem named here (prefill compute + KV VRAM on repeated long contexts) is better attacked without semantic alteration: #11 (prefix KV caching) eliminates ~90% of repeated prefill for agent workloads and is now on the 7→8 milestone, alongside #23 (chunked prefill) and #25 (speculative decoding).

If prompt compression ever returns, it belongs client-side as explicit opt-in tooling, not as gateway middleware.

Closing as out-of-scope — counterproductive under the project's stability contract (README, 2026-06-12). cortex's value as the per-operator proxy is that it is transparent and predictable: what you send is what the model sees. A gateway that silently rewrites prompts — lossy compression, dropped code bodies, extractive summaries — is precisely the opaque mid-pipeline behavior helexa positions itself against, and it would make quality regressions undebuggable (is it the model, the quant, or the compressor?). The legitimate cost problem named here (prefill compute + KV VRAM on repeated long contexts) is better attacked without semantic alteration: #11 (prefix KV caching) eliminates ~90% of repeated prefill for agent workloads and is now on the 7→8 milestone, alongside #23 (chunked prefill) and #25 (speculative decoding). If prompt compression ever returns, it belongs client-side as explicit opt-in tooling, not as gateway middleware.
grenade added the out-of-scope label 2026-06-12 08:57:47 +00:00
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: helexa/helexa#10