# Context-window & token-limit settings

How the numeric knobs that govern usable context fit together, what the valid ranges are, and where they live. Getting these out of sync is the difference between "the agent has room to think" and "it compacts every few turns and reasons from a corrupted summary."

The tables below document the **manual** reasoning that [#62](https://git.lair.cafe/helexa/helexa/issues/62) and then [#67](https://git.lair.cafe/helexa/helexa/issues/67) automate. As of #67 the neuron **computes** this limit itself from model architecture + live VRAM + a self-measured throughput ceiling and advertises it on `GET /models`; operators no longer hand-derive it. Read the rules below as the *why* behind the derivation; see [After #67](#after-67-the-neuron-computes-its-own-limit) for what the daemon now does automatically.

## The knobs

| Knob | Where | What it bounds |
|---|---|---|
| `max_position_embeddings` | model `config.json` (fixed per model) | the model's native context ceiling (the quality wall) |
| `NEURON_MAX_PROMPT_TOKENS` | neuron systemd drop-in (env) | hard **prompt** cap; neuron rejects larger prompts with `400 context_length_exceeded` before any device work |
| `NEURON_MIN_FREE_VRAM_MB` | neuron systemd drop-in (env, default 1500) | static free-VRAM floor below which prefill is refused (`503 service_unavailable` / `InsufficientVram`) |
| request `max_tokens` | per request; neuron default 8192 | generation length; KV grows by prompt **+** generation |
| `limit.context` | `opencode.json` `provider.models.<model>.limit` | the wall opencode tracks for compaction % |
| `limit.input` | same | compaction trigger; opencode compacts to keep the prompt at or under this |
| `limit.output` | same | generation reserve opencode leaves below the wall |

## How they must relate

For a single model on a single neuron, all of these must hold:

```
1. limit.input + limit.output ≤ limit.context              (opencode internal; convention: input = context − output)
2. limit.context ≤ max_position_embeddings                 (model quality wall)
3. limit.input ≤ NEURON_MAX_PROMPT_TOKENS                  (else neuron 400s a prompt opencode thought was fine)
4. NEURON_MAX_PROMPT_TOKENS + max_tokens ≤ max_position_embeddings
5. KV(limit.context)/card + activation + NEURON_MIN_FREE_VRAM_MB ≤ free VRAM on the tightest card
```

Notes:

- **Keep a margin on rule 3.** Set `NEURON_MAX_PROMPT_TOKENS` a bit above `limit.input` (e.g. one `output`-worth) so token-counting differences between the opencode and neuron tokenizers don't trip a spurious 400 mid-session.
- **Convention:** mirror `limit.context` to `NEURON_MAX_PROMPT_TOKENS` and set `limit.input = context − output`. opencode then compacts to keep the prompt one `output` below the neuron wall, so there is always generation headroom under the cap.
- **Rule 5 is the one with teeth at scale.** Today only the *static* floor (`NEURON_MIN_FREE_VRAM_MB`) guards the text path; it does **not** scale with prompt length. A long-but-under-cap prompt can clear the floor and then OOM mid-prefill (poisoning the device context). Tracked in [#65](https://git.lair.cafe/helexa/helexa/issues/65). Until that lands, treat the VRAM-safe ceiling in rule 5 as a hard limit you set `NEURON_MAX_PROMPT_TOKENS` below, not something the daemon enforces.

## VRAM cost of context (Qwen3.6-27B on beast)

Qwen3.6-27B is a hybrid linear-attention model: of its 64 layers only every 4th is full-attention (`full_attention_interval = 4` → **16** full-attn layers); the rest are `linear_attention` with constant-size recurrent state. KV cache grows **only** on the 16 full-attn layers (GQA, `num_key_value_heads = 4`, `head_dim = 256`, F16):

```
kv_per_token (total)          = 2 (K+V) × 16 layers × 4 kv_heads × 256 head_dim × 2 B
                              = 65536 B = 64 KiB/token
kv_per_token (per card, TP=2) = 32 KiB/token
```

beast = 2× RTX 5090 (32607 MiB each).
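
The per-token arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, not code from the neuron: the function names are hypothetical, and it assumes the KV heads split evenly across tensor-parallel ranks, as the `TP=2` line implies.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(full_attn_layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int, tp: int = 1) -> int:
    """KV-cache bytes that one token adds on each card.

    Only full-attention layers hold a growing KV cache; the factor 2
    accounts for storing both K and V.
    """
    if kv_heads % tp != 0:
        raise ValueError("kv_heads must divide evenly across TP ranks")
    return 2 * full_attn_layers * (kv_heads // tp) * head_dim * dtype_bytes


def kv_gib_per_card(context_tokens: int, bytes_per_token_per_card: int) -> float:
    """KV-cache size per card, in GiB, for a fully used context window."""
    return context_tokens * bytes_per_token_per_card / GIB


# Qwen3.6-27B figures quoted above: 16 full-attn layers, 4 KV heads,
# head_dim 256, F16 (2 bytes).
total = kv_bytes_per_token(16, 4, 256, 2)            # all cards combined
per_card = kv_bytes_per_token(16, 4, 256, 2, tp=2)   # one of two TP ranks

for ctx in (49152, 131072, 196608, 262144):
    print(f"{ctx:>7} tokens -> {kv_gib_per_card(ctx, per_card):.1f} GiB KV per card")
```

Swapping in another model's layer count, head count and dtype gives the per-card cost for that architecture.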
KV per card and headroom against the **measured idle free of the tighter card (GPU 1: 9254 MiB)**:

| `limit.context` | KV / card | Free left on GPU 1 (after KV) | Verdict |
|---|---|---|---|
| 49152 (48k, prior default) | ~1.5 GiB | ~7.5 GiB (7718 MiB) | very safe |
| **131072 (128k, recommended)** | **~4.0 GiB** | **~5.0 GiB (5158 MiB)** | **safe** |
| 196608 (192k, stretch) | ~6.0 GiB | ~3.0 GiB (3110 MiB) | plausible; wants #65 guard |
| 262144 (256k, model max) | ~8.0 GiB | ~1.0 GiB (1062 MiB) | unsafe at current free (under the 1500 MiB floor) |

`ConcatKvCache` is lazy (it allocates nothing at idle and resets between requests), so raising the cap costs nothing until a session actually uses the longer window. The numbers above are upper bounds at *measured idle free*; real usable headroom is lower under fragmentation and whatever else is resident. Leave margin.

Reaching 256k (or running concurrent long sessions) needs more free VRAM than this load leaves. Getting there would take KV quantization or a fixed/paged KV allocator; neither is required for 128k.

## Recommended profile: 128k

**neuron**: `NEURON_MAX_PROMPT_TOKENS` is **deploy-managed**, not hand-edited. It lives in the `deploy-neurons` matrix in `.gitea/workflows/deploy.yml` (`max_prompt_tokens` per host) and is written to `/etc/systemd/system/neuron.service.d/model.conf` on each run. A change to that value restarts the neuron **even when no new RPM ships** (the deploy gates on package version *or* drop-in change), so the cap rolls out alongside the rest of the service config.
To change it, edit the matrix value and let the deploy apply it:

```yaml
# .gitea/workflows/deploy.yml → jobs.deploy-neurons.strategy.matrix.include
- host: beast.hanzalova.internal
  flavour: blackwell
  load_timeout: 900
  max_prompt_tokens: 131072
```

The drop-in it writes:

```ini
# /etc/systemd/system/neuron.service.d/model.conf (managed by deploy.yml)
[Service]
Environment=NEURON_MAX_PROMPT_TOKENS=131072
```

Verify after a deploy:

```sh
curl -s http://beast:13131/discovery | jq .max_prompt_tokens   # expect 131072
```

`model.conf` sorts after any manual `local.conf`, so the deploy-managed value wins over a hand override of the same variable. Use `local.conf` only for genuinely host-local, transient experiments, and remember that a later deploy will re-assert `model.conf`.

**opencode**: `opencode.json`, `provider.models."Qwen/Qwen3.6-27B".limit`:

```json
{ "context": 131072, "input": 122880, "output": 8192 }
```

(`input = context − output = 131072 − 8192`; `NEURON_MAX_PROMPT_TOKENS` 131072 sits one `output` above `input`, the tokenizer-drift margin.)

## After #62: single source of truth (superseded by #67)

[#62](https://git.lair.cafe/helexa/helexa/issues/62) moved `limit { context, input, output }` (and `cost`) onto `GET /models`, sourced from the operator-declared catalogue (`models.toml`). That was the right plumbing but the wrong *source*: a per-model catalogue limit goes stale the moment cortex hot-swaps a neuron's resident model, and forces the hand-tuning fight (the tables above) to be re-run on every change.

## After #67: the neuron computes its own limit

[#67](https://git.lair.cafe/helexa/helexa/issues/67) makes the limit a **computed function of live state**, not an operator-declared fact.
Per loaded model, the neuron derives:

```
output             = output_reserve_tokens                       (config; default 8192)
kv/token/card      = 2(K+V) · n_full_attn_layers · (n_kv_heads / tp) · head_dim · dtype_bytes
vram_ceiling       = (free_tightest − activation_headroom − min_free_floor) / kv_per_token_per_card
throughput_ceiling = target_prefill_latency_secs · measured_prefill_tok_per_sec
context            = min(max_position_embeddings, vram_ceiling, throughput_ceiling)
                     clamped by NEURON_MAX_PROMPT_TOKENS only if explicitly set (backstop)
input              = context − output
```

- `free_tightest` is the **minimum free VRAM across the model's devices** (the tightest card, often a non-leader TP rank).
- `measured_prefill_tok_per_sec` is **self-measured** (an EMA over real requests; a configured bootstrap until the first sample). Because it reads live state, the advertised `limit` **rises automatically** as prefix caching (#11) or other efficiency work frees VRAM or speeds prefill, with no operator action.
- Knobs live in `[harness.candle.context_limit]` (see `neuron.example.toml`). The catalogue `limit` is **no longer consulted** (the field is inert/deprecated); `cost` stays operator-set in the catalogue.
- **opencode**: remove any hand-entered `limit` block from `opencode.json`; discovery is authoritative.

`NEURON_MAX_PROMPT_TOKENS` is demoted from authority to an **optional clamp-only backstop** (applied only when explicitly set). The deploy-managed drop-in still pins a per-host ceiling, but the derivation binds below it in practice.

The request path **enforces the derived cap**: a prompt is rejected with `PromptTooLong` when it exceeds the model's computed `input` budget (refreshed on every `/models` poll), not the static `NEURON_MAX_PROMPT_TOKENS`. A VRAM-tight host therefore rejects an over-budget prompt up front instead of OOMing mid-prefill. Before the first derivation (or for an arch without a context profile) it falls back to the static cap.
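
The derivation above can be traced end to end with a short Python sketch. It is an illustration under stated assumptions, not the neuron's implementation: `derive_limit` and `Limit` are hypothetical names, the rounding and the clamp at zero are guesses, and the activation headroom, latency target and prefill rate in the example are made-up inputs. Only the 9254 MiB free figure, the 1500 MiB floor, the 32 KiB/token cost and the 8192-token reserve come from this document.

```python
from dataclasses import dataclass
from typing import Optional

MIB = 1024 * 1024


@dataclass
class Limit:
    context: int
    input: int
    output: int


def derive_limit(*, max_position_embeddings: int, free_tightest_mib: int,
                 activation_headroom_mib: int, min_free_floor_mib: int,
                 kv_bytes_per_token_per_card: int,
                 target_prefill_latency_secs: float,
                 measured_prefill_tok_per_sec: float,
                 output_reserve_tokens: int = 8192,
                 max_prompt_tokens: Optional[int] = None) -> Limit:
    # VRAM ceiling: tokens of KV that fit in what is left on the tightest card.
    usable = (free_tightest_mib - activation_headroom_mib - min_free_floor_mib) * MIB
    vram_ceiling = max(usable, 0) // kv_bytes_per_token_per_card
    # Throughput ceiling: tokens that can be prefilled within the latency target.
    throughput_ceiling = int(target_prefill_latency_secs * measured_prefill_tok_per_sec)
    context = min(max_position_embeddings, vram_ceiling, throughput_ceiling)
    # The static cap is a backstop, applied only when explicitly set.
    if max_prompt_tokens is not None:
        context = min(context, max_prompt_tokens)
    return Limit(context=context,
                 input=max(context - output_reserve_tokens, 0),
                 output=output_reserve_tokens)


common = dict(
    max_position_embeddings=262144,
    free_tightest_mib=9254,
    activation_headroom_mib=2048,          # assumption, illustrative only
    min_free_floor_mib=1500,
    kv_bytes_per_token_per_card=32768,
    target_prefill_latency_secs=60.0,      # assumption, illustrative only
    measured_prefill_tok_per_sec=3000.0,   # assumption, illustrative only
)
unclamped = derive_limit(**common)
clamped = derive_limit(**common, max_prompt_tokens=131072)
print(unclamped)
print(clamped)
```

With these inputs the throughput ceiling binds in the unclamped case and the static backstop binds in the clamped one, which shows why the advertised limit moves as free VRAM or measured prefill speed changes.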
## Operational note

`GET /discovery` reports the live `max_prompt_tokens` the running neuron process actually uses; check it rather than assuming the drop-in took effect. A drop-in change only applies after `daemon-reload` + a neuron restart, which the deploy performs. If `/discovery` doesn't match the `max_prompt_tokens` in the deploy matrix, the host hasn't been re-deployed since the value changed (or a higher-sorting drop-in is overriding it). Re-run the `deploy` workflow to reconcile.
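
When the live value from `/discovery` is in hand, rules 1 to 4 from "How they must relate" can be checked mechanically against a hand-entered `limit` block. The sketch below is a hypothetical helper, not part of helexa or opencode; it takes plain integers so it does no network I/O, and it leaves out rule 5 because that needs live VRAM figures.

```python
def check_limits(*, context: int, limit_input: int, output: int,
                 max_prompt_tokens: int, max_tokens: int,
                 max_position_embeddings: int) -> list[str]:
    """Return a description of every violated relation (empty list means consistent)."""
    problems = []
    if limit_input + output > context:
        problems.append("rule 1: limit.input + limit.output exceeds limit.context")
    if context > max_position_embeddings:
        problems.append("rule 2: limit.context exceeds max_position_embeddings")
    if limit_input > max_prompt_tokens:
        problems.append("rule 3: limit.input exceeds NEURON_MAX_PROMPT_TOKENS")
    if max_prompt_tokens + max_tokens > max_position_embeddings:
        problems.append("rule 4: prompt cap + max_tokens exceeds max_position_embeddings")
    return problems


# The recommended 128k profile, with max_prompt_tokens as reported by /discovery.
ok = check_limits(context=131072, limit_input=122880, output=8192,
                  max_prompt_tokens=131072, max_tokens=8192,
                  max_position_embeddings=262144)

# Same opencode limits, but the host still runs the prior 49152 cap.
stale = check_limits(context=131072, limit_input=122880, output=8192,
                     max_prompt_tokens=49152, max_tokens=8192,
                     max_position_embeddings=262144)
print(ok)
print(stale)
```

The second call models a host that was not re-deployed: opencode would keep sending prompts the neuron rejects, which is the mismatch the `/discovery` check is meant to catch.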