helexa/helexa

rob thijssen 4b28a64b34

feat(#67 phase 5b): enforce the derived input as the prompt cap

The request path now rejects prompts above the model's self-derived input
budget, not the static NEURON_MAX_PROMPT_TOKENS — so a VRAM-tight host
(where the VRAM ceiling binds below the static cap) rejects an
over-budget prompt up front instead of accepting it and OOMing
mid-prefill.

- derived_input_cap: AtomicUsize on LoadedModel + TpLoadedModel; refreshed
  by LoadedHandle::derived_limit (runs on every /models poll). 0 = not
  derived yet.
- effective_prompt_cap(): cached derived input when >0, else the static
  max_prompt_tokens() (cold-start / no-profile fallback).
- validate_request takes the cap as a param; all 4 call sites
  (chat_completion, inference_stream, inference_tp_stream, TP
  chat_completion) pass the in-scope model's effective_prompt_cap().
- doc/context-limits.md: enforcement note updated from "remaining" to
  landed.

Reads the cap lock-free from the sync validate path (no per-request VRAM
query); the cap tracks live state via the poll-driven derivation. With
this, advertise and enforce agree and both track the resident model.

fmt/clippy/test green; CUDA paths type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-17 14:26:37 +03:00

Context-window & token-limit settings

How the numeric knobs that govern usable context fit together, what the valid ranges are, and where they live. Getting these out of sync is the difference between "the agent has room to think" and "it compacts every few turns and reasons from a corrupted summary."

The tables below document the manual reasoning that #62 and then #67 automate. As of #67 the neuron computes the context limit itself from model architecture + live VRAM + a self-measured throughput ceiling and advertises it on GET /models; operators no longer hand-derive it. Read the rules below as the why behind the derivation; the section "After #67: the neuron computes its own limit" describes what the daemon now does automatically.

The knobs

| Knob | Where | What it bounds |
| --- | --- | --- |
| `max_position_embeddings` | model `config.json` (fixed per model) | the model's native context ceiling (the quality wall) |
| `NEURON_MAX_PROMPT_TOKENS` | neuron systemd drop-in (env) | hard prompt cap; neuron rejects larger prompts with `400 context_length_exceeded` before any device work |
| `NEURON_MIN_FREE_VRAM_MB` | neuron systemd drop-in (env, default 1500) | static free-VRAM floor below which prefill is refused (`503 service_unavailable` / `InsufficientVram`) |
| request `max_tokens` | per request; neuron default 8192 | generation length; KV grows by prompt + generation |
| `limit.context` | `opencode.json` `provider.models.<id>.limit` | the wall opencode tracks for compaction % |
| `limit.input` | same | compaction trigger: opencode compacts to keep the prompt at/under this |
| `limit.output` | same | generation reserve opencode leaves below the wall |

How they must relate

For a single model on a single neuron, all of these must hold:

1.  limit.input + limit.output  ≤  limit.context          (opencode internal; convention: input = context − output)
2.  limit.context               ≤  max_position_embeddings (model quality wall)
3.  limit.input                 ≤  NEURON_MAX_PROMPT_TOKENS (else neuron 400s a prompt opencode thought was fine)
4.  NEURON_MAX_PROMPT_TOKENS + max_tokens  ≤  max_position_embeddings
5.  KV(limit.context)/card + activation + NEURON_MIN_FREE_VRAM_MB  ≤  free VRAM on the tightest card

Notes:

- Keep a margin on rule 3. Set NEURON_MAX_PROMPT_TOKENS a bit above limit.input (e.g. one output-worth) so that tokenizer counting differences between opencode and neuron don't trip a spurious 400 mid-session.
- Convention: mirror limit.context to NEURON_MAX_PROMPT_TOKENS and set limit.input = context − output. opencode then compacts to keep the prompt one output below the neuron wall, so there is always generation headroom under the cap.
- Rule 5 is the one with teeth at scale. Today only the static floor (NEURON_MIN_FREE_VRAM_MB) guards the text path; it does not scale with prompt length. A long-but-under-cap prompt can clear the floor and then OOM mid-prefill (poisoning the device context). This is tracked in #65; until that lands, treat the VRAM-safe ceiling in rule 5 as a hard limit that you set NEURON_MAX_PROMPT_TOKENS below, not something the daemon enforces.
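Rules 1 to 4 are pure token arithmetic, so they can be checked mechanically before a config change ships. The following is a minimal sketch, not part of the neuron or opencode codebases: the `Limits` class and `check_token_rules` function are hypothetical names, and the sample values are the 128k profile described in this document. Rule 5 is left out because it depends on measured VRAM rather than on the knobs alone.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Token-limit knobs for one model on one neuron (all values in tokens)."""
    context: int                   # opencode limit.context
    input: int                     # opencode limit.input
    output: int                    # opencode limit.output
    max_position_embeddings: int   # model config.json
    neuron_max_prompt_tokens: int  # NEURON_MAX_PROMPT_TOKENS
    max_tokens: int                # per-request generation length


def check_token_rules(lim: Limits) -> list[str]:
    """Return one message per violated rule (rules 1-4); empty means consistent."""
    violations = []
    if lim.input + lim.output > lim.context:
        violations.append("rule 1: limit.input + limit.output > limit.context")
    if lim.context > lim.max_position_embeddings:
        violations.append("rule 2: limit.context > max_position_embeddings")
    if lim.input > lim.neuron_max_prompt_tokens:
        violations.append("rule 3: limit.input > NEURON_MAX_PROMPT_TOKENS")
    if lim.neuron_max_prompt_tokens + lim.max_tokens > lim.max_position_embeddings:
        violations.append(
            "rule 4: NEURON_MAX_PROMPT_TOKENS + max_tokens > max_position_embeddings"
        )
    return violations


# The 128k profile for Qwen3.6-27B (model max 262144) passes all four checks.
profile_128k = Limits(
    context=131072, input=122880, output=8192,
    max_position_embeddings=262144,
    neuron_max_prompt_tokens=131072, max_tokens=8192,
)
print(check_token_rules(profile_128k))  # prints []
```

Returning a list of messages instead of raising on the first failure makes it easy to report every mismatch in one pass.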

VRAM cost of context (Qwen3.6-27B on beast)

Qwen3.6-27B is a hybrid linear-attention model: of its 64 layers only every 4th is full-attention (full_attention_interval = 4 → 16 full-attn layers); the rest are linear_attention with constant-size recurrent state. KV cache grows only on the 16 full-attn layers (GQA, num_key_value_heads = 4, head_dim = 256, F16):

kv_per_token (total) = 2 (K+V) × 16 layers × 4 kv_heads × 256 head_dim × 2 B = 65536 B = 64 KiB/token
kv_per_token (per card, TP=2)                                                            = 32 KiB/token
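The per-token figure above can be reproduced with a few lines of arithmetic. This is an illustrative sketch only; `kv_bytes_per_token` is a hypothetical helper, and it assumes KV heads are split evenly across tensor-parallel ranks.

```python
def kv_bytes_per_token(full_attn_layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int, tp: int = 1) -> int:
    """Bytes of KV cache added per token, on each of `tp` cards."""
    if kv_heads % tp != 0:
        raise ValueError("kv_heads must divide evenly across TP ranks")
    # Factor 2 covers the K and V tensors; only full-attention layers hold KV.
    return 2 * full_attn_layers * (kv_heads // tp) * head_dim * dtype_bytes


# Qwen3.6-27B: 64 layers, every 4th is full attention, GQA with 4 KV heads, F16.
full_attn_layers = 64 // 4
total = kv_bytes_per_token(full_attn_layers, kv_heads=4, head_dim=256, dtype_bytes=2)
per_card = kv_bytes_per_token(full_attn_layers, kv_heads=4, head_dim=256,
                              dtype_bytes=2, tp=2)
print(total // 1024, per_card // 1024)  # prints: 64 32
```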

beast = 2× RTX 5090 (32607 MiB each). KV per card and headroom against the measured idle free of the tighter card (GPU 1: 9254 MiB):

| `limit.context` | KV / card | Free left on GPU 1 (after KV) | Verdict |
| --- | --- | --- | --- |
| 49152 (48k, prior default) | 1.5 GiB | 7718 MiB | very safe |
| 131072 (128k, recommended) | 4.0 GiB | 5158 MiB | safe |
| 196608 (192k, stretch) | 6.0 GiB | 3110 MiB | plausible; wants #65 guard |
| 262144 (256k, model max) | 8.0 GiB | 1062 MiB | unsafe at current free (under the 1500 MiB floor) |

ConcatKvCache is lazy — it allocates nothing at idle and resets between requests — so raising the cap costs zero until a session actually uses the longer window. The numbers above are upper bounds at measured idle free; real usable headroom is lower under fragmentation and whatever else is resident. Leave margin.
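The headroom arithmetic can be re-run for any candidate context size. The sketch below uses only figures stated in this section (32 KiB of KV per token per card, 9254 MiB idle free on GPU 1, the default 1500 MiB floor); the function names are hypothetical. Like the table, it counts KV only and ignores activation memory and fragmentation, so treat the results as upper bounds.

```python
MIB = 1024 * 1024
KV_BYTES_PER_TOKEN_PER_CARD = 32 * 1024  # Qwen3.6-27B at TP=2
IDLE_FREE_MIB = 9254                     # measured idle free on the tighter card
MIN_FREE_FLOOR_MIB = 1500                # NEURON_MIN_FREE_VRAM_MB default


def kv_mib(context_tokens: int) -> int:
    """KV cache size per card, in MiB, for a fully used context window."""
    return context_tokens * KV_BYTES_PER_TOKEN_PER_CARD // MIB


def free_after_kv_mib(context_tokens: int) -> int:
    """Free VRAM left on the tighter card once the KV cache is full."""
    return IDLE_FREE_MIB - kv_mib(context_tokens)


for ctx in (49152, 131072, 196608, 262144):
    left = free_after_kv_mib(ctx)
    status = "below floor" if left < MIN_FREE_FLOOR_MIB else "above floor"
    print(f"{ctx:>7} tokens: KV {kv_mib(ctx)} MiB, free {left} MiB ({status})")
```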

Reaching 256k (or running concurrent long sessions) needs more free VRAM than this load leaves. Getting there would take KV quantization or a fixed/paged KV allocator; neither is required for 128k.

Recommended profile: 128k

neuron — NEURON_MAX_PROMPT_TOKENS is deploy-managed, not hand-edited. It lives in the deploy-neurons matrix in .gitea/workflows/deploy.yml (max_prompt_tokens per host) and is written to /etc/systemd/system/neuron.service.d/model.conf on each run. A change to that value restarts the neuron even when no new RPM ships (the deploy gates on package version or drop-in change), so the cap rolls out alongside the rest of the service config. To change it, edit the matrix value and let the deploy apply it:

# .gitea/workflows/deploy.yml → jobs.deploy-neurons.strategy.matrix.include
- host: beast.hanzalova.internal
  flavour: blackwell
  load_timeout: 900
  max_prompt_tokens: 131072

The drop-in it writes:

# /etc/systemd/system/neuron.service.d/model.conf  (managed by deploy.yml)
[Service]
Environment=NEURON_MAX_PROMPT_TOKENS=131072

Verify after a deploy:

curl -s http://beast:13131/discovery | jq .max_prompt_tokens   # expect 131072

model.conf sorts after any manual local.conf, so the deploy-managed value wins over a hand override of the same variable. Use local.conf only for genuinely host-local, transient experiments — and remember a later deploy will re-assert model.conf.

opencode — opencode.json, provider.models."Qwen/Qwen3.6-27B".limit:

{ "context": 131072, "input": 122880, "output": 8192 }

(input = context − output = 131072 − 8192; NEURON_MAX_PROMPT_TOKENS 131072 sits one output above input, the tokenizer-drift margin.)
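The convention (input = context − output) is simple enough to generate instead of typing by hand. A minimal sketch, assuming the default 8192-token output reserve; `opencode_limit` is a hypothetical helper, not an opencode API.

```python
import json


def opencode_limit(context: int, output_reserve: int = 8192) -> dict:
    """Build an opencode `limit` block using input = context - output."""
    if output_reserve >= context:
        raise ValueError("output reserve must be smaller than the context window")
    return {
        "context": context,
        "input": context - output_reserve,
        "output": output_reserve,
    }


print(json.dumps(opencode_limit(131072)))
# prints: {"context": 131072, "input": 122880, "output": 8192}
```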

After #62: single source of truth (superseded by #67)

#62 moved limit { context, input, output } (and cost) onto GET /models, sourced from the operator-declared catalogue (models.toml). That was the right plumbing but the wrong source: a per-model catalogue limit goes stale the moment cortex hot-swaps a neuron's resident model, and forces the hand-tuning fight (the tables above) to be re-run on every change.

After #67: the neuron computes its own limit

#67 makes the limit a computed function of live state, not an operator-declared fact. Per loaded model, the neuron derives:

output                = output_reserve_tokens                       (config; default 8192)
kv_per_token_per_card = 2 (K+V) · n_full_attn_layers · (n_kv_heads / tp) · head_dim · dtype_bytes
vram_ceiling          = (free_tightest − activation_headroom − min_free_floor) / kv_per_token_per_card
throughput_ceiling    = target_prefill_latency_secs · measured_prefill_tok_per_sec
context               = min(max_position_embeddings, vram_ceiling, throughput_ceiling),
                        clamped by NEURON_MAX_PROMPT_TOKENS only if explicitly set (backstop)
input                 = context − output

- free_tightest is the minimum free VRAM across the model's devices: the tightest card, often a non-leader TP rank.
- measured_prefill_tok_per_sec is self-measured (an EMA over real requests; a configured bootstrap until the first sample). Because it reads live state, the advertised limit rises automatically as prefix caching (#11) or other efficiency work frees VRAM or speeds up prefill, with no operator action.
- Knobs live in [harness.candle.context_limit] (see neuron.example.toml). The catalogue limit is no longer consulted (the field is inert/deprecated); cost stays operator-set in the catalogue.
- opencode: remove any hand-entered limit block from opencode.json, since discovery is authoritative.
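The derivation above can be sketched as a single function. This is an illustration of the formulas, not the neuron's implementation (which is not shown in this document): `derive_limit` and its parameter names are hypothetical, and the activation headroom, target latency, and prefill throughput used in the example are made-up inputs chosen only to exercise the code.

```python
from typing import Optional

MIB = 1024 * 1024


def derive_limit(max_position_embeddings: int,
                 free_tightest_bytes: int,
                 activation_headroom_bytes: int,
                 min_free_floor_bytes: int,
                 kv_bytes_per_token_per_card: int,
                 target_prefill_latency_secs: float,
                 measured_prefill_tok_per_sec: float,
                 output_reserve_tokens: int = 8192,
                 max_prompt_tokens_clamp: Optional[int] = None) -> dict:
    """Compute {context, input, output} from live state, per the formulas above."""
    usable = max(0, free_tightest_bytes - activation_headroom_bytes - min_free_floor_bytes)
    vram_ceiling = usable // kv_bytes_per_token_per_card
    throughput_ceiling = int(target_prefill_latency_secs * measured_prefill_tok_per_sec)
    context = min(max_position_embeddings, vram_ceiling, throughput_ceiling)
    if max_prompt_tokens_clamp is not None:  # backstop, only when explicitly set
        context = min(context, max_prompt_tokens_clamp)
    return {
        "context": context,
        "input": max(0, context - output_reserve_tokens),
        "output": output_reserve_tokens,
    }


# Hypothetical inputs: 2048 MiB activation headroom, 120 s target, 1500 tok/s.
limit = derive_limit(
    max_position_embeddings=262144,
    free_tightest_bytes=9254 * MIB,
    activation_headroom_bytes=2048 * MIB,
    min_free_floor_bytes=1500 * MIB,
    kv_bytes_per_token_per_card=32 * 1024,
    target_prefill_latency_secs=120.0,
    measured_prefill_tok_per_sec=1500.0,
)
print(limit)  # throughput binds here: context 180000, input 171808
```

With these inputs the VRAM ceiling is 182592 tokens and the throughput ceiling is 180000, so throughput is the binding constraint. Passing `max_prompt_tokens_clamp=131072` would lower the context to 131072.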

NEURON_MAX_PROMPT_TOKENS is demoted from authority to an optional clamp-only backstop (applied only when explicitly set). The deploy-managed drop-in still pins a per-host ceiling, but the derivation binds below it in practice.

The request path enforces the derived cap: a prompt is rejected with PromptTooLong when it exceeds the model's computed input budget (refreshed on every /models poll), not the static NEURON_MAX_PROMPT_TOKENS — so a VRAM-tight host rejects an over-budget prompt up front instead of OOMing mid-prefill. Before the first derivation (or for an arch without a context profile) it falls back to the static cap.
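The fallback behaviour can be modelled in a few lines. The names `derived_input_cap`, `effective_prompt_cap`, and `validate_request` come from the commit message for this change; everything else here (the Python class, `refresh_derived_cap`, the exception type) is a hypothetical stand-in. The real code is Rust and stores the cap in an AtomicUsize so the sync validate path can read it lock-free.

```python
class PromptTooLong(Exception):
    """Raised when a prompt exceeds the effective cap."""


class LoadedModel:
    def __init__(self, static_max_prompt_tokens: int):
        self.static_max_prompt_tokens = static_max_prompt_tokens
        self.derived_input_cap = 0  # 0 = not derived yet

    def refresh_derived_cap(self, derived_input: int) -> None:
        """Called from the poll-driven derivation (hypothetical name)."""
        self.derived_input_cap = derived_input

    def effective_prompt_cap(self) -> int:
        """Cached derived input when > 0, else the static cap (cold start)."""
        if self.derived_input_cap > 0:
            return self.derived_input_cap
        return self.static_max_prompt_tokens


def validate_request(prompt_tokens: int, cap: int) -> None:
    """Reject over-budget prompts before any device work."""
    if prompt_tokens > cap:
        raise PromptTooLong(f"prompt of {prompt_tokens} tokens exceeds cap of {cap}")


model = LoadedModel(static_max_prompt_tokens=131072)
validate_request(130000, model.effective_prompt_cap())  # accepted: static fallback
model.refresh_derived_cap(122880)                       # a poll derives the input budget
try:
    validate_request(130000, model.effective_prompt_cap())
except PromptTooLong as err:
    print(err)
```

Passing the cap in as a parameter keeps the validation function free of any per-request VRAM query; it only compares two integers.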

Operational note

GET /discovery reports the live max_prompt_tokens the running neuron process actually uses — check it rather than assuming the drop-in took effect. A drop-in change only applies after daemon-reload + a neuron restart, which the deploy performs; if /discovery doesn't match the max_prompt_tokens in the deploy matrix, the host hasn't been re-deployed since the value changed (or a higher-sorting drop-in is overriding it). Re-run the deploy workflow to reconcile.
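The comparison can be scripted once the /discovery body is in hand. A minimal sketch, assuming the response is a JSON object with a top-level max_prompt_tokens field (as the jq command earlier in this document implies); `discovery_matches` is a hypothetical helper, and fetching the body is left to whatever HTTP client is already in use.

```python
import json


def discovery_matches(discovery_body: str, expected_max_prompt_tokens: int) -> bool:
    """True when the live cap in a /discovery response equals the deploy-matrix value."""
    live = json.loads(discovery_body).get("max_prompt_tokens")
    return live == expected_max_prompt_tokens


# Example bodies; a real one would come from GET /discovery on the host.
print(discovery_matches('{"max_prompt_tokens": 131072}', 131072))  # prints True
print(discovery_matches('{"max_prompt_tokens": 49152}', 131072))   # prints False
```

A False result means the host has not been re-deployed since the matrix value changed, or another drop-in is overriding it.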
