feat(neuron): chunked prefill + VRAM/prompt-length pre-flight checks

Prevents the OOM-during-prefill → poisoned-context → 5-minute-reload
cycle observed on beast under agent-zero workloads. Three changes, all
keyed off env-driven knobs so an operator can tune without a rebuild:

1. Chunked prefill (NEURON_PREFILL_CHUNK_TOKENS, default 512).
   The initial forward is split into N-token windows, each with a
   monotonically growing offset. KV cache accumulates across chunks
   exactly as it would under one big prefill; only the final chunk's
   logits are kept for sampling. Activation memory now scales with
   chunk size instead of prompt length, so a 13k-token prompt stops
   holding tens of GB of intermediate activations live at once.

   Wired into all six prefill call sites:
   - run_inference / run_inference_streaming (CPU path)
   - run_inference_via_worker / stream_inference_via_worker
     (CUDA single-GPU through device worker)
   - chat_completion_tp_inner / chat_completion_tp_stream
     (TP via WorkerPool)

   Three helpers (chunked_prefill_local, chunked_prefill_via_worker,
   chunked_prefill_tp) own the loop shape so the chunking semantics
   stay identical across paths. Per-chunk debug log shows progress.

2. Max prompt length (NEURON_MAX_PROMPT_TOKENS, default 16384).
   Requests above the cap return a structured 400 with
   `code: prompt_too_long` rather than going through the prefill and
   discovering the limit by OOMing partway through.
   New InferenceError::PromptTooLong variant.

3. Minimum free VRAM gate (NEURON_MIN_FREE_VRAM_MB, default 1500).
   If `vram_free_mb` is below the threshold at request start (e.g.
   another concurrent request is mid-prefill), reject with a clean
   503 + `code: insufficient_vram` rather than starting work that will
   OOM. New InferenceError::InsufficientVram variant. CPU loads
   (vram=0 sentinel) skip this check.

All three gates fire BEFORE any device work, so a rejected request
costs ~one tokenisation pass and never touches the worker thread;
poison cascades from rejected work are now impossible.

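For readers who want the loop shape of change 1 without opening the diff, here is a minimal sketch. It is not the code from this commit: `chunk_windows` and `chunked_prefill` are hypothetical names, and the `forward(tokens, offset)` closure stands in for the real forward pass, which is assumed to append to the KV cache and return that chunk's logits.

```rust
/// Hypothetical helper: split `prompt_len` tokens into (offset, len)
/// windows of at most `chunk` tokens, with strictly increasing offsets.
fn chunk_windows(prompt_len: usize, chunk: usize) -> Vec<(usize, usize)> {
    let chunk = chunk.max(1); // guard: step_by(0) would panic
    (0..prompt_len)
        .step_by(chunk)
        .map(|offset| (offset, chunk.min(prompt_len - offset)))
        .collect()
}

/// Illustrative chunked prefill. `forward` is an assumed stand-in for the
/// model call; it receives one window of tokens plus that window's offset.
/// Returns the result of the final chunk only, or None for an empty prompt.
fn chunked_prefill<L>(
    tokens: &[u32],
    chunk: usize,
    mut forward: impl FnMut(&[u32], usize) -> L,
) -> Option<L> {
    let mut last = None;
    for (offset, len) in chunk_windows(tokens.len(), chunk) {
        // Results from earlier chunks are overwritten (dropped) here, so
        // only one chunk's worth of output is held at a time.
        last = Some(forward(&tokens[offset..offset + len], offset));
    }
    last
}

fn main() {
    let tokens: Vec<u32> = (0..1300).collect();
    let mut calls = Vec::new();
    let last = chunked_prefill(&tokens, 512, |window, offset| {
        calls.push((offset, window.len()));
        window.len() // stand-in for "logits"
    });
    println!("windows: {calls:?}, final chunk result: {last:?}");
}
```

Keeping the window arithmetic in one place mirrors the stated reason for the three helpers: every path sees the same offsets and the same "last chunk wins" rule.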
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:46:54 +03:00
parent 6e1c1dd0fc
commit 1e13889392
2 changed files with 294 additions and 22 deletions
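
The two pre-flight gates from the commit message (changes 2 and 3) can be sketched as a single check that runs before any device work. This is an illustration under assumptions, not the committed code: the field types, the `Limits` struct, the `env_or` and `preflight` names, and the order of the two checks are all hypothetical, and `Option<u64>` stands in for the vram=0 CPU sentinel, whose exact representation the message does not show.

```rust
use std::env;
use std::str::FromStr;

// Variant and field names follow the diff; the integer types are assumed.
#[derive(Debug, PartialEq)]
enum InferenceError {
    PromptTooLong { prompt_len: usize, max: usize },
    InsufficientVram { free_mb: u64, required_mb: u64 },
}

/// Hypothetical helper: read a knob from the environment, falling back to
/// `default` when the variable is unset or does not parse.
fn env_or<T: FromStr>(key: &str, default: T) -> T {
    env::var(key).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

/// Hypothetical container for the two gate thresholds.
struct Limits {
    max_prompt_tokens: usize,
    min_free_vram_mb: u64,
}

impl Limits {
    fn from_env() -> Self {
        Self {
            max_prompt_tokens: env_or("NEURON_MAX_PROMPT_TOKENS", 16384),
            min_free_vram_mb: env_or("NEURON_MIN_FREE_VRAM_MB", 1500),
        }
    }
}

/// Run both gates. `vram_free_mb` is None for CPU loads, which skip the
/// VRAM gate (assumed stand-in for the commit's vram=0 sentinel).
fn preflight(
    prompt_len: usize,
    vram_free_mb: Option<u64>,
    limits: &Limits,
) -> Result<(), InferenceError> {
    if prompt_len > limits.max_prompt_tokens {
        return Err(InferenceError::PromptTooLong {
            prompt_len,
            max: limits.max_prompt_tokens,
        });
    }
    if let Some(free_mb) = vram_free_mb {
        if free_mb < limits.min_free_vram_mb {
            return Err(InferenceError::InsufficientVram {
                free_mb,
                required_mb: limits.min_free_vram_mb,
            });
        }
    }
    Ok(())
}

fn main() {
    let limits = Limits::from_env();
    match preflight(13_000, Some(900), &limits) {
        Ok(()) => println!("accepted"),
        Err(e) => println!("rejected before device work: {e:?}"),
    }
}
```

The handler arms in the diff below then map `PromptTooLong` to a 400 and `InsufficientVram` to a 503.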
--- a/crates/neuron/src/api.rs
+++ b/crates/neuron/src/api.rs
@@ -174,6 +174,31 @@ async fn chat_completions(
                Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
            )
                .into_response(),
+            Err(InferenceError::PromptTooLong { prompt_len, max }) => (
+                StatusCode::BAD_REQUEST,
+                Json(json!({
+                    "error": format!("prompt has {prompt_len} tokens but max is {max}"),
+                    "code": "prompt_too_long",
+                    "prompt_len": prompt_len,
+                    "max": max,
+                })),
+            )
+                .into_response(),
+            Err(InferenceError::InsufficientVram {
+                free_mb,
+                required_mb,
+            }) => (
+                StatusCode::SERVICE_UNAVAILABLE,
+                Json(json!({
+                    "error": format!(
+                        "insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"
+                    ),
+                    "code": "insufficient_vram",
+                    "free_mb": free_mb,
+                    "required_mb": required_mb,
+                })),
+            )
+                .into_response(),
            Err(InferenceError::Other(e)) => (
                StatusCode::INTERNAL_SERVER_ERROR,
                Json(json!({"error": format!("{e:#}")})),
@@ -188,6 +213,31 @@ async fn chat_completions(
                Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
            )
                .into_response(),
+            Err(InferenceError::PromptTooLong { prompt_len, max }) => (
+                StatusCode::BAD_REQUEST,
+                Json(json!({
+                    "error": format!("prompt has {prompt_len} tokens but max is {max}"),
+                    "code": "prompt_too_long",
+                    "prompt_len": prompt_len,
+                    "max": max,
+                })),
+            )
+                .into_response(),
+            Err(InferenceError::InsufficientVram {
+                free_mb,
+                required_mb,
+            }) => (
+                StatusCode::SERVICE_UNAVAILABLE,
+                Json(json!({
+                    "error": format!(
+                        "insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"
+                    ),
+                    "code": "insufficient_vram",
+                    "free_mb": free_mb,
+                    "required_mb": required_mb,
+                })),
+            )
+                .into_response(),
            Err(InferenceError::Other(e)) => (
                StatusCode::INTERNAL_SERVER_ERROR,
                Json(json!({"error": format!("{e:#}")})),