All checks were successful
CI / Format (push) Successful in 39s
CI / CUDA type-check (push) Successful in 1m38s
CI / Clippy (push) Successful in 2m19s
CI / Test (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 1m42s
build-prerelease / Build neuron-ada (push) Successful in 2m15s
build-prerelease / Build neuron-ampere (push) Successful in 2m17s
build-prerelease / Build helexa-bench binary (push) Successful in 2m23s
build-prerelease / Build cortex binary (push) Successful in 2m29s
build-prerelease / Test (push) Successful in 4m28s
build-prerelease / Package cortex RPM (push) Successful in 1m15s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 51s
The request path now rejects prompts above the model's self-derived input budget, not the static NEURON_MAX_PROMPT_TOKENS, so a VRAM-tight host (where the VRAM ceiling binds below the static cap) rejects an over-budget prompt up front instead of accepting it and OOMing mid-prefill.

- derived_input_cap: AtomicUsize on LoadedModel + TpLoadedModel; refreshed by LoadedHandle::derived_limit (runs on every /models poll). 0 = not derived yet.
- effective_prompt_cap(): cached derived input when >0, else the static max_prompt_tokens() (cold-start / no-profile fallback).
- validate_request takes the cap as a param; all 4 call sites (chat_completion, inference_stream, inference_tp_stream, TP chat_completion) pass the in-scope model's effective_prompt_cap().
- doc/context-limits.md: enforcement note updated from "remaining" to landed.

Reads the cap lock-free from the sync validate path (no per-request VRAM query); the cap tracks live state via the poll-driven derivation. With this, advertise and enforce agree and both track the resident model.

fmt/clippy/test green; CUDA paths type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
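
For readers without the diff at hand, the cap lookup described in this message can be sketched roughly as follows. This is a minimal illustration, not the code that landed: the names `derived_input_cap`, `effective_prompt_cap`, `max_prompt_tokens` and `validate_request` come from the message, while the struct layout, the `new` constructor, the `store_derived_cap` helper, the `String` error type and the use of `Ordering::Relaxed` are all assumptions made for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for the loaded-model state. Only the field and
/// method names mentioned in the commit message are taken from the source.
pub struct LoadedModel {
    /// Static cap (the role NEURON_MAX_PROMPT_TOKENS plays in the message).
    max_prompt_tokens: usize,
    /// Cached derived input budget; 0 means "not derived yet".
    derived_input_cap: AtomicUsize,
}

impl LoadedModel {
    /// Assumed constructor: starts with no derived cap (cold start).
    pub fn new(max_prompt_tokens: usize) -> Self {
        Self {
            max_prompt_tokens,
            derived_input_cap: AtomicUsize::new(0),
        }
    }

    /// Assumed helper standing in for the poll-driven refresh, which
    /// publishes a freshly derived budget.
    pub fn store_derived_cap(&self, cap: usize) {
        // Relaxed is an assumption: the value is a standalone number and
        // does not guard any other memory.
        self.derived_input_cap.store(cap, Ordering::Relaxed);
    }

    /// Cached derived cap when it is > 0, else the static cap.
    /// The read is lock-free, so it is safe on a synchronous path.
    pub fn effective_prompt_cap(&self) -> usize {
        match self.derived_input_cap.load(Ordering::Relaxed) {
            0 => self.max_prompt_tokens,
            derived => derived,
        }
    }
}

/// Rejects a prompt whose token count exceeds the cap passed in by the caller.
pub fn validate_request(prompt_tokens: usize, cap: usize) -> Result<(), String> {
    if prompt_tokens > cap {
        Err(format!("prompt has {prompt_tokens} tokens, cap is {cap}"))
    } else {
        Ok(())
    }
}
```

In this sketch a call site would invoke `validate_request(n, model.effective_prompt_cap())`. Before the first derivation the static cap applies, and once a lower derived cap has been stored, a prompt that previously passed is rejected up front.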