Regression from #49: the auth middleware rejected ANY present-but-
unresolvable bearer token with 401 invalid_api_key, even when
require_auth=false. But OpenAI-compatible clients (opencode, Open WebUI,
Agent Zero, litellm) send a placeholder bearer by default, so deploying
the build broke every existing client even though the operator never
opted into auth. Pre-#49 the bearer was never inspected at all.
Fix: in allow-anonymous mode (require_auth=false, the default) an
unrecognized key is now ignored and the request is served anonymously,
restoring pre-#49 behaviour. A bad key only 401s when require_auth=true.
A valid key is still resolved + metered in both modes.
Test renamed/split: unrecognized_key_is_ignored_when_auth_not_required
(now 200, served anonymously) + invalid_key_is_401_when_auth_required.
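Sketch of the decision table above (illustrative only, not the shipped
middleware). `AuthDecision`, `decide`, and the `resolve` closure are
hypothetical names; the real code resolves keys through the
EntitlementProvider.

```rust
/// Hypothetical stand-in for the resolved identity (not the real `Principal` type).
#[derive(Debug, Clone, PartialEq)]
pub struct Principal {
    pub account_id: String,
}

#[derive(Debug, PartialEq)]
pub enum AuthDecision {
    /// A recognized key: serve and meter against this principal.
    Authenticated(Principal),
    /// No usable key, and the operator has not opted into auth.
    Anonymous,
    /// Maps to 401 invalid_api_key.
    Reject,
}

/// `resolve` stands in for the key lookup; it returns None for unknown keys.
pub fn decide<F>(require_auth: bool, bearer: Option<&str>, resolve: F) -> AuthDecision
where
    F: Fn(&str) -> Option<Principal>,
{
    match bearer.and_then(|key| resolve(key)) {
        // A valid key is resolved in both modes.
        Some(principal) => AuthDecision::Authenticated(principal),
        // Missing or unrecognized key only rejects when auth is required.
        None if require_auth => AuthDecision::Reject,
        // Allow-anonymous mode: an unrecognized placeholder key is ignored.
        None => AuthDecision::Anonymous,
    }
}
```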
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stage 3 (DX): A0 burned an hour then failed deep in litellm with
prompt_too_long (35544 > 32768). cortex knows each model's real context
window (#62/#67) and can pre-empt that at the edge.
- Pre-validate the prompt against the model's advertised limit.context
before dispatch (in proxy_with_metrics, covering chat/completions/
responses). Over → 400 context_length_exceeded in the #60 envelope — the
same shape neuron emits on overflow, just earlier and without burning a
cold-load/queue slot. cortex has no tokenizer, so estimate_prompt_tokens
under-counts (~4 chars/token over message text); neuron stays the exact
wall and we only catch gross overages. Skipped when no limit is known.
- Advisory X-Helexa-Advice header: fingerprints User-Agent
(litellm / Agent-Zero / Zed) and attaches client-specific guidance.
Strictly advisory — header only, never in the error envelope, behaviour
never depends on it; unknown clients get nothing.
3 integration tests: over-long prompt → 400 context_length_exceeded with
the advice header, refused before neuron is hit; within-context passes
through; unknown client gets a clean 400 with no advice header. cortex-side
(no CUDA); local fmt/clippy/test green.
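Sketch of the edge check described above, assuming the estimate is simply
character count divided by 4 over the message text. The `prevalidate`
function and `ContextLengthExceeded` struct are hypothetical; only
estimate_prompt_tokens is a name from this change, and its real signature
may differ.

```rust
/// Rough token estimate: about 4 characters per token over the message text.
pub fn estimate_prompt_tokens(messages: &[&str]) -> usize {
    let chars: usize = messages.iter().map(|m| m.chars().count()).sum();
    chars / 4
}

/// Hypothetical error carrying what a 400 context_length_exceeded would report.
#[derive(Debug, PartialEq)]
pub struct ContextLengthExceeded {
    pub estimated: usize,
    pub limit: usize,
}

/// `limit_context` is None when no limit is known; the check is then skipped.
pub fn prevalidate(
    messages: &[&str],
    limit_context: Option<usize>,
) -> Result<(), ContextLengthExceeded> {
    let Some(limit) = limit_context else {
        return Ok(());
    };
    let estimated = estimate_prompt_tokens(messages);
    if estimated > limit {
        // Refuse before dispatch; the exact check still happens downstream.
        Err(ContextLengthExceeded { estimated, limit })
    } else {
        Ok(())
    }
}
```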
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stage 2 completes: when a model is loaded on more than one healthy neuron,
the router picks the least-busy replica instead of always taking the first,
and neuron backpressure propagates to the client intact.
- NodeState.model_load: per-model admission load (in_flight + queue_depth),
stashed by the poller from neuron's /health (#53/#2b).
- router::resolve collects all loaded replicas and picks the one with the
lowest in_flight+queue_depth (ties break by node name for determinism),
replacing the previous first-match-wins.
- Backpressure passthrough: the existing streaming proxy already forwards
the upstream status + all headers verbatim, so a neuron 503/429 +
Retry-After + #60 envelope reaches the client unmodified — now covered by
a regression test so a future change can't silently unwrap it.
Tests (tests/load_routing.rs): routes to the idle replica and follows the
lighter load when it flips; ties break by name; a saturated neuron's 503 +
Retry-After + envelope propagates through the gateway intact. All
cortex-side (no CUDA); local fmt/clippy/test green.
Retry-route-to-another-replica-on-backpressure (the issue's stretch goal)
is deferred — least-busy spread + honest passthrough is the substantive win.
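Sketch of the replica choice (illustrative; `Replica` and
`pick_least_busy` are hypothetical names, not the router::resolve
signature): lowest in_flight + queue_depth wins, and ties break by node
name.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Replica {
    pub node: String,
    pub in_flight: u32,
    pub queue_depth: u32,
}

/// Picks the replica with the lowest admission load; ties break by node
/// name so the choice is deterministic.
pub fn pick_least_busy(replicas: &[Replica]) -> Option<&Replica> {
    replicas.iter().min_by(|a, b| {
        let load_a = a.in_flight + a.queue_depth;
        let load_b = b.in_flight + b.queue_depth;
        load_a.cmp(&load_b).then_with(|| a.node.cmp(&b.node))
    })
}
```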
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Budget caps bound total spend over time (#52); this change bounds
per-principal concurrency at any instant, so one principal's burst can't
monopolize a model and starve others while they wait.
- AdmissionController gains per-principal accounting (moved from a lone
atomic to a Mutex<AdmissionState> holding the overall pending count + a
per-principal map). enter(principal) now also fast-rejects when a
principal already has max_per_principal requests in flight/queued →
AdmissionRejection::PrincipalCap. Anonymous (None) requests are exempt.
- Config [harness.candle.admission].max_per_principal (default 2 = one
running + one queued; 0 disables). A bursting principal's overflow is
refused while a different principal still gets a queue slot.
- The principal (account/key) is reconstructed on the neuron side from the
x-helexa-account-id/key-id headers cortex stamps (#49) — trusted over
WireGuard, never from the request body — and threaded explicitly through
all inference entry points (chat_completion, *_stream(_with),
responses_stream, and the TP variants) to the admission gate.
- InferenceError::PerPrincipalLimit → 429 rate_limit_exceeded + Retry-After
(distinct from load-shedding's 503 Overloaded); opencode/AI SDK self-pace.
Tests: fair-share unit test (A floods → A's 2nd is PrincipalCap, B still
queues + is served) + the existing admission tests adapted to enter(None).
Non-CUDA build green locally; TP entry points (cuda-gated) validated by CI.
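Sketch of the per-principal accounting (a synchronous simplification; the
real AdmissionController is async and also handles waiting). `Admission`,
`State`, `Permit`, and `max_pending` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Debug, PartialEq)]
pub enum Rejection {
    QueueFull,
    PrincipalCap,
}

#[derive(Default)]
struct State {
    pending: usize,                        // overall running + queued
    per_principal: HashMap<String, usize>, // running + queued per principal
}

pub struct Admission {
    state: Arc<Mutex<State>>,
    max_pending: usize,
    max_per_principal: usize, // 0 disables the per-principal cap
}

/// Held for the lifetime of a request; frees its slots on drop.
pub struct Permit {
    state: Arc<Mutex<State>>,
    principal: Option<String>,
}

impl Admission {
    pub fn new(max_pending: usize, max_per_principal: usize) -> Self {
        Self { state: Arc::default(), max_pending, max_per_principal }
    }

    /// Anonymous (None) requests are exempt from the per-principal cap.
    pub fn enter(&self, principal: Option<&str>) -> Result<Permit, Rejection> {
        let mut st = self.state.lock().unwrap();
        if st.pending >= self.max_pending {
            return Err(Rejection::QueueFull);
        }
        if let Some(p) = principal {
            let held = st.per_principal.get(p).copied().unwrap_or(0);
            if self.max_per_principal > 0 && held >= self.max_per_principal {
                return Err(Rejection::PrincipalCap);
            }
            *st.per_principal.entry(p.to_string()).or_insert(0) += 1;
        }
        st.pending += 1;
        Ok(Permit {
            state: Arc::clone(&self.state),
            principal: principal.map(str::to_string),
        })
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        let mut st = self.state.lock().unwrap();
        st.pending -= 1;
        if let Some(p) = &self.principal {
            let remaining = st.per_principal.get_mut(p).map(|n| {
                *n -= 1;
                *n
            });
            if remaining == Some(0) {
                st.per_principal.remove(p);
            }
        }
    }
}
```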
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes #53: the bounded scheduler's lock-free counters are now visible
to the fleet, which is what cortex's load-aware router (#55) consumes to
spread traffic across replicas and propagate honest backpressure.
- cortex-core::discovery: HealthResponse gains `models: Vec<ModelLoad>`
(#[serde(default)] — back-compatible; older gateways/neurons interop).
ModelLoad { id, in_flight, queue_depth }.
- LoadedHandle::load() → (in_flight, queue_depth), lock-free for both
single-GPU and TP; CandleHarness::load_snapshot() enumerates resident
models; the /health handler overlays it from the candle harness.
Tests: /health always exposes a models array (api integration test); a
pre-#53 payload without `models` still deserializes, and ModelLoad
round-trips (cortex-core serde tests). Local fmt/clippy/test green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replaces the per-model unbounded, untimed FIFO of inference-lock waiters
(a busy model made new requests hang ~300s until the client gave up with
an opaque error) with an explicit bounded scheduler.
- harness::admission::AdmissionController: batch-1 scheduler — max_in_flight
running (1) + a bounded queue (max_queue_depth) with a max_wait. enter()
fast-rejects when the queue is full (QueueFull) or the wait elapses
(Timeout); the returned AdmissionPermit is held for the request and frees
both slots on drop. Pure async (no CUDA), lock-free in_flight/queue_depth
counters for future /health reporting. Configurable via
[harness.candle.admission] (max_in_flight=1, max_queue_depth=8,
max_wait_secs=30).
- Gated at all four inference entry points before the inference_lock/pool
lock: single-GPU non-streaming + streaming, TP non-streaming + streaming.
The streaming paths acquire the permit before opening the SSE (so a
rejection is a clean error, not a half-open stream) and move it into the
inference task.
- InferenceError::Overloaded { retry_after_secs } → 503 rate_limit_exceeded
+ Retry-After via the #60/#63 envelope: a fast, retryable "busy" signal
opencode/AI SDK back off on, not a stall.
Scope: this branch is the admission *core* (the hang→backpressure fix).
Exposing in_flight/queue_depth in GET /health (consumed by cortex
load-aware routing #55) is the next focused branch under #53.
4 unit tests (admit/report load, queue-full reject, wait-timeout reject)
+ Overloaded envelope mapping test. Non-CUDA build green locally; the
CUDA + TP sites are validated by branch CI.
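Sketch of the fast-reject half of the scheduler using lock-free counters
(illustrative only; the max_wait timeout and the async wait for the
running slot are omitted). `Gate`, `try_enter`, and `Permit` are
hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

#[derive(Debug, PartialEq)]
pub enum Rejection {
    QueueFull,
}

pub struct Gate {
    admitted: Arc<AtomicUsize>, // running + queued
    capacity: usize,            // max_in_flight + max_queue_depth
}

/// Frees its slot when dropped, whether the request succeeded or not.
pub struct Permit {
    admitted: Arc<AtomicUsize>,
}

impl Gate {
    pub fn new(max_in_flight: usize, max_queue_depth: usize) -> Self {
        Self {
            admitted: Arc::new(AtomicUsize::new(0)),
            capacity: max_in_flight + max_queue_depth,
        }
    }

    pub fn try_enter(&self) -> Result<Permit, Rejection> {
        // Optimistically claim a slot, then roll back if that overshot.
        let prev = self.admitted.fetch_add(1, Ordering::AcqRel);
        if prev >= self.capacity {
            self.admitted.fetch_sub(1, Ordering::AcqRel);
            return Err(Rejection::QueueFull);
        }
        Ok(Permit { admitted: Arc::clone(&self.admitted) })
    }

    /// Current load, readable without taking any lock.
    pub fn load(&self) -> usize {
        self.admitted.load(Ordering::Acquire)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.admitted.fetch_sub(1, Ordering::AcqRel);
    }
}
```

With the defaults above (max_in_flight=1, max_queue_depth=8) the tenth
concurrent request is the first one refused.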
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stage 1 complete: the A0 seatbelt (#52). Flips the metering-only reserve(0)
from #51 to the request's real upper-bound cost and refuses over-cap
requests *before* neuron is hit.
- metering::reservation_estimate: prompt estimate (~4 chars/token over the
body — cortex has no tokenizer, so a conservative over-estimate; neuron
stays the exact context wall) + max output. Max output comes from
max_completion_tokens / legacy max_tokens, else the model's advertised
limit.output (#62), else FALLBACK_MAX_OUTPUT. Over-reserving is safe —
settle reconciles to actual.
- metering::reserve_or_reject: reserve the estimate; on BudgetError map to
the #63 envelope and the caller refuses before dispatch — rolling window →
429 rate_limit_exceeded + Retry-After (until reset); hard balance → 429
insufficient_quota (no Retry-After). Never 402.
- Wired into both the OpenAI proxy path (proxy_with_metrics) and the
Anthropic path (estimate from the translated body). advertised_output_limit
reads the loaded model's limit.output from fleet state.
- Reservation prevents overshoot under concurrency: a successful reserve
gates on spent+reserved+estimate ≤ cap, and settle records actual ≤
reserved, so spend can never exceed the hard cap.
4 integration tests with a hit-counting mock neuron: balance over-cap →
429 insufficient_quota (no Retry-After, not dispatched); rolling over-cap →
429 rate_limit_exceeded + Retry-After (not dispatched); within-cap served;
**A0 repro** — a capped key's 20-request fan-out drains the cap, then is
refused, neuron only saw the served ones, and spend never exceeds the cap.
Plus 5 metering unit tests. Local fmt/clippy/test all green.
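Sketch of the estimate and the reserve/settle arithmetic (illustrative;
`Ledger` and `OverCap` are hypothetical, and the FALLBACK_MAX_OUTPUT value
below is a placeholder, not the real constant).

```rust
/// Placeholder value for illustration only.
pub const FALLBACK_MAX_OUTPUT: u64 = 4096;

/// max_completion_tokens, else legacy max_tokens, else the model's
/// advertised limit.output, else the fallback.
pub fn max_output(
    max_completion_tokens: Option<u64>,
    max_tokens: Option<u64>,
    advertised_output: Option<u64>,
) -> u64 {
    max_completion_tokens
        .or(max_tokens)
        .or(advertised_output)
        .unwrap_or(FALLBACK_MAX_OUTPUT)
}

/// Prompt estimate (about 4 chars/token over the body) plus max output.
pub fn reservation_estimate(body_chars: usize, max_output: u64) -> u64 {
    body_chars as u64 / 4 + max_output
}

#[derive(Debug, Default)]
pub struct Ledger {
    pub cap: u64,
    pub spent: u64,
    pub reserved: u64,
}

#[derive(Debug, PartialEq)]
pub struct OverCap;

impl Ledger {
    /// Succeeds only while spent + reserved + estimate stays within the cap.
    pub fn reserve(&mut self, estimate: u64) -> Result<u64, OverCap> {
        if self.spent + self.reserved + estimate > self.cap {
            return Err(OverCap);
        }
        self.reserved += estimate;
        Ok(estimate)
    }

    /// Releases the reservation and records actual spend, never more than
    /// what was reserved.
    pub fn settle(&mut self, reserved: u64, actual: u64) {
        self.reserved -= reserved;
        self.spent += actual.min(reserved);
    }
}
```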
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stage 1 accounting (#51): capture real per-request usage and feed it to
the spend ledger + per-principal metrics. Establishes the reserve→settle
lifecycle that budget enforcement (#52) will tighten.
- cortex-gateway::metering: ReservationGuard makes reservation leaks
impossible — settle() records actual spend + releases the remainder;
dropping an un-settled guard releases the whole reservation, so any
early return / error / dropped stream resolves it. UsageSink is the
completion hook; principal_from_headers reconstructs the principal from
the middleware-stamped headers (uniform across all proxy paths, no
handler-signature churn); record_spend emits per-principal counters.
- proxy::TokenMetrics gains an optional usage_sink, invoked exactly once
in finish() with the observed (prompt, completion) — restructured so it
always runs (even when no body/usage arrived → settle 0 → release),
while preserving the existing per-model metric emissions unchanged.
- All proxy paths metered: chat/completions/responses via
proxy_with_metrics (reserve 0 → forward_request → settle in finish);
Anthropic non-streaming settles from the buffered body; Anthropic
streaming (anthropic_sse) now scans the upstream frames for the usage
object (#48) — it captured none before — and settles at pump end.
- This phase reserves 0 tokens (metering only, no enforcement); #52 flips
the reserved amount to prompt+max_output and surfaces BudgetError. The
settle/release plumbing is identical, so that change is localized.
- New Prometheus counters: cortex_spend_tokens_total (+ prompt/completion
splits), labelled by account/key.
2 integration tests: cumulative per-key spend after N requests with
reservations settled to zero outstanding; anonymous requests record no
spend. Local fmt/clippy/test all green.
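Sketch of the guard's settle-or-release behaviour (illustrative; this
`Ledger` and the constructor shape are hypothetical, only the
ReservationGuard name and its settle/drop semantics come from the change).

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, Default)]
pub struct Ledger {
    pub spent: u64,
    pub reserved: u64,
}

pub struct ReservationGuard {
    ledger: Arc<Mutex<Ledger>>,
    amount: u64,
    settled: bool,
}

impl ReservationGuard {
    pub fn reserve(ledger: Arc<Mutex<Ledger>>, amount: u64) -> Self {
        ledger.lock().unwrap().reserved += amount;
        Self { ledger, amount, settled: false }
    }

    /// Records actual spend and releases the whole reservation.
    pub fn settle(mut self, actual: u64) {
        {
            let mut ledger = self.ledger.lock().unwrap();
            ledger.reserved -= self.amount;
            ledger.spent += actual;
        }
        // Mark as settled so Drop does not release a second time.
        self.settled = true;
    }
}

impl Drop for ReservationGuard {
    fn drop(&mut self) {
        // Early return, error, or dropped stream: release everything.
        if !self.settled {
            self.ledger.lock().unwrap().reserved -= self.amount;
        }
    }
}
```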
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stage 1 identity (#49): cortex now knows who a request is for. Identity
rides standard bearer auth only (Authorization: Bearer <key>) — no custom
required headers or body fields — which is what keeps every tier
OpenAI-compatible by construction.
- cortex-gateway::auth: `require_principal` axum middleware
(from_fn_with_state), wired in build_app outer-to-inner as
trace → CORS → auth → handlers (CORS outer so preflight short-circuits).
It resolves the bearer key via the EntitlementProvider, inserts the
typed Principal into request extensions (for metering #51 / enforcement
#52), and stamps internal x-helexa-account-id / x-helexa-key-id headers
so the principal reaches neuron, which trusts cortex over WireGuard (#54).
- Anti-spoofing: client-supplied principal headers are stripped before the
authoritative value is stamped — a client can never assert a principal
it didn't authenticate as.
- Rejection contract (#63): missing key under require_auth, or any present
but unresolvable key, → 401 invalid_api_key in the #60 envelope. /health
and / stay public. require_auth=false (default) allows anonymous through
but still 401s a present-but-invalid key.
- Header-name constants (HEADER_ACCOUNT_ID/KEY_ID) live in cortex-core so
neuron (#54) shares them. The chat/completions/responses paths forward
the stamped headers automatically via proxy::forward_request; the
Anthropic streaming + non-streaming paths forward them explicitly via
auth::forward_principal_headers (they build their own upstream requests).
5 integration tests: missing-key 401, invalid-key 401 (even when auth not
required, not dispatched), valid key reaches neuron with principal headers
+ spoofed header stripped, anonymous allowed when not required, /health
public. Local fmt/clippy/test all green.
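Sketch of the strip-then-stamp order behind the anti-spoofing rule,
modelling headers as a plain Vec of pairs rather than the real HTTP
header map. `stamp_principal` is a hypothetical name; the header names are
the ones this change introduces.

```rust
pub const HEADER_ACCOUNT_ID: &str = "x-helexa-account-id";
pub const HEADER_KEY_ID: &str = "x-helexa-key-id";

/// Removes any client-supplied principal headers, then stamps the
/// authenticated principal (account id, key id) if there is one.
pub fn stamp_principal(headers: &mut Vec<(String, String)>, principal: Option<(&str, &str)>) {
    // Strip first, unconditionally: anonymous requests must not carry
    // a spoofed principal either.
    headers.retain(|(name, _)| {
        !name.eq_ignore_ascii_case(HEADER_ACCOUNT_ID) && !name.eq_ignore_ascii_case(HEADER_KEY_ID)
    });
    if let Some((account_id, key_id)) = principal {
        headers.push((HEADER_ACCOUNT_ID.to_string(), account_id.to_string()));
        headers.push((HEADER_KEY_ID.to_string(), key_id.to_string()));
    }
}
```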
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stage 1's build seam (#50): the interface auth, metering, and budget
enforcement all hang off, with a local/static provider so the A0
amplification fix can land before any upstream clearing house exists.
The future helexa-upstream client (#57) is just another impl.
- cortex-core::entitlements: Principal {account_id, key_id}, CapWindow
(Balance | Rolling{seconds}), Reservation handle, BudgetSnapshot,
AuthError/BudgetError, and the async EntitlementProvider trait
(resolve / reserve / settle / release / snapshot). BudgetError carries
the window semantics so callers pick the #63 code (rate_limit_exceeded
+ Retry-After vs insufficient_quota) without the provider touching HTTP.
- cortex-core::config: [entitlements] section on GatewayConfig
(require_auth + [[entitlements.keys]] with account_id, optional key_id,
hard_cap, window). Additive + serde(default) — anonymous/uncapped when
omitted, so existing setups are unaffected.
- cortex-gateway::entitlements_local: LocalEntitlementProvider. Budget
math serialized under one Mutex so spent+reserved can never exceed a
hard cap under concurrency (the #52 guarantee); rolling windows reset
lazily; uncapped keys (no hard_cap) always reserve but still meter.
- CortexState gains Arc<dyn EntitlementProvider> + require_auth, built in
from_config. Not yet consumed by the request path — auth middleware is
1b (#49), enforcement is 1d (#52).
- cortex.example.toml documents the section; test GatewayConfig literals
updated for the new field.
6 provider unit tests (resolve, unknown-key, round-trip, balance/rolling
over-cap codes, uncapped infra key). Local fmt/clippy/test all green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The rejection contract (#63) requires every "no" path to speak the
OpenAI envelope with standard codes and, for retryable conditions, a
Retry-After header. Two gaps remained despite #63 being closed:
Retry-After was implemented nowhere, and the envelope was hand-built
inline in four places (gateway handlers/proxy/router, neuron api) with
no shared source of truth — exactly the inconsistency #63 set out to
prevent, and a foundation every Stage 1-2 rejection (401/429/503) needs.
- cortex-core: new `error_envelope::OpenAiError` — an axum-agnostic
builder carrying status, type, code, message, param, optional
retry_after, and diagnostic extras. Named constructors encode the #63
codes (invalid_api_key, rate_limit_exceeded, insufficient_quota,
context_length_exceeded, service_unavailable) and which carry
Retry-After. cortex-core stays a pure types crate; each HTTP crate
owns a thin `envelope_response` adapter that sets the header.
- cortex-gateway: route error_response, ProxyError, and RouteError
through the shared builder; RouteError::retry_after_secs wires
Retry-After on the transient NoHealthyNodes (5s) / ModelRecovering
(2s) variants.
- neuron: route inference_error_response through the shared builder;
InsufficientVram (transient 503) now advertises Retry-After: 5.
Behaviour for existing paths is unchanged (same status/type/code/extras);
only the new Retry-After headers are added. Tests cover the builder wire
shape and Retry-After presence/absence on both sides.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The request path now rejects prompts above the model's self-derived input
budget, not the static NEURON_MAX_PROMPT_TOKENS — so a VRAM-tight host
(where the VRAM ceiling binds below the static cap) rejects an
over-budget prompt up front instead of accepting it and OOMing
mid-prefill.
- derived_input_cap: AtomicUsize on LoadedModel + TpLoadedModel; refreshed
by LoadedHandle::derived_limit (runs on every /models poll). 0 = not
derived yet.
- effective_prompt_cap(): cached derived input when >0, else the static
max_prompt_tokens() (cold-start / no-profile fallback).
- validate_request takes the cap as a param; all 4 call sites
(chat_completion, inference_stream, inference_tp_stream, TP
chat_completion) pass the in-scope model's effective_prompt_cap().
- doc/context-limits.md: enforcement note updated from "remaining" to
"landed".
Reads the cap lock-free from the sync validate path (no per-request VRAM
query); the cap tracks live state via the poll-driven derivation. With
this, advertise and enforce agree and both track the resident model.
fmt/clippy/test green; CUDA paths type-checked in CI.
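Sketch of the lock-free cap read (illustrative; `ModelCaps`, `refresh`,
and `validate_prompt` are hypothetical names, and the real fields live on
LoadedModel / TpLoadedModel).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct ModelCaps {
    derived_input_cap: AtomicUsize, // 0 = not derived yet
    static_cap: usize,              // static max_prompt_tokens fallback
}

impl ModelCaps {
    pub fn new(static_cap: usize) -> Self {
        Self { derived_input_cap: AtomicUsize::new(0), static_cap }
    }

    /// Called from the poll-driven derivation with the latest input budget.
    pub fn refresh(&self, derived_input: usize) {
        self.derived_input_cap.store(derived_input, Ordering::Relaxed);
    }

    /// Cached derived input when known, else the static cap.
    pub fn effective_prompt_cap(&self) -> usize {
        match self.derived_input_cap.load(Ordering::Relaxed) {
            0 => self.static_cap,
            derived => derived,
        }
    }
}

/// The validator takes the cap as a parameter rather than reading a global.
pub fn validate_prompt(prompt_len: usize, cap: usize) -> Result<(), String> {
    if prompt_len > cap {
        Err(format!("prompt of {prompt_len} tokens exceeds cap of {cap}"))
    } else {
        Ok(())
    }
}
```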
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Demotes the static per-host prompt cap from authority to an optional
upper-bound clamp on the self-derived limit, and rewrites the
context-limits doc around the computed model.
- max_prompt_tokens_clamp(): reads NEURON_MAX_PROMPT_TOKENS directly so
"explicitly set" is distinct from the 16384 default; returns None when
unset (no clamp). Applied as derive_limit's hard_ceiling in
LoadedHandle::derived_limit, so the advertised context is clamped only
when an operator set a backstop — the derivation is otherwise
authoritative and binds below it in practice.
- doc/context-limits.md: intro + "After #62" rewritten as "After #67 —
the neuron computes its own limit" (formula, live signals, config
block, opencode note, NEURON_MAX_PROMPT_TOKENS demotion).
Remaining (phase 5b, follow-up): enforce the *derived* input as the
prompt cap (reject above computed input, not the static
NEURON_MAX_PROMPT_TOKENS) so VRAM-tight hosts can't accept an
OOM-inducing prompt. Needs a per-model cached cap read from the sync
validate path; scoped separately. Until then the static cap remains the
enforced backstop (advertised <= enforced holds when the env is set).
fmt/clippy/test green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The neuron now self-derives and advertises limit{context,input,output}
per loaded model; cortex forwards it and stops consulting the
operator-declared catalogue limit (which can't track hot-swapped models
or live capacity). Operator-set `cost` still flows from the catalogue.
neuron:
- CandleHarness gains context_limit_cfg (from [harness.candle.context_limit]).
- LoadedHandle::derived_limit(): profile + live tightest-card free VRAM
(single: query_vram; TP: query_vram_tightest_free_mb) + prefill-rate
EMA (bootstrap until first sample) → derive_limit. None for arches
without a context profile. No operator clamp here (advertise the honest
derived value; the clamp is an enforcement-side backstop).
- list_models() fills ModelInfo.limit from derived_limit (was None).
- derive_limit treats free_tightest_mb == 0 (unknown/CPU sentinel) as
"no VRAM ceiling" instead of collapsing to zero.
cortex:
- ModelEntry gains `limit`, copied from ModelInfo.limit by the poller.
- /v1/models: catalogue `limit` no longer flows (Pass 1 sets None);
Pass 2 adopts the neuron's limit, taking the tightest across neurons
via tightest_limit(). cost unchanged.
- model_limits.rs rewritten: catalogue limit (999999) is ignored; the
neuron's ModelEntry.limit is advertised; cost still from catalogue.
- All ModelEntry literals updated with the new field.
fmt/clippy/test green; CUDA paths type-checked in CI.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refs #67. Feeds the throughput ceiling a live, per-model prefill rate
instead of only the configured bootstrap estimate, so the advertised
limit tracks real prefill speed and rises automatically as prefix
caching (#11) reduces effective prefill cost.
- context_limit::PrefillRateEma: lock-free f64-bits EMA (alpha 0.3),
ignores degenerate samples, None before the first sample. Unit-tested.
- prefill_rate field on LoadedModel + TpLoadedModel.
- Recorded as total-prompt-tokens / prefill-elapsed in the two streaming
serving paths (TP: inference_tp_stream via tp_for_task; single-GPU:
stream_inference_via_worker via a new &prefill_rate param threaded from
loaded_for_task). Measuring total prompt (not just the divergent
suffix) means a prefix-cache hit shrinks elapsed while the prompt stays
large, so the effective rate — and the ceiling — rises toward the VRAM
ceiling, exactly the #11 payoff.
Per the agreed scope, non-streaming + CPU paths fall back to the
bootstrap estimate (opencode streams; those paths rarely carry the
fleet). fmt/clippy/test green; CUDA paths type-checked in CI.
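Sketch of a lock-free EMA stored as f64 bits in an AtomicU64
(illustrative; the sentinel encoding and method names are assumptions,
only the alpha of 0.3, the degenerate-sample rule, and the None-before-
first-sample behaviour come from the change).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const ALPHA: f64 = 0.3;
/// Sentinel bit pattern (a NaN) meaning "no sample recorded yet".
const EMPTY: u64 = u64::MAX;

pub struct PrefillRateEma {
    bits: AtomicU64,
}

impl PrefillRateEma {
    pub fn new() -> Self {
        Self { bits: AtomicU64::new(EMPTY) }
    }

    /// Tokens per second, or None before the first valid sample.
    pub fn get(&self) -> Option<f64> {
        match self.bits.load(Ordering::Relaxed) {
            EMPTY => None,
            bits => Some(f64::from_bits(bits)),
        }
    }

    /// Records total prompt tokens over prefill time; degenerate samples
    /// (no tokens, zero or non-finite elapsed) are ignored.
    pub fn record(&self, prompt_tokens: usize, elapsed_secs: f64) {
        if prompt_tokens == 0 || !elapsed_secs.is_finite() || elapsed_secs <= 0.0 {
            return;
        }
        let sample = prompt_tokens as f64 / elapsed_secs;
        // fetch_update retries the closure if another thread raced us.
        let _ = self.bits.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |old| {
            let next = if old == EMPTY {
                sample
            } else {
                ALPHA * sample + (1.0 - ALPHA) * f64::from_bits(old)
            };
            Some(next.to_bits())
        });
    }
}
```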
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refs #67. Captures the per-model context physics at load and adds the
live free-VRAM signal the derivation needs — the tightest card across TP
ranks, not just the leader.
- ContextProfile captured at load:
- single-GPU dense CUDA path (world_size 1) via
context_limit::profile_from_qwen3_5_config(config_path, ..);
- TP path (world_size = tp_size) at TpLoadedModel construction.
GGUF/CPU/non-qwen3_5 → None (fall back to the static prompt cap).
New `context_profile` field on LoadedModel + TpLoadedModel.
- profile_from_qwen3_5_config(): reads config.json (mirrors
VisionMeta::from_config_path), counts full_attention layers
(layer_types authoritative, full_attention_interval fallback), builds
the per-card KV cost via the shared helper.
- Folded the inline per-rank KV-bytes math in tp_qwen3.rs (both
cuda/non-cuda log_construction_complete) and tp_qwen3_5.rs onto
context_limit::kv_bytes_per_token + KV_CACHE_DTYPE_BYTES.
- Per-rank VRAM fan-out (tightest card):
- WorkerRequest::QueryVram + WorkerResponse::VramInfo { free_mb, total_mb };
- worker.rs handle_query_vram (cuda: mem_get_info; non-cuda: error);
- WorkerPool::query_vram_tightest_free_mb fans out to every rank
(leader via its device worker, subprocess ranks via RPC) → min free;
- TpLoadedModel::query_vram_tightest_free_mb convenience wrapper.
No advertise/enforce yet (phases 4/5). fmt/clippy/test green; CUDA paths
type-checked in CI.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refs #67. The correct limit{context,input,output} for a deployment is a
computed function of model architecture + live free VRAM + a
coherence/throughput trade-off, not an operator-declared static fact that
goes stale on model swap. This lands the arch-agnostic derivation core;
later phases capture per-model physics at load, measure throughput, and
advertise/enforce the computed limit.
- crates/neuron/src/harness/context_limit.rs (new):
- kv_bytes_per_token(): shared per-card KV cost (counts only
full-attention layers; sharded by TP world size). The TP load paths'
inline math folds onto this in phase 2.
- ContextProfile: per-model physics snapshot (max_position_embeddings,
kv_bytes_per_token_per_card, world_size).
- derive_limit(): context = min(max_pos, vram_ceiling,
throughput_ceiling) clamped by an optional backstop; input = context −
output; rounded to 1024. 6 unit tests.
- config.rs: [harness.candle.context_limit] block (mirrors prefix_cache):
target_prefill_latency_secs, bootstrap_prefill_tok_per_sec,
activation_headroom_mb, min_free_floor_mb, output_reserve_tokens.
- neuron.example.toml: documented the new block.
No runtime behaviour change yet. fmt/clippy/test green.
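Sketch of the derivation formula (illustrative only). The min() structure,
the optional backstop, input = context - output, and the 1024 rounding are
from the description above; how the VRAM and throughput ceilings are
computed from the config fields, and rounding down rather than to nearest,
are assumptions made for this example. The zero-free-VRAM rule is also an
assumption here.

```rust
pub struct ContextProfile {
    pub max_position_embeddings: u64,
    pub kv_bytes_per_token_per_card: u64,
}

pub struct LimitCfg {
    pub target_prefill_latency_secs: f64,
    pub activation_headroom_mb: u64,
    pub min_free_floor_mb: u64,
    pub output_reserve_tokens: u64,
}

#[derive(Debug, PartialEq)]
pub struct Limit {
    pub context: u64,
    pub input: u64,
    pub output: u64,
}

pub fn derive_limit(
    profile: &ContextProfile,
    cfg: &LimitCfg,
    free_tightest_mb: u64,
    prefill_tok_per_sec: f64,
    hard_ceiling: Option<u64>,
) -> Limit {
    // Assumed: tokens that fit in free VRAM after headroom and floor.
    // A free value of 0 is treated as "unknown", i.e. no VRAM ceiling.
    let vram_ceiling = if free_tightest_mb == 0 {
        u64::MAX
    } else {
        let usable_mb =
            free_tightest_mb.saturating_sub(cfg.activation_headroom_mb + cfg.min_free_floor_mb);
        usable_mb * 1024 * 1024 / profile.kv_bytes_per_token_per_card.max(1)
    };
    // Assumed: tokens that can be prefilled within the target latency.
    let throughput_ceiling = (cfg.target_prefill_latency_secs * prefill_tok_per_sec) as u64;

    let mut context = profile
        .max_position_embeddings
        .min(vram_ceiling)
        .min(throughput_ceiling);
    if let Some(ceiling) = hard_ceiling {
        context = context.min(ceiling);
    }
    let context = context / 1024 * 1024; // round down to a multiple of 1024
    let output = cfg.output_reserve_tokens.min(context);
    Limit { context, input: context - output, output }
}
```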
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closes #64.
opencode meters reasoning tokens separately via the OpenAI-standard
detail objects, which neuron's usage structs didn't expose. Add them
additively so older clients ignore them.
- cortex-core: Usage gains completion_tokens_details/prompt_tokens_details;
ResponsesUsage gains output_tokens_details/input_tokens_details. Optional
+ skip_serializing_if, so the wire shape is unchanged for non-reasoning
models. cached_tokens fields are defined but always None until prompt
caching lands (#11).
- candle.rs: count tokens generated while in_reasoning across all three
streaming paths (TP, worker, CPU); carry the count on InferenceEvent::Finish.
- chat projector: populate completion_tokens_details.reasoning_tokens.
- responses projector: wire up base usage emission on the streaming path
(it emitted none before) and add output_tokens_details.reasoning_tokens.
- non-streaming paths leave details None (they don't track in_reasoning).
reasoning_tokens is a sub-count of completion/output tokens (OpenAI
semantics) — not added into total_tokens.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cortex resolved the catalogue path "models.toml" relative to the service's
working directory, so the systemd-launched binary never found
/etc/cortex/models.toml and ran with an EMPTY catalogue in production —
limits, cost, pinning, aliases and feasibility were all silent no-ops,
with models surfacing only via the neuron poller. Tests never caught it
because they pass models_config explicitly; only the defaulted,
packaged path was broken.
Default to the absolute /etc/cortex/models.toml (where cortex.spec installs
it) and document the override in cortex.example.toml. Restores the #62
limit/cost advertisement (the catalogue is now actually read) along with
pinning/aliases/feasibility.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolves #62. opencode's helexa provider discovers a model's serving
budget from /v1/models and uses it to size context, trigger compaction,
and show spend with no hand-configuration. Each model entry now carries:
- limit { context, input?, output } — operator-declared in models.toml
- cost { input, output, cache_read?, cache_write? } — USD per 1M tokens
- tool_call / reasoning — runtime-detected by the candle harness and
OR-ed in from each serving neuron
Composition: the catalogue profile supplies limit/cost (Pass 1); the
poller carries the neuron's detected tool_call/reasoning into ModelEntry,
which the gateway unions onto the entry (Pass 2); aliases propagate every
field (Pass 4). Wire types extend ModelInfo / ModelProfile /
CortexModelEntry additively (serde default + skip_serializing_if), so
older neurons and clients are unaffected. helexa-bench's ModelInfo
constructor and the gateway test fixtures are updated for the new fields.
Adds tests/model_limits.rs asserting /v1/models surfaces limit + cost
(catalogue) and tool_call + reasoning (runtime), and that max_model_len
is gone.
Removes max_model_len. It was write-only with no consumer — opencode's
source references it nowhere and it is not an OpenAI /v1/models field —
and doubly misleading: vLLM's max_model_len means total sequence length,
but cortex populated it from NEURON_MAX_PROMPT_TOKENS, a prompt-only cap.
The limit{} contract replaces it. The neuron's max_prompt_tokens remains
the enforced prompt cap (neuron-side); cortex just stops re-advertising a
derived, mis-named copy. Closes #66: its stale-max_model_len premise is
moot once the field is gone.
limit/cost are operator-declared (catalogue) per #62's design; auto-
deriving the advertised budget from each neuron's reported cap is a
tracked follow-up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Roll the per-model context cap into deploy.yml so it is deterministic per
host and rolled out (with a restart) alongside the rest of the service
config, rather than hand-edited in local.conf. The deploy now writes
/etc/systemd/system/neuron.service.d/model.conf from a new per-host
`max_prompt_tokens` matrix field, and restarts a neuron when the package
OR the drop-in changes — so a cap change applies even with no new RPM.
beast (Qwen3.6-27B, hybrid linear, 2x 32GB) -> 131072 (~128k); benjy and
quadbrat (dense, VRAM-bound) stay at 16384 but become deploy-managed.
Adds the scoped sudoers grant for the root-owned drop-in install, and
doc/context-limits.md documenting the knob relationships and KV/VRAM math
(refs #62 for the eventual /models-advertised source of truth, #65 for
the length-aware text VRAM guard that gates pushing beyond 128k).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fixes #63
Standardize error messages by adding type, code, and param fields to
align with OpenAI API format. Updates include:
- Structured error envelopes with broad type categorization
(invalid_request_error/api_error)
- Specific machine-readable codes (model_not_found/service_unavailable)
- Null param field as required by OpenAI specification
- Consistent error response formatting across handlers, proxy, and
routing layers
New tests verify correct error envelope structure for various failure
scenarios.
Co-Authored-By: Helexa (Qwen3.6-27B, 48k context) <noreply@helexa.ai>
InferenceError responses were a flat `{"error": "..."}` string. OpenAI
clients (opencode, the openai SDK) reach into `error.type`/`error.code`
to drive behaviour — most importantly `code == "context_length_exceeded"`
triggers auto-compaction + retry instead of a hard failure. A flat string
is invisible to that logic.
Rewrite `inference_error_response` to emit the nested envelope
`{"error": {"message","type","code","param", ...diagnostics}}` and map:
- ModelNotLoaded → 404 invalid_request_error / model_not_found
- PromptTooLong → 400 invalid_request_error / context_length_exceeded
(message: "maximum context length is N tokens", + prompt_len/max)
- InsufficientVram → 503 api_error / insufficient_vram
- VisionUnsupported→ 400 invalid_request_error / vision_unsupported
- TemplateRenderFailed → 422 invalid_request_error / template_render_failed
- Other → 500 api_error / null code
Diagnostic extras ride inside the error object so the envelope shape is
stable. Both inline match blocks in the chat-completions handler
(streaming + non-streaming) now defer to the shared helper, which the
responses handler already used — one source of truth.
Adds 4 unit tests covering the envelope shape and codes. Also fixes a
pre-existing clippy lint (cloned_ref_to_slice_refs) in qwen3_5 snapshot
test surfaced by a newer clippy.
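The mapping table above, written out as a match (a sketch; the variant
payloads and the `classify` helper are hypothetical, the status/type/code
triples are the ones listed in this message).

```rust
#[derive(Debug)]
pub enum InferenceError {
    ModelNotLoaded,
    PromptTooLong { prompt_len: usize, max: usize },
    InsufficientVram,
    VisionUnsupported,
    TemplateRenderFailed,
    Other,
}

/// Returns (HTTP status, error.type, error.code); a None code serializes
/// as null in the envelope.
pub fn classify(err: &InferenceError) -> (u16, &'static str, Option<&'static str>) {
    match err {
        InferenceError::ModelNotLoaded => (404, "invalid_request_error", Some("model_not_found")),
        InferenceError::PromptTooLong { .. } => {
            (400, "invalid_request_error", Some("context_length_exceeded"))
        }
        InferenceError::InsufficientVram => (503, "api_error", Some("insufficient_vram")),
        InferenceError::VisionUnsupported => {
            (400, "invalid_request_error", Some("vision_unsupported"))
        }
        InferenceError::TemplateRenderFailed => {
            (422, "invalid_request_error", Some("template_render_failed"))
        }
        InferenceError::Other => (500, "api_error", None),
    }
}
```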
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The deeper reason opencode showed "Context: 0 tokens / 0% used" and flew
into a 400: streaming responses carried NO `usage`. Clients track context
(and trigger compaction) from the `usage` field; the legacy candle
streaming path set `usage: None` on every chunk, so a streaming client
had no token count at all — `max_model_len` alone is a denominator with
no numerator.
InferenceEvent::Finish now carries prompt_tokens + completion_tokens
(the streaming loops already have both: prompt_tokens.len() and the
generated all_tokens.len()). The openai_chat projector emits an
OpenAI-style trailing usage chunk (empty `choices`, populated `usage`)
after the finish chunk. cortex's Anthropic stream translator already
reads chunk.usage, so this fixes context tracking on BOTH the OpenAI
(opencode) and Anthropic (Claude Code) paths.
Also hardens the discovery side of the max_model_len plumbing: cortex now
re-polls /discovery while a neuron's max_prompt_tokens is still 0
(unknown). A rolling-deploy race, where cortex caches discovery before the
neuron reports the field, now self-heals instead of pinning max_model_len
to None until cortex is restarted manually.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
opencode (and any OpenAI/Anthropic client) couldn't size or compact its
context against helexa because /v1/models never advertised a context
window — opencode showed "0 tokens / 0% used" and flew straight into a
400 PromptTooLong once a conversation + a fetched 64KB log overflowed the
49152-token cap. Compaction is the client's job, but the client needs to
know the limit to do it.
neuron now reports its effective prompt cap (NEURON_MAX_PROMPT_TOKENS)
in GET /discovery (`max_prompt_tokens`). cortex surfaces it on
/v1/models as `max_model_len` (vLLM / OpenAI-compatible convention) per
model — the smallest cap among the neurons that can serve it
(feasible_on ∪ locations), so the advertised limit holds wherever the
request routes. A neuron reporting 0 predates the field and is treated
as unknown (skipped); models with no reporting neuron omit the field.
helexa still rejects over-limit prompts with a clean 400 — this just
gives clients the number to compact *before* hitting it.
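A minimal sketch of the per-model selection rule described above. The
function name and the plain-slice input are assumptions for
illustration, not cortex's actual types:

```rust
/// Hypothetical helper: choose the context window to advertise for one model.
/// `caps` holds the `max_prompt_tokens` value reported by every neuron that
/// can serve the model.
fn advertised_max_model_len(caps: &[u64]) -> Option<u64> {
    caps.iter()
        .copied()
        // 0 means the neuron predates the field, so treat it as unknown.
        .filter(|&cap| cap != 0)
        // The smallest cap holds wherever the request ends up routing.
        .min()
}
```

With no reporting neuron the result is `None`, which corresponds to
omitting `max_model_len` from the /v1/models entry.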
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
opencode (OpenAI path, /v1/chat/completions passthrough) hit the same
chat_template:120 failure Claude Code did — "cannot convert value into
pairs" — because the OpenAI wire format carries
tool_calls[].function.arguments as a JSON *string*, while Qwen3.6's
template iterates it as a dict (`arguments | items`). The Anthropic-side
fix (8880b2f) only covered cortex's translation; the OpenAI path reaches
neuron unchanged.
render_chat_template now normalizes string-form tool-call arguments to
objects across all messages before building the Jinja context, so OpenAI
and Anthropic clients both render. Object args (Anthropic path) pass
through untouched; a string that doesn't parse is left as-is and the
render fails loudly (422 TemplateRenderFailed, a94dd55) rather than
silently dropping tools.
The loud-fail change paid off immediately here: opencode got a clean
422 with the exact `chat_template:120` cause instead of a degraded
session.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three of this session's bugs (system-message position, tool_call argument
shape, and the original tool rendering) all hid behind the same silent
behaviour: chat_template render fails → neuron falls back to
format_qwen3_prompt, which drops every tool → the request still returns
200 with degraded, tool-less output. Each cost real debugging time
because the failure was invisible on the wire.
build_prompt_for_request now returns Result. On a render failure it
checks whether the request carried tools: if so it returns the new
InferenceError::TemplateRenderFailed (mapped to 422 with a
template_render_failed code and the underlying Jinja error), instead of
silently degrading. A render failure with no tools still falls back
quietly — there's nothing to lose, and `format_qwen3_prompt` is a
reasonable text-only prompt. The four prompt-build call sites propagate
with `?`.
Now the next client/template incompatibility surfaces as a loud 422 the
operator sees immediately, not a mysteriously-degraded session.
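The decision can be sketched roughly as below. Only
InferenceError::TemplateRenderFailed and the 422 mapping come from this
change; the function signature, the closure-based fallback, and
`http_status` are simplified assumptions:

```rust
#[derive(Debug, PartialEq)]
enum InferenceError {
    /// Carries the underlying template error text.
    TemplateRenderFailed(String),
}

/// Hypothetical reduction of the prompt-build decision: `rendered` is the
/// outcome of the chat-template render, `tool_count` the number of tools on
/// the request, `fallback` a stand-in for the text-only prompt builder.
fn build_prompt(
    rendered: Result<String, String>,
    tool_count: usize,
    fallback: impl FnOnce() -> String,
) -> Result<String, InferenceError> {
    match rendered {
        Ok(prompt) => Ok(prompt),
        // Tools would be silently dropped by the fallback: fail loudly.
        Err(cause) if tool_count > 0 => Err(InferenceError::TemplateRenderFailed(cause)),
        // No tools on the request: the text-only fallback loses nothing.
        Err(_) => Ok(fallback()),
    }
}

/// Assumed status mapping for the new error variant.
fn http_status(err: &InferenceError) -> u16 {
    match err {
        InferenceError::TemplateRenderFailed(_) => 422,
    }
}
```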
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified live via the rendered-prompt trace: once a tool call is in the
conversation history, the Qwen3.6 chat template fails to render —
render chat_template: invalid operation: cannot convert value into
pairs (in chat_template:120)
because line 120 iterates `tool_call.arguments | items` (treats arguments
as a dict), while cortex emitted the OpenAI-standard JSON *string*. On
that render error neuron silently falls back to a tool-less prompt, so
the model loses every tool the moment it makes one call — it can make the
first tool call, read the result, then can only narrate ("now let me
check the runs") and stop, because the next turn has no tools. That's the
"drops the ball a little later" symptom: the CC trace shows the get_me
turn rendering 42653 tokens (tools present) and every subsequent
tool-history turn falling back to ~6k tokens (tools gone).
anthropic_to_openai now passes `function.arguments` as the parsed object
rather than stringifying it. Tests updated to expect the object form.
This is the same silent-fallback failure class as the system-message
merge (295b10c) — which is why making neuron's template-render fallback
LOUD (4xx on a tools-bearing request instead of a degraded 200) is now
clearly worth doing: it would have surfaced both in seconds.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified live: Qwen/Qwen3.6-27B with a simple prompt and max_tokens=400
generated 400 tokens, finish_reason=length, and 0 visible characters —
the model spent the ENTIRE budget on <think> reasoning, which we then
drop for OpenAI/Anthropic clients (include_thinking=false), starving the
visible answer. This is why Claude Code "dropped the ball": empty or
truncated responses. A/B confirms the cause — same prompt with
chat_template_kwargs.enable_thinking=false yields a full 545-char answer.
The earlier prompt_opens_reasoning fix stopped the reasoning *leaking* as
text but left it consuming the token budget. Couple the two: when the
caller isn't going to see the reasoning (include_thinking=false, the
default), default chat_template_kwargs.enable_thinking to false so the
model doesn't generate it. An explicit client enable_thinking wins;
thinking-aware clients (helexa-acp, x-include-thinking: true) keep
reasoning on. Tests cover the default (false), surfacing (true), explicit
override, and preservation of other kwargs.
Note: this covers only the /v1/chat/completions path (what Claude Code
uses via cortex /v1/messages); /v1/responses could get the same
defaulting as a follow-up.
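A sketch of the defaulting rule, with a hypothetical function name.
Returning `None` (leave the kwarg unset, so the template's own default
applies) when reasoning is surfaced is an assumption of this sketch:

```rust
/// Hypothetical resolver for `chat_template_kwargs.enable_thinking`.
/// `include_thinking` says whether the caller will see reasoning;
/// `client_value` is the client's explicit setting, if any.
fn resolve_enable_thinking(include_thinking: bool, client_value: Option<bool>) -> Option<bool> {
    match client_value {
        // An explicit client choice always wins.
        Some(explicit) => Some(explicit),
        // Reasoning would be dropped anyway: do not spend tokens on it.
        None if !include_thinking => Some(false),
        // Thinking-aware caller: leave the template default untouched.
        None => None,
    }
}
```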
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified live via neuron trace: Claude Code's real requests carry a
top-level `system` AND a `role:"system"` turn inside `messages`. cortex
passed the latter through at a non-first position, and Qwen3.6's chat
template hard-rejects it:
WARN chat_template render failed; falling back to format_qwen3_prompt
error=... invalid operation: System message must be at the beginning.
On that render error neuron silently falls back to a template that
renders NO tools, so the model got zero tool-format guidance and
improvised an unparseable `<tool><name>…` syntax — tool calling broke
entirely for real CC traffic, even though synthetic single-system
probes (and the earlier translation/parse fixes) worked.
anthropic_to_openai now accumulates the top-level `system` plus every
`role:"system"` conversation turn and emits a single system message at
index 0, with the non-system turns following in order. Reproduced the
trigger (system-role message at index>0 → fallback) and the fix
(merged → template renders tools). Test covers the merge + ordering.
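The merge can be sketched as follows. The `Message` struct, the function
name, and the blank-line separator between merged system parts are
assumptions; the real translation works on the Anthropic/OpenAI wire
types:

```rust
#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: String,
    content: String,
}

/// Hypothetical merge: collect the top-level `system` plus every
/// `role:"system"` turn into one system message at index 0, keeping the
/// remaining turns in their original order.
fn merge_system_messages(top_level_system: Option<&str>, turns: &[Message]) -> Vec<Message> {
    let mut system_parts: Vec<&str> = Vec::new();
    if let Some(text) = top_level_system {
        system_parts.push(text);
    }
    let mut rest: Vec<Message> = Vec::new();
    for turn in turns {
        if turn.role == "system" {
            system_parts.push(turn.content.as_str());
        } else {
            rest.push(turn.clone());
        }
    }
    let mut out = Vec::with_capacity(rest.len() + 1);
    if !system_parts.is_empty() {
        out.push(Message {
            role: "system".to_string(),
            // Separator is an assumption of this sketch.
            content: system_parts.join("\n\n"),
        });
    }
    out.extend(rest);
    out
}
```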
Secondary hardening worth a follow-up: neuron's silent template
fallback drops tools without surfacing it to the client — a render
failure on a tools-bearing request should arguably 4xx rather than
degrade invisibly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Debugging tool-call format drift (Qwen3.6-27B emitting wrapper-less
<tool><name>…> under Claude Code's real system prompt + 120-tool list,
which neuron's <tool_call> detector can't parse) needs ground truth on
what the model actually sees. neuron logged nothing about the rendered
prompt. Add a trace! in build_prompt_for_request emitting the full
rendered prompt + char count + tool count, so we can see whether the
chat template's <tool_call> format instruction survives a large system
prompt and how the tools render. Gated at trace (the prompt can be tens
of KB): RUST_LOG=neuron::harness::candle=trace.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A neuron-blackwell build hung ~90 min (siblings finished in 2 min) and there
was no job timeout to kill it, so it sat burning a runner. Root cause of
the hang: the inline retry loop treated every failure identically and, on
its final attempt, rebuilt with sccache disabled. When the real failure
is a rustc SIGSEGV or an OOM-kill, an uncached rebuild does *more* work
under the same memory pressure — turning one transient compiler crash
into a wedged job.
Two fixes:
1. timeout-minutes on every job in build-prerelease.yml and ci.yml
(builds 25, neuron CUDA build/cuda-check 35, packaging 20, COPR 60,
fast jobs 10-15). A hang now dies in minutes, not hours.
2. New script/ci-cargo-escalate.sh replaces the five (prerelease) + three
(ci) inline escalation loops. It classifies the failure:
- signal death (exit >=128, or cargo reporting `signal: N`/SIGSEGV/
SIGKILL) → compiler crash, NOT an sccache fault: keep the cache,
one warm retry, then fail fast. Never escalate to uncached.
- sccache fault (recognisable sccache error) → restart the server,
retry, then one final uncached attempt.
- deterministic compile/test error → fail fast (no wasteful retry).
It also folds in the CUDA-image sccache probe the neuron/cuda-check
jobs did inline. Classification verified locally against success,
plain failure, exit-139, and the cargo-wrapped `signal: 11` form.
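The real classifier is a shell script; the same decision table is
sketched here in Rust for illustration. The substring used to recognise
an sccache fault is an assumption (the script's actual patterns are not
listed here), as are all type and function names:

```rust
#[derive(Debug, PartialEq)]
enum Failure {
    SignalDeath,
    SccacheFault,
    Deterministic,
}

#[derive(Debug, PartialEq)]
enum Action {
    /// Keep the cache, one warm retry, then fail. Never go uncached.
    WarmRetryThenFail,
    /// Restart the sccache server, retry, then one final uncached attempt.
    RestartRetryThenUncached,
    /// Ordinary compile/test error: retrying would only waste time.
    FailFast,
}

fn classify(exit_code: i32, output: &str) -> Failure {
    let signalled = exit_code >= 128
        || output.contains("signal: ")
        || output.contains("SIGSEGV")
        || output.contains("SIGKILL");
    if signalled {
        Failure::SignalDeath
    } else if output.contains("sccache") {
        // Assumed marker for a recognisable sccache error.
        Failure::SccacheFault
    } else {
        Failure::Deterministic
    }
}

fn plan(failure: &Failure) -> Action {
    match failure {
        Failure::SignalDeath => Action::WarmRetryThenFail,
        Failure::SccacheFault => Action::RestartRetryThenUncached,
        Failure::Deterministic => Action::FailFast,
    }
}
```

The signal check runs first, so a compiler crash that happens to mention
sccache in its output is still never escalated to an uncached rebuild.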
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Qwen3.6's chat template injects the opening <think> into the generation
prompt, so generation begins mid-thought and the open marker is never
sampled. The streaming loops flipped in_reasoning to true only on a
*generated* open token, so they stayed in text mode and streamed the
model's reasoning out as visible text — verified live: a tool request
returned a 255-char text block of chain-of-thought ("The user wants to
know the weather… I will construct the function call now.") ahead of the
tool_use block, with the trailing </think> stripped (close token
recognised) but no opening <think>.
Each streaming loop now seeds in_reasoning by replaying the prompt's
reasoning markers (new `prompt_opens_reasoning`): if the prompt ends
inside an open <think>, the loop starts in reasoning mode, the thinking
routes to ReasoningDelta (dropped by the chat projector's default
include_thinking=false, which is what cortex uses), and the model's
</think> flips back to visible text for the answer/tool call. Template-
agnostic and self-correcting: a prompt that doesn't open reasoning (no
think injection, enable_thinking off, non-reasoning model) starts false,
preserving current behaviour. Thinking is hidden, not disabled, so answer
quality is unaffected.
Applied to all three streaming loops (inference_tp_stream,
stream_inference_via_worker, run_inference_streaming). Test covers
open/close replay, multi-turn closed state, reopen-at-tail, and the
no-pair pass-through.
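A minimal sketch of the marker replay. It works on the rendered prompt
text and hard-codes the <think> / </think> pair; whether the real
`prompt_opens_reasoning` operates on text or token ids, and how it
learns the markers, is not stated here, so treat both as assumptions:

```rust
/// Replay reasoning markers over the prompt and report whether it ends
/// inside an open reasoning block.
fn prompt_opens_reasoning(prompt: &str) -> bool {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";
    let mut in_reasoning = false;
    let mut rest = prompt;
    loop {
        // Outside reasoning we look for an opener, inside for a closer.
        let marker = if in_reasoning { CLOSE } else { OPEN };
        match rest.find(marker) {
            Some(idx) => {
                in_reasoning = !in_reasoning;
                rest = &rest[idx + marker.len()..];
            }
            None => return in_reasoning,
        }
    }
}
```

A streaming loop would seed its `in_reasoning` flag from this result
instead of always starting at false.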
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified live (commit d662fa2 logs): cortex now delivers OpenAI-shaped
tools to neuron correctly, but Qwen3.6-27B emits tool calls in the
Qwen-XML form inside the <tool_call> markers —
<tool_call>
<function=get_weather>
<parameter=city>
Brno
</parameter>
</function>
</tool_call>
— while parse_tool_call_body only did serde_json::from_str expecting
{"name":…,"arguments":…}. It returned None, the dispatch re-emitted the
raw block as a text delta, and clients saw the markup as prose. cortex
logged upstream_tool_calls=false finish_reason="stop".
parse_tool_call_body is now format-tolerant: JSON first (Qwen3-Instruct
/ Hermes), then a Qwen-XML parser (Qwen3-Coder / Qwen3.6). Each
<parameter> value is coerced to its declared JSON type using a new
ToolSchemas map built from the request's tools (string stays string,
integer/number/boolean/object/array coerced, mistyped values fall back
to string so an argument is never dropped). build_tool_schemas is
threaded into all three streaming loops (inference_tp_stream,
stream_inference_via_worker, run_inference_streaming).
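A std-only sketch of the Qwen-XML branch with schema-driven coercion.
All names (`Arg`, `ToolCall`, `ParamTypes`, `parse_qwen_xml`) are
hypothetical, the schema is reduced to one declared type per parameter
name, and object/array coercion and no-schema type sniffing are left
out:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Arg {
    Str(String),
    Int(i64),
    Num(f64),
    Bool(bool),
}

#[derive(Debug, PartialEq)]
struct ToolCall {
    name: String,
    args: Vec<(String, Arg)>,
}

/// Declared JSON type per parameter name, e.g. "days" -> "integer".
type ParamTypes = HashMap<String, String>;

/// Coerce to the declared type; a mistyped value falls back to a string so
/// the argument is never dropped.
fn coerce(raw: &str, declared: Option<&str>) -> Arg {
    let parsed = match declared {
        Some("integer") => raw.parse::<i64>().ok().map(Arg::Int),
        Some("number") => raw.parse::<f64>().ok().map(Arg::Num),
        Some("boolean") => raw.parse::<bool>().ok().map(Arg::Bool),
        _ => None,
    };
    parsed.unwrap_or_else(|| Arg::Str(raw.to_string()))
}

/// Text between `open` and the next `close`, plus whatever follows `close`.
fn between<'a>(s: &'a str, open: &str, close: &str) -> Option<(&'a str, &'a str)> {
    let start = s.find(open)? + open.len();
    let end = start + s[start..].find(close)?;
    Some((&s[start..end], &s[end + close.len()..]))
}

fn parse_qwen_xml(body: &str, types: &ParamTypes) -> Option<ToolCall> {
    let (header, _) = between(body, "<function=", "</function>")?;
    let (name, mut rest) = header.split_once('>')?;
    let mut args = Vec::new();
    while let Some((param, tail)) = between(rest, "<parameter=", "</parameter>") {
        let (key, value) = param.split_once('>')?;
        let declared = types.get(key.trim()).map(String::as_str);
        args.push((key.trim().to_string(), coerce(value.trim(), declared)));
        rest = tail;
    }
    Some(ToolCall { name: name.trim().to_string(), args })
}
```

A format-tolerant caller would try the JSON form first and only fall
through to this parser when that fails.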
Each loop also tracks emitted_tool_call and promotes the terminal
finish_reason from Stop to ToolCalls when a call parsed, so the OpenAI
chunk carries finish_reason:"tool_calls" and cortex maps it to Anthropic
stop_reason:"tool_use" — without which an Anthropic agent (Claude Code)
sees a tool_use block but stop_reason:end_turn and may not run the tool.
FinishReason::ToolCalls drops its dead_code allow.
Tests: JSON form still parses; Qwen-XML multi-param parse with
schema-driven string/integer/boolean coercion; no-schema type sniffing;
type-mismatch string fallback; unparseable body returns None.
Known gap (separate): the non-streaming run_inference paths have no
tool-call handling at all; Claude Code streams, so the streaming loops
are the ones that matter here.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude Code (ANTHROPIC_BASE_URL -> cortex) hits POST /v1/messages, but
anthropic_to_openai forwarded the request's `tools` array verbatim via
the flattened `extra`. neuron feeds that straight into the HF chat
template, which iterates the OpenAI shape (tool.function.name/.parameters).
Anthropic-shaped tools ({name, description, input_schema}) rendered as
broken/empty definitions, the model improvised an unparseable
<tool_use_name>...</tool_use_name> tool-call format, neuron's
<tool_call>{json}</tool_call> detector missed it, and the markup fell
through as plain assistant text — so CC never received a structured
tool_use and the agent loop died.
Request-side translation now reshapes:
- tool definitions: {name, description, input_schema}
-> {type:"function", function:{name, description, parameters}}
- tool_choice: auto->"auto", any->"required", none->"none",
tool->{type:"function",function:{name}}
- assistant tool_use blocks -> OpenAI assistant.tool_calls
(arguments JSON-stringified) — fixes multi-turn
- user tool_result blocks -> standalone role:"tool" messages keyed by
tool_call_id
- system content blocks flatten to text instead of being JSON-serialised
into the prompt; best-effort image-block -> image_url part
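The tool_choice row of the list above, sketched with a hypothetical
enum. The output is built as a JSON string by hand to stay
dependency-free, which assumes the tool name needs no JSON escaping; the
real translation presumably goes through a JSON library:

```rust
/// Anthropic-side tool_choice variants (names are illustrative).
enum AnthropicToolChoice {
    Auto,
    Any,
    None,
    Tool { name: String },
}

/// Render the OpenAI-side `tool_choice` value as JSON text.
fn to_openai_tool_choice(choice: &AnthropicToolChoice) -> String {
    match choice {
        AnthropicToolChoice::Auto => r#""auto""#.to_string(),
        // Anthropic "any" means "must call some tool", i.e. OpenAI "required".
        AnthropicToolChoice::Any => r#""required""#.to_string(),
        AnthropicToolChoice::None => r#""none""#.to_string(),
        AnthropicToolChoice::Tool { name } => format!(
            r#"{{"type":"function","function":{{"name":"{}"}}}}"#,
            name
        ),
    }
}
```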
Wire-debug instrumentation (tracing levels only; cortex/neuron ship at
info, operator infra runs at debug):
- every handler emits a debug! "inbound request" line tagging the wire
surface (anthropic | openai-chat | openai-responses | openai-completions)
plus model/stream/tools and, for Anthropic, tool_history/system
- response side reports upstream_tool_calls + finish_reason, streaming
and non-streaming
- full inbound + translated-upstream bodies at trace! (UTF-8-safe, capped)
Tests: 8 request-side unit tests + an end-to-end gateway test asserting
the upstream neuron receives OpenAI-shaped tools and a
user->assistant(+tool_calls)->tool->user history.
Also tighten script/infra-log-verbosity.sh: independent cortex/neuron
RUST_LOG args, cortex-only by default (neuron restart behind
--with-neuron so we don't needlessly cold-reload models), mkdir -p the
drop-in dir, symmetric RUST_LOG cleanup, and set -euo pipefail.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Public visitors don't know the hostnames, so surface each host's GPU(s)
as the resource name across the UI.
- store: gpu_label() turns the stored gpus_json into a compact label
("2× RTX 5090", "RTX 4090"); add `gpu` to ReportRow + RunRow and
`host_gpus`/`model_gpus` maps to /api/dimensions (from each one's
latest run). render_json gains gpu too.
- UI: Overview + Runs show a "GPU" column (gpu, fallback host); Runs'
filter is now GPU-labelled (still filters by host underneath); Trends
shows a "Measured on <gpu>" line for the selected model.
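A sketch of the label compaction. It assumes gpus_json has already been
parsed into a list of GPU names; joining mixed GPU models with " + " is
an assumption of this sketch, not something this change specifies:

```rust
/// Turn a host's GPU list into a compact label such as "2× RTX 5090".
fn gpu_label(gpus: &[&str]) -> String {
    // Group identical names while preserving first-seen order.
    let mut groups: Vec<(&str, usize)> = Vec::new();
    for &gpu in gpus {
        match groups.iter().position(|(name, _)| *name == gpu) {
            Some(i) => groups[i].1 += 1,
            None => groups.push((gpu, 1)),
        }
    }
    groups
        .iter()
        .map(|&(name, count)| {
            if count > 1 {
                format!("{}× {}", count, name)
            } else {
                name.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" + ")
}
```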
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Public visitors don't know the hostnames or per-host hardware, so the
host picker on Trends was confusing. Select by model + scenario only;
/api/series now takes host as optional and resolves it to the host
serving that (model, scenario) — coherent since each model maps to one
host today. Runs (drill-down) keeps its host filter.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a dashed vertical ReferenceLine at the first live build (labelled
"bench.py → helexa-bench") so the intentional gap between the gateway
baseline and the direct-to-neuron series reads as a deliberate
measurement-regime change, not missing data. The two series stay
unconnected by design (different regimes, not directly comparable).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Option C: a curated static baseline (bench/src/baseline.ts), transcribed
from doc/benchmarks.md (8f6f1d3 + a1952a4 post-#11), overlaid on the
Trends charts as a dashed, clearly-labelled historical series ahead of
the bench era. Host inferred from model via the doc's fleet table;
ordered by snapshot time so it anchors the timeline.
Kept deliberately separate from the live series (no DB/API change) — the
baseline is a different regime (bench.py through the cortex gateway,
medians only) so it's never merged into the direct-to-neuron line; a
caption spells out the distinction.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- cert_present() must `sudo test -d /etc/letsencrypt/live/...` (root-only
0700); without sudo it falsely reported "no cert" and downgraded the
bench.helexa.ai vhost to the http-only bootstrap (dropping its 443
server). Now correctly keeps the full TLS vhost.
- bench.internal initial cert: rsync the operator's JWK 'lair' provisioner
password to the host transiently (root, 0600), issue via
step ca certificate, then remove it (trap + belt-and-suspenders rm).
Verified: bench.helexa.ai (LE) and bench.internal (lair CA) both serve the
SPA + /api→bob; step@bench.timer renews; secret removed from host.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Inside the WireGuard mesh, bench.helexa.ai dead-ends at the OPNsense LAN
interface (only WAN :443 is port-forwarded), so add an internal path:
- asset/nginx/bench.internal.conf — server_name bench.internal, internal
"lair" CA cert, same SPA + /api→bob proxy. Mirrors the *.internal vhost
convention on oolon.kosherinata.internal.
- asset/systemd/step@.{service,timer} — replicate oolon's smallstep cert
renewal (step ca renew via mTLS, every 15 min, reload nginx).
- infra-setup.sh: install the step@ units + /etc/nginx/tls/{cert,key},
install the vhost + enable step@bench.timer once the cert exists; prints
the one-time issuance command otherwise.
Initial cert issuance (JWK provisioner) and bench.internal DNS are
operator steps.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The webroot/http-01 approach needed nginx serving :80, but the gateway's
nginx was dormant. Switch to the host's established convention —
certbot --dns-cloudflare --key-type ecdsa with /root/.certbot-internal —
which needs neither nginx nor :80, so the cert provisions independently
of the vhost being served. Also restorecon the webroot (SELinux
enforcing → nginx 403 without httpd_sys_content_t), and only ever
install the full TLS vhost once the cert exists (http-only bootstrap
otherwise) so `nginx -t` always passes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nginx on the gateway serves the bench SPA and reverse-proxies /api to the
bob bench API over WireGuard — public, auth-less, same-origin (no CORS),
internal API stays private.
- asset/nginx/bench.helexa.ai.conf (full TLS vhost: SPA + /api proxy) and
a bootstrap http-only vhost for the initial ACME challenge.
- infra-setup.sh: one-time gateway setup — webroot, Let's Encrypt cert
(certbot webroot, idempotent), install + enable the vhost.
- deploy.yml: deploy-bench-ui builds the SPA (setup-node) and rsyncs
dist/ to /var/www/bench.helexa.ai every deploy; built same-origin so
no VITE_API_BASE.
- cortex-host.conf: scoped gitea_ci rsync grant for the webroot.
- bench/README: production hosting notes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Part A — helexa-bench read API:
- [api] config (enabled, listen :13132); WAL on the store so API reads
never block the sweep writer.
- store read methods: summary, series (chronological per-build medians),
runs (filtered), dimensions, run_count.
- api.rs: axum /api/health|dimensions|summary|series|runs, permissive
CORS (UI is a separate origin). The `run` daemon binds the API
alongside the sweep; new `serve` subcommand serves API-only.
- listener plumbing (bench gains a port): data/helexa-bench-firewalld.xml,
spec install, deploy-bench /api/health probe + firewalld step, sudoers
firewall-cmd grants, [api] in example + bob.toml.
- 5 API tests + serve smoke.
Part B — bench/ Vite + React-SWC-TS app (router, react-bootstrap,
recharts): Overview (summary table), Trends (decode tok/s & TTFT across
build SHAs), Runs (filterable explorer). Typed API client with
VITE_API_BASE + dev proxy to bob. npm build/typecheck clean. Hosted
separately from the API (per design); .gitignore excludes node_modules/dist.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a deploy-bench job to deploy.yml that rolls helexa-bench onto bob
(the bench host, also running Agent Zero), following the deploy-cortex
pattern: manifest-gated skip-when-current, light "service stays active"
validation (outbound-only, no listener/model to probe), journal capture.
Runs alongside the cortex→neurons chain (no deploy-ordering dependency —
the sweep loop is version-aware).
Boot persistence: all systemd deployments now `systemctl enable --now`
instead of bare `start`, so cortex / neuron / helexa-bench come back
after a host reboot. Covers deploy.yml (all three services) and
deploy-dev.yml (neuron fast path); sudoers gain the matching
`enable --now <svc>` grant.
infra-setup.sh handles bob: provisions gitea_ci, installs the
bench-host sudoers, enables the lair-cafe-unstable repo (bob is a client
host without it), pre-creates /etc/helexa-bench, and syncs
asset/helexa-bench/bob.toml. New assets: bench-host.conf sudoers and
bob.toml (three neuron targets).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(bench): version-aware benchmark harness + neuron build metadata
Adds GET /version build metadata to neuron, and adds the helexa-bench
crate: a continuous, version-aware harness that records fleet benchmarks
into SQLite keyed by neuron build SHA, replacing manual bench.py runs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds automated, longitudinal performance tracking across neuron builds,
replacing manual script/bench.py runs and hand edits to benchmarks.md.
neuron build metadata + GET /version:
- cortex-core: shared BuildInfo type (build_info.rs).
- neuron build.rs captures git SHA (preferring injected HELEXA_BUILD_SHA,
else git, else "unknown"), dirty flag, build timestamp, rustc version,
profile, target, enabled cargo features, and best-effort candle-core
version from Cargo.lock.
- New GET /version endpoint (version.rs) + clap --version long form.
- SHA injected in CI (build-neuron step) and helexa-neuron.spec
(%{?helexa_commit}) so tarball RPMs report the real SHA. /version is
now the canonical "which build is live" probe.
helexa-bench crate:
- Continuous daemon: hits each neuron directly on :13131, exercises each
warm (status==loaded) model, records every run into a SQLite
system-of-record stamped with the neuron's full BuildInfo.
- Version-aware: skips any (target, build SHA, model, scenario) cell
already at samples_per_version, so a steady fleet costs only cheap
/version + /models polls until a new SHA ships.
- Extensible Scenario trait; phase-1 chat-latency family ported verbatim
from bench.py (synthetic 128/4096-tok prompts, /no_think, streamed
TTFT + decode-window tok/s). `report` regenerates the benchmarks table.
- kind="openai" comparison targets scaffolded, not yet wired.
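The version-aware skip can be sketched with an in-memory counter. The
real store is SQLite; `Cell` and `SampleLedger` are hypothetical names
used only to show the record / skip / resume behaviour:

```rust
use std::collections::HashMap;

/// One benchmark cell: (target, build SHA, model, scenario).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Cell {
    target: String,
    build_sha: String,
    model: String,
    scenario: String,
}

/// In-memory stand-in for the per-cell sample counts.
struct SampleLedger {
    samples_per_version: u32,
    counts: HashMap<Cell, u32>,
}

impl SampleLedger {
    fn new(samples_per_version: u32) -> Self {
        Self { samples_per_version, counts: HashMap::new() }
    }

    /// A cell needs another run until it has `samples_per_version` samples.
    fn needs_run(&self, cell: &Cell) -> bool {
        self.counts.get(cell).copied().unwrap_or(0) < self.samples_per_version
    }

    fn record(&mut self, cell: Cell) {
        *self.counts.entry(cell).or_insert(0) += 1;
    }
}
```

Because the build SHA is part of the key, a new neuron build starts from
zero samples and the sweep resumes on its own.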
Packaging: data/helexa-bench.service (+ sysusers), prebuilt-binary RPM
spec (outbound-only, no firewalld), and build/package/publish wiring in
build-prerelease.yml with change detection.
Tests: cortex-core BuildInfo round-trip, neuron GET /version integration,
helexa-bench unit (prompt/SSE/config/store) + end-to-end sweep
(record -> skip -> resume on new SHA). Docs updated (benchmarks.md,
CLAUDE.md addendum).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The neuron fleet builds with `cuda cudnn flash-attn`, but nothing in
neuron uses flash-attn: the qwen3_5 (27B) arch is hand-rolled, the
candle-transformers qwen3 model has no flash path, llama is built with
use_flash_attn=false, and `grep flash crates/neuron/src` is empty. The
feature only pulls in candle-flash-attn's sm_80/sm_86 CUDA kernel
sweep — which is exactly where ptxas SIGSEGVs/hangs in #42 (3 hits in
one day, the last a ~4-hour hang that stalled the whole deploy behind
the ampere job).
Dropping the feature removes the #42 failure surface at the root (not
a mitigation) and cuts the longest, most fragile part of each flavour
build. No runtime change — nothing called those kernels. Removed from
all three flavour builds in build-prerelease.yml and from deploy-dev.yml;
ci.yml's cuda-check already used `--features cuda` only.
Closes #42
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
First phase of speculative decoding: the pure, state-free acceptance
logic and per-target config, unit-tested in isolation before the
draft/verify loop and GDN-state rollback wire it into the generation
path.
greedy_accept walks the drafter's K proposed tokens against the
target's greedy token at each of the K+1 positions, accepting the
longest matching prefix and always committing one bonus token on top
(the target's correction at the first mismatch, or a free extra token
when the whole draft matched). So a round commits 1..=K+1 tokens —
never zero, guaranteeing forward progress even with a useless drafter.
Greedy is exact for temperature-0 (the fleet probe + #22 bench
regime); stochastic acceptance is a later phase.
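A minimal sketch of the acceptance rule as described. The signature
(token ids as `u32`, committed tokens returned as a `Vec`) is an
assumption and may differ from the real greedy_accept:

```rust
/// `draft` holds the drafter's K proposed tokens; `target` holds the
/// target's greedy token at each of the K+1 positions.
/// Returns the tokens committed this round (always 1..=K+1 of them).
fn greedy_accept(draft: &[u32], target: &[u32]) -> Vec<u32> {
    assert_eq!(target.len(), draft.len() + 1, "target must cover K+1 positions");
    // Length of the longest prefix on which drafter and target agree.
    let matched = draft
        .iter()
        .zip(target)
        .take_while(|(d, t)| d == t)
        .count();
    let mut committed = draft[..matched].to_vec();
    // Bonus token: the target's correction at the first mismatch, or a free
    // extra token when the whole draft matched.
    committed.push(target[matched]);
    committed
}
```

Even when the first drafted token is wrong, the round still commits the
target's own token, which is the forward-progress guarantee.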
SpeculativeConfig carries the drafter id (must share the target's
tokenizer — Qwen3.5-0.8B for the Qwen3.6-27B target, both qwen3_5,
byte-identical tokenizer, confirmed on beast) and the draft length K.
6 unit tests: full accept, partial accept, zero accept (progress
guarantee), last-position mismatch, single-token draft, config
gating. Not yet wired into the decode path — phase 2 (single-GPU
draft/verify) follows. Design + phasing on the issue.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The single-GPU vision path was still single-shot: a long vision-bearing
prompt to a single-GPU-loaded qwen3_5 had the OOM exposure the TP path
shed in fa01350 (it was only guard-rejected, never served).
Mirror TpQwen3_5ForCausalLM::prefill_with_images_chunked onto the
single-GPU Qwen3_5ForCausalLM: encode the image(s) once, walk the
pre-expanded prompt in prefill_chunk_tokens() windows splicing the
per-chunk <|image_pad|> rows, accumulate KV + GDN state across chunks
via the growing offset, keep the last chunk's logits. Interleaved
M-RoPE positions are computed once over the whole prompt and sliced
per chunk (an image compresses the position space, so per-chunk offset
arithmetic would be wrong) — so Qwen3_5Model::forward_inner gains an
explicit position_ids path alongside the internal-from-grids
(single-shot) and plain (text/decode) paths, plus a forward_with_positions
entry point. The device-worker ForwardLogitsWithImages handler now
calls the chunked method; chunk size comes from prefill_chunk_tokens()
on the worker thread, so the Job/handle surface and the callers are
unchanged.
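The windowing and position slicing can be sketched as below. Function
names are hypothetical, and positions are reduced to a single flat axis,
whereas the real interleaved M-RoPE positions carry more than one
component:

```rust
use std::ops::Range;

/// Split a prompt of `prompt_len` tokens into prefill windows of at most
/// `chunk_size` tokens; the last window may be shorter.
fn chunk_windows(prompt_len: usize, chunk_size: usize) -> Vec<Range<usize>> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    (0..prompt_len)
        .step_by(chunk_size)
        .map(|start| start..(start + chunk_size).min(prompt_len))
        .collect()
}

/// Positions are computed once over the whole prompt and then sliced per
/// window, rather than re-derived from a per-chunk offset.
fn positions_per_chunk<'a>(positions: &'a [u32], chunk_size: usize) -> Vec<&'a [u32]> {
    chunk_windows(positions.len(), chunk_size)
        .into_iter()
        .map(|window| &positions[window])
        .collect()
}
```

For a 217-token prompt with chunk_size=64 this yields four windows, the
last covering tokens 192..217.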
The shared validate_vision_prefill VRAM/KV backstop stays (TP keeps it
too) — chunking bounds activation memory, not the accumulating KV
cache, so the guard still does useful work.
Verified on real weights (Qwen3.5-0.8B): extended the #15 vision
reference test to also run the chunked path with chunk_size=64 over the
217-token prompt (4 chunks; the ~196-token image-pad run spans them).
Chunked vs single-shot logits: cosine 1.000000, max_abs 0.0001;
argmax matches the HF reference. The test covers all three
forward_inner branches (text plain / single-shot vision / chunked
vision) on a real single-GPU qwen3_5 load.
Closes #18
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
doc/plan/* is gitignored, so the P1 learnings briefing could never be
committed. Move it to doc/learnings/p1.md (verbatim) and add
doc/learnings/p2.md capturing the P2 sprint (#11/#23/#1/#15).
The P2 doc's headline: CI green != correct. Four correctness bugs
passed every CI gate and surfaced only on the live fleet (post-gen
snapshots never re-match reasoning models; full-prompt snapshots
break on BPE retokenization; the chunked delta-rule's nilpotent-
squaring shortcut NaNs on correlated keys; the 0.8B masked two of
these by luck). Plus the device-worker/TP state patterns, the
deploy-dev + systemd-drop-in A/B loop, the per-package change-
detection fleet-split failure mode (#42), and the f32-fixture
numerical-validation rig (#15).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
script/dump_reference.py captures fixtures from the HF qwen3_5
implementation (token ids + reference tensors, f32 by default so the
comparison pins math rather than dtype noise);
tests/numerical_reference.rs replays them through our arch and
asserts argmax equality, cosine similarity, and max-abs ceilings. The
tests self-skip without NEURON_REF_MODEL_PATH so CI stays green
without weights.
Measured on beast (f32-vs-f32): text logits max_abs 0.000 / cosine
1.000000 (the >64-token prompt routes through the chunked GDN
prefill, so the production prefill math is what's validated); vision
tower cosine 0.999998, end-to-end vision logits cosine 1.000000 with
identical argmax. Mutation sensitivity: NEURON_VISION_LEGACY_POS=1
collapses tower cosine to 0.75 and fails loudly.
One production fidelity fix the harness surfaced: the pos-embed
bilinear blend now accumulates in f32 and casts once at the end,
matching the reference (we previously rounded the weights to bf16
before blending).
Fixtures: 0.8B text + vision (f32), 27B text (bf16 — an f32 27B
forward needs ~108 GB; the automated comparison runs against the
0.8B, which executes the same arch modules). Regeneration documented
in tests/fixtures/numerical/README.md.
Closes #15
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
QTensor::quantize runs its per-block math strictly sequentially on
one core (CUDA storage round-trips through the same CPU path), which
made Q6K ISQ the dominant phase of the 27B TP cold load. Blocks are
independent, so quantize_parallel re-implements the same encoding
through candle's public per-block API (k_quants::GgmlType::from_float)
with rayon fanning blocks across the CPU pool — byte-identical output,
pinned by parity tests against QTensor::quantize for Q6K/Q5K/Q4K/Q8_0.
Threading discipline holds: the device-to-host read and the
QStorage::from_data upload stay on the calling thread (device worker /
subprocess main); rayon workers touch host memory only.
Also adds the per-phase timing the issue asked for first: per-layer
debug + layer-loop total + lm_head info lines, so the next cold load
shows where the time actually goes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Live A/B on beast produced NaN logits ("!!!" replies) on real prompts:
the nilpotent-squaring form of (I - T)^-1 computes raw powers of T,
whose entries grow combinatorially (path counts ~ C(62,31)) before
nilpotency collapses them — fine on uncorrelated test data, f32
precision death on real prompts whose repetitive text makes keys
highly correlated. The reference's forward-substitution loop never
forms raw powers; its intermediates are the convergent M entries.
Port the reference loop faithfully (rows accumulate into a fresh
tensor). New adversarial parity test with near-identical keys and
beta ~= 1 diverges to 8e30 under the squaring form and passes under
forward substitution.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Prefill (seq_len >= 64) now runs the chunk-parallel gated delta rule
ported from the HF reference torch_chunk_gated_delta_rule
(chunk_size=64): identical math reorganised into per-chunk batched
matmuls (cuBLAS/tensor cores on CUDA, gemm on CPU) instead of the
O(L)-sequential per-token recurrence. Decode steps and short prompts
keep the recurrent paths (CUDA kernel / Rust loop) unchanged.
One deliberate deviation from the reference: its in-place row-by-row
UT-transform computes (I - T)^-1 - I by forward substitution; T is
strictly lower triangular and therefore nilpotent at chunk size 64,
so the same inverse is the six-factor product
prod_{j=0..5}(I + T^(2^j)), built by repeated squaring: batched matmuls
instead of 63 sequential row updates, which suits candle's immutable tensors. Chunk-local math
runs rank-3 over a flattened B*H*N batch dim (candle matmul supports
at most two batch dims).
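A small sketch of the identity this relies on, using plain nested `Vec<f64>` matrices rather than candle tensors; `Mat`, `inverse_by_squaring` and `inverse_by_forward_substitution` are hypothetical names for illustration. For a strictly lower-triangular T of size n, T^n = 0, so (I - T)^-1 equals the finite sum of powers of T, which the product of (I + T^(2^j)) factors reproduces. The two routes agree in exact arithmetic; their rounding behaviour is not the same.

```rust
type Mat = Vec<Vec<f64>>;

fn identity(n: usize) -> Mat {
    (0..n)
        .map(|i| (0..n).map(|j| if i == j { 1.0 } else { 0.0 }).collect())
        .collect()
}

fn matmul(a: &Mat, b: &Mat) -> Mat {
    let n = a.len();
    let mut out = vec![vec![0.0; n]; n];
    for i in 0..n {
        for k in 0..n {
            for j in 0..n {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn add_identity(m: &Mat) -> Mat {
    let mut out = m.clone();
    for i in 0..out.len() {
        out[i][i] += 1.0;
    }
    out
}

/// (I - T)^-1 as prod_j (I + T^(2^j)); valid only for strictly lower-triangular T.
fn inverse_by_squaring(t: &Mat) -> Mat {
    let n = t.len();
    let mut result = identity(n);
    let mut power = t.clone(); // holds T^(2^j)
    let mut covered = 1; // result == sum of T^k for k < covered
    while covered < n {
        result = matmul(&result, &add_identity(&power));
        power = matmul(&power, &power);
        covered *= 2;
    }
    result
}

/// Same inverse, row by row: X[i] = e_i + sum_{j<i} T[i][j] * X[j].
fn inverse_by_forward_substitution(t: &Mat) -> Mat {
    let n = t.len();
    let mut x = identity(n);
    for i in 0..n {
        for j in 0..i {
            let tij = t[i][j];
            for c in 0..=j {
                let contribution = tij * x[j][c];
                x[i][c] += contribution;
            }
        }
    }
    x
}
```

For n = 64 the loop in `inverse_by_squaring` runs six times (j = 0..5), matching the factor count in the text.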
Initial-state continuation is supported, so chunked prefill composes
with #11's restored prefix snapshots. Both single-GPU and TP paths
pick this up through the shared run_delta_rule dispatch.
NEURON_GDN_CHUNKED=0 forces the recurrent paths for A/B measurement.
Parity tests pin chunked against recurrent (2e-4 abs) across padding
(L=130), exact multiples with non-zero initial state (L=128 after a
50-token prefix), and a single exact chunk.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Appends the 2026-06-12 post-prefix-cache run: 27B @4k warm TTFT
7.07 s -> 1.43 s, no-cache control models unchanged, with a
methodology note that repeated-prompt cells now measure warm TTFT on
qwen3_5-arch models.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Second finding from live 27B validation: prompt-covering snapshots
still never matched. The rendered prompt ends with
`<|im_start|>assistant\n`, and when the next turn re-tokenizes that
text followed by the assistant's reply, BPE merges the trailing
newline with the reply's first characters — the final token(s) of the
cached sequence differ from the next prompt's, so the exact-prefix
match never fires. (A reply starting with an atomic special token
like <think> masks this, which is why the 0.8B check passed.)
Snapshot one past the last <|im_start|> instead: special tokens are
hard segmentation points, so ids up to and including it are provably
identical across renders. Prefill pauses at that boundary to capture
the snapshot, then finishes the ~2-token `assistant\n` tail. Applied
to all six request paths; unit tests for the cut helper.
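The cut rule can be sketched in a few lines. This is an illustration only: `snapshot_cut` is a hypothetical name and the token ids in the usage note are made up.

```rust
/// Number of leading tokens that are safe to snapshot: everything up to and
/// including the last occurrence of the `<|im_start|>` id, or `None` if the
/// prompt contains no such token.
fn snapshot_cut(tokens: &[u32], im_start: u32) -> Option<usize> {
    tokens
        .iter()
        .rposition(|&t| t == im_start)
        .map(|i| i + 1)
}
```

With a prompt such as `[1, 9, 5, 6, 9, 7, 8]` and `9` standing in for `<|im_start|>`, the cut is 5, leaving a two-token tail to prefill after the snapshot is taken.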
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Live validation on beast's Qwen3.6-27B showed reused=0 on every turn:
the post-generation snapshot includes reasoning tokens (<think>...)
that get stripped when the client echoes the assistant message back,
so the cached sequence is never a token-prefix of the next prompt.
quadbrat's 0.8B only matched because its think block round-tripped as
literal text.
Snapshot after prefill instead (covering exactly the prompt tokens) —
that is the state the next turn provably extends under a stable chat
template, regardless of how reasoning or tool-call content is
transformed on echo. Taken after the first healthy sample so
NaN-poisoned prefills never cache their state; this also retires the
forwarded-token bookkeeping and the consumer-hangup store sites.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two errors only the cuda config surfaces: the TpSnapshotKv dispatch
arms mixed candle and anyhow error types, and restore_or_clear_tp held
the registry MutexGuard across the cleanup await inside a let-chain
(making the TP request futures non-Send). Bind the removed ref before
awaiting, same discipline as the other lock sites.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Extends the prefix cache to tensor-parallel models — Qwen3.6-27B on
beast, where the TTFT win is largest. Closes #11.
Every rank holds its shard's snapshot under one pool-minted id: the
leader's lives in the device worker beside the TP slab
(Job::TpSnapshotKv / TpRestoreKv / TpDropKvSnapshot), each subprocess
rank stores its own in-process via new WorkerRequest variants
(SnapshotKvCache / RestoreKvCache / DropKvSnapshot). Shard state has
the same shape as single-GPU (attention ConcatKvCache + GDN
conv/recurrent state + rope_delta), so the snapshot types are reused;
all ranks sit at the same token boundary because step fan-out is
synchronous.
Consistency on partial failure: a failed restore falls back to
clear-all-ranks + full prefill (and drops the entry); a failed
snapshot drops the id on every rank so nothing half-stored leaks.
DropTp / UnloadModel invalidate a model's snapshots with it, covering
auto-recovery. Vision requests bypass as on single-GPU. Budget
accounting uses leader bytes x world_size (shards are symmetric).
Wired into both TP request paths (non-streaming inner + streaming
orchestration task); chunked_prefill_tp gains the restored-offset
start.
Closes #11
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Stop discarding cache state between requests. When an incoming
prompt's token sequence starts with the exact tokens of a stored
snapshot, restore it and prefill only the divergent suffix.
For the hybrid qwen3_5 arch a snapshot is attention ConcatKvCache k/v
+ GatedDeltaNet conv/recurrent state + the rope_delta counter, all at
one token boundary; the recurrent state cannot rewind, so matching is
exact-prefix only. GDN states are deep-copied both directions (the
CUDA delta-rule kernels mutate the state buffer in place); attention
k/v snapshots share storage safely (append-by-cat never mutates).
Snapshots live in the device worker's state next to the model slab
(Job::SnapshotKv / RestoreKv / DropKvSnapshot); the async side holds
only an opaque id + token sequence + byte size. DropArch drops a
model's snapshots with it, so unload and auto-recovery invalidate for
free. CPU loads hold snapshots inline on the legacy path.
Per-model LRU registry (harness/prefix_cache.rs) bounded by
[harness.candle.prefix_cache] budget_mb / max_entries, enabled by
default; inserting a snapshot drops entries it strictly extends.
Vision requests and candle-transformers archs bypass the cache
entirely (clear-every-request, unchanged).
Covers the single-GPU worker path (streaming + non-streaming) and the
CPU-local path. The TP path (Qwen3.6-27B on beast) is a follow-up PR
that closes #11 with before/after bench numbers.
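A rough sketch of the async-side registry logic, assuming a simplified shape: `PrefixRegistry`, `Entry` and their methods are hypothetical, the byte budget is left out, and recency is tracked by position in a `Vec` (most recently used last). It shows exact-prefix lookup and the rule that inserting a snapshot drops entries it strictly extends.

```rust
struct Entry {
    id: u64,
    tokens: Vec<u32>,
}

struct PrefixRegistry {
    entries: Vec<Entry>, // least recently used first
    max_entries: usize,
}

impl PrefixRegistry {
    fn new(max_entries: usize) -> Self {
        Self { entries: Vec::new(), max_entries }
    }

    /// Longest stored snapshot whose tokens are an exact prefix of `prompt`.
    /// Returns (snapshot id, number of reused tokens) and marks it recently used.
    fn lookup(&mut self, prompt: &[u32]) -> Option<(u64, usize)> {
        let best = self
            .entries
            .iter()
            .enumerate()
            .filter(|(_, e)| prompt.starts_with(&e.tokens))
            .max_by_key(|(_, e)| e.tokens.len())
            .map(|(i, _)| i)?;
        let entry = self.entries.remove(best);
        let hit = (entry.id, entry.tokens.len());
        self.entries.push(entry);
        Some(hit)
    }

    /// Stores a snapshot and returns the ids that must be dropped elsewhere:
    /// entries the new one strictly extends, then LRU victims over the cap.
    fn insert(&mut self, id: u64, tokens: Vec<u32>) -> Vec<u64> {
        let mut dropped = Vec::new();
        self.entries.retain(|e| {
            let strictly_extended =
                e.tokens.len() < tokens.len() && tokens.starts_with(&e.tokens);
            if strictly_extended {
                dropped.push(e.id);
            }
            !strictly_extended
        });
        self.entries.push(Entry { id, tokens });
        while self.entries.len() > self.max_entries {
            dropped.push(self.entries.remove(0).id);
        }
        dropped
    }
}
```

The returned ids matter because the registry holds only opaque handles; whoever owns the actual tensors still has to free the corresponding snapshots.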
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The /v1/messages handler translated request envelopes but proxied raw
OpenAI SSE frames back to streaming Anthropic clients — the gap
between the README's "point your tooling at it once" contract and
what Claude Code actually received.
cortex-core gains AnthropicStreamTranslator, a pure per-stream state
machine: OpenAI chunks in, ordered (event, payload) pairs out —
message_start → content_block_start/delta/stop (text and tool_use
blocks, indexed; tool_calls map to input_json_delta) → message_delta
(stop_reason mapped via the now-shared map_stop_reason, which also
teaches the non-streaming path tool_calls→tool_use) → message_stop.
Without an upstream usage frame the output count falls back to the
delta count (engine-exact for neuron's one-chunk-per-token streams,
#31); with one, input/output tokens ride message_delta.
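A text-only sketch of such a per-stream state machine. It is not the actual `AnthropicStreamTranslator`: the type and method names are hypothetical, tool_use blocks are omitted, and treating every finish reason other than `length` and `tool_calls` as `end_turn` is an assumption of this sketch.

```rust
#[derive(Debug, PartialEq)]
enum Event {
    MessageStart,
    ContentBlockStart { index: usize },
    TextDelta { index: usize, text: String },
    ContentBlockStop { index: usize },
    MessageDelta { stop_reason: &'static str, output_tokens: u64 },
    MessageStop,
}

/// OpenAI finish_reason to Anthropic stop_reason (defaults are assumed).
fn map_stop_reason(finish_reason: Option<&str>) -> &'static str {
    match finish_reason {
        Some("length") => "max_tokens",
        Some("tool_calls") => "tool_use",
        _ => "end_turn",
    }
}

#[derive(Default)]
struct TextStreamTranslator {
    started: bool,
    block_open: bool,
    finished: bool,
    deltas: u64,
}

impl TextStreamTranslator {
    /// One upstream text delta in, the ordered Anthropic events out.
    fn on_text(&mut self, text: &str) -> Vec<Event> {
        let mut out = Vec::new();
        if self.finished {
            return out;
        }
        if !self.started {
            self.started = true;
            out.push(Event::MessageStart);
        }
        if !self.block_open {
            self.block_open = true;
            out.push(Event::ContentBlockStart { index: 0 });
        }
        self.deltas += 1;
        out.push(Event::TextDelta { index: 0, text: text.to_string() });
        out
    }

    /// Closes the sequence once; later calls emit nothing.
    /// Without an upstream usage count, falls back to the delta count.
    fn finish(&mut self, finish_reason: Option<&str>, usage_output: Option<u64>) -> Vec<Event> {
        let mut out = Vec::new();
        if self.finished {
            return out;
        }
        self.finished = true;
        if !self.started {
            self.started = true;
            out.push(Event::MessageStart);
        }
        if self.block_open {
            self.block_open = false;
            out.push(Event::ContentBlockStop { index: 0 });
        }
        out.push(Event::MessageDelta {
            stop_reason: map_stop_reason(finish_reason),
            output_tokens: usage_output.unwrap_or(self.deltas),
        });
        out.push(Event::MessageStop);
        out
    }
}
```

Keeping the translator free of I/O is what makes it unit-testable in isolation; a wire layer can call `finish` both on `[DONE]` and on upstream truncation without emitting a duplicate tail.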
cortex-gateway gains anthropic_sse: the wire pump that splits the
upstream byte stream into SSE events, parses data: payloads
(leniently — engines omit fields on special frames), feeds the
translator, and frames results as `event:`/`data:` pairs through a
bounded channel (slow client back-pressures the upstream read).
Upstream truncation without [DONE] still closes the Anthropic event
sequence. Nothing is buffered beyond the current event's bytes.
Tests: 5 state-machine unit tests (text flow, stop-reason mapping +
defaults, tool_use blocks, usage propagation, idempotent finish) and
2 gateway integration tests (full event sequence + text reassembly,
usage propagation into message_delta). Validated end-to-end by
running this branch's gateway against a production neuron and
streaming a live Anthropic request.
Closes #24
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
script/bench.py: stdlib-only, works against any OpenAI-compatible /v1
endpoint (helexa, llama.cpp, Ollama, vLLM) so cross-engine tables are
a concatenation via the --label column. Measures the operator-felt
trio per (model, prompt-size) cell: TTFT (first SSE content chunk),
decode tok/s (visible tokens over the first→last chunk window, counted
via the engine's chunk-per-token invariant since streaming usage frames
aren't emitted yet, #31), and total wall-clock. Medians over N runs after one
warmup; append-only JSONL for longitudinal tracking.
Measurement traps found against the live fleet and handled:
- thinking models burn the budget invisibly (reasoning deltas are
off-wire by default) — the prompt appends Qwen's /no_think soft
switch
- short coalesced replies collapse the decode window to one TCP read
— rates require a ≥200 ms window and the prompt demands ~300 words
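The per-run arithmetic can be sketched as follows. The harness itself is Python; this Rust version is only an illustration, `summarize` and `Timings` are hypothetical names, and counting the tokens after the first chunk (n - 1 intervals) as the numerator is an assumption of the sketch.

```rust
use std::time::Duration;

/// Below this first-to-last window a decode rate is not reported.
const MIN_DECODE_WINDOW: Duration = Duration::from_millis(200);

#[derive(Debug)]
struct Timings {
    ttft: Duration,
    decode_tok_per_s: Option<f64>,
    total: Duration,
}

/// `chunk_offsets` are content-chunk arrival times measured from request
/// start, in arrival order; `total` is the full wall-clock duration.
fn summarize(chunk_offsets: &[Duration], total: Duration) -> Option<Timings> {
    let first = *chunk_offsets.first()?;
    let last = *chunk_offsets.last()?;
    let window = last.saturating_sub(first);
    let decode_tok_per_s = if window >= MIN_DECODE_WINDOW {
        // One chunk per token is assumed, so n chunks span n - 1 intervals.
        Some((chunk_offsets.len() - 1) as f64 / window.as_secs_f64())
    } else {
        None
    };
    Some(Timings { ttft: first, decode_tok_per_s, total })
}
```

Returning `None` for the rate, rather than a number computed over a collapsed window, keeps coalesced short replies from polluting the medians.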
doc/benchmarks.md: method, fleet table, and the first published
numbers (2026-06-12, 8f6f1d3): 1.7B@3060 81 tok/s, 8B@4090 62 tok/s,
27B@2×5090 Q6K TP=2 35 tok/s with flat decode from 128→4k context —
and the 7.1 s 4k-prefill TTFT recorded as #23's before-number.
Refs #22 (competitor baselines still pending — the harness is ready
for them)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A deploy previously went green the moment systemd reported the
service started — a merge that broke model loading or inference
itself would deploy "successfully" and only surface when a human
noticed. Each neuron deploy now earns its green:
1. Wait for default models: poll /health until activation.state is
ready, with per-host timeouts in the matrix (beast 900s for the
27B Q6K TP=2 cold-load, benjy/quadbrat 300s). Any entry in
activation.failed fails the deploy with the per-model error —
the structured equivalent of watching the journal for
"loaded default model", plus failure detail the journal line
can't carry.
2. LLM smoke probe: ask the first loaded model to reply with one
specific word (max_tokens 512 so thinking models have room,
temperature 0) and grep the response for it. Not a quality bar —
just proof the deploy didn't lobotomize inference.
Hosts whose package is already current still skip everything — the
validation cost is only paid when a restart actually happened. The
probe was dry-run against benjy's production neuron before landing.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The deferred Phase 6b, and the unblock for the 7→8 milestone's
benchmark work (#22): until cortex measures itself per request,
nothing downstream can be benchmarked or graphed.
The proxy wraps the upstream byte stream in a pass-through inspector
(TokenMetricsStream): chunks are forwarded verbatim — never buffered
or re-serialised — while the inspector records arrival times and
keeps a bounded (64 KiB) tail of the body text. At stream end (or
client disconnect, via Drop) it extracts the final OpenAI usage
object — present on the last SSE chunk and non-streaming JSON bodies
alike — for engine-truth token counts.
Per request, labelled {model, node}:
- cortex_time_to_first_token_seconds (histogram) — first body chunk
- cortex_tokens_per_second (histogram) — completion tokens over the
decode window (first→last chunk); falls back to total request
duration for single-chunk non-streaming bodies
- cortex_prompt_tokens_total / cortex_completion_tokens_total
(counters)
The extractor is pure and chunk-boundary-safe; quoted-needle matching
keeps completion_tokens_details from shadowing completion_tokens,
and the last usage object wins. Covers chat completions, completions,
the Responses API, and the Anthropic streaming path (which currently
proxies OpenAI SSE).
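A sketch of the two pieces described here, assuming simplified shapes: `push_tail` keeps a bounded tail of the body and `last_usage_field` pulls an integer field out of it. Both names are hypothetical and the real extractor may differ. Including the closing quote in the needle is what stops `completion_tokens_details` from matching `completion_tokens`.

```rust
/// Upper bound on the retained body tail (64 KiB, as in the text).
const TAIL_LIMIT: usize = 64 * 1024;

/// Appends a forwarded chunk to the tail, discarding the oldest bytes over the limit.
fn push_tail(tail: &mut Vec<u8>, chunk: &[u8]) {
    tail.extend_from_slice(chunk);
    if tail.len() > TAIL_LIMIT {
        let excess = tail.len() - TAIL_LIMIT;
        tail.drain(..excess);
    }
}

/// Value of the last `"key": <integer>` occurrence in `tail`, if any.
fn last_usage_field(tail: &str, key: &str) -> Option<u64> {
    let needle = format!("\"{key}\"");
    let mut result = None;
    let mut rest = tail;
    while let Some(pos) = rest.find(&needle) {
        rest = &rest[pos + needle.len()..];
        if let Some(value) = rest.trim_start().strip_prefix(':') {
            let digits: String = value
                .trim_start()
                .chars()
                .take_while(|c| c.is_ascii_digit())
                .collect();
            if let Ok(n) = digits.parse::<u64>() {
                result = Some(n); // later occurrences overwrite earlier ones
            }
        }
    }
    result
}
```

Because extraction runs on the accumulated tail rather than on individual chunks, a usage object split across two reads is still found.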
Tests: 4 extractor unit tests; integration test with a streaming
mock emitting a stream_options-style final usage chunk, asserting
both histograms and exact-or-greater counter values (the test
recorder is process-global and shared across the binary's tests).
Closes #21
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The un-rebooted driver update (userspace libs bumped, kernel module
still old) kills every CUDA call on the host including nvidia-smi,
and neuron surfaced it only as `Comm::from_rank ... NcclError` deep
inside the first model load — 30 minutes of forensics on beast
(2026-06-08) to diagnose. Make it instantly legible instead:
- discovery distinguishes nvidia-smi absent (CPU-only, fine) from
present-but-failing, classifies the "Driver/library version
mismatch" signature, and pairs the userspace NVML version with the
loaded kernel-module version from /proc/driver/nvidia/version.
- DiscoveryResponse gains `cuda_unavailable_reason` (omitted when
None — wire-compatible) so cortex can see why the node has no
devices and route around it.
- startup logs one loud ERROR line with the actionable reason
("reboot the host to reload the kernel module") and skips default
model loads entirely, marking each failed with that reason so
/health activation shows the real cause.
- POST /models/load fast-rejects with 503 + code=cuda_unavailable on
a mismatch host instead of dying minutes later in cuInit/NCCL.
No false positives: other nvidia-smi failures (no devices, perms)
keep their existing behaviour, CPU-only hosts stay silent.
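The classification can be pictured as a small pure function. This is a sketch with hypothetical names (`CudaAvailability`, `classify`); it takes the probe outcome as data instead of running nvidia-smi, and it leaves out the NVML and kernel-module version pairing.

```rust
#[derive(Debug, PartialEq)]
enum CudaAvailability {
    /// nvidia-smi ran successfully.
    Available,
    /// nvidia-smi is not installed: a CPU-only host, nothing to report.
    CpuOnly,
    /// Userspace libraries and the loaded kernel module disagree.
    DriverMismatch { reason: String },
    /// Any other failure keeps its existing handling.
    OtherFailure,
}

/// `probe` is `None` when the binary is absent, otherwise
/// `(exited_successfully, combined_output)`.
fn classify(probe: Option<(bool, &str)>) -> CudaAvailability {
    match probe {
        None => CudaAvailability::CpuOnly,
        Some((true, _)) => CudaAvailability::Available,
        Some((false, out)) if out.contains("Driver/library version mismatch") => {
            CudaAvailability::DriverMismatch {
                reason: "reboot the host to reload the kernel module".to_string(),
            }
        }
        Some((false, _)) => CudaAvailability::OtherFailure,
    }
}
```

Only the `DriverMismatch` arm would feed `cuda_unavailable_reason` and the fast 503; the other arms leave behaviour as it was.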
Closes #19
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Run 375 proved the CUDA image ships sccache (probe step printed
"sccache enabled") but the wrapper never reached cargo: the runner
does not propagate GITHUB_ENV across steps, so the builds ran
unwrapped (server stats: 4 compile requests for a ~600-crate build,
durations unchanged). Probe and export inside the build step's own
shell instead, in both build-neuron and ci.yml's cuda-check.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The f5fa840 deploy exposed both failure modes of gating with
`dnf check-update` as the gitea_ci user in one run: it hung
indefinitely on quadbrat (blocked process, 0 CPU, killed manually),
and on benjy/beast it silently reported "no updates" two minutes
after new RPMs were published — both hosts skipped a real (luckily
binary-identical) update.
Gate with data we own instead: fetch packages.json from
rpm.lair.cafe (plain curl, no privileges, no dnf locks), take the
newest release per package by buildTime, and skip the
stop/upgrade/start cycle only when it exactly equals
`rpm -q %{VERSION}-%{RELEASE}`. Unreachable or unparsable manifest
fails open to a full deploy. The dnf transaction itself still runs
under the scoped sudoers rules, unchanged.
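The gate decision reduces to a small comparison. The workflow does this in CI shell steps; the Rust below is only a sketch of the logic, with hypothetical names (`Release`, `Gate`, `gate`) and field names, since the packages.json schema is not shown beyond `buildTime`.

```rust
/// One manifest entry for a package (field names assumed).
struct Release {
    version_release: String, // same form as `rpm -q %{VERSION}-%{RELEASE}`
    build_time: u64,
}

#[derive(Debug, PartialEq)]
enum Gate {
    Skip,
    Deploy,
}

/// `manifest` is `None` when packages.json was unreachable or unparsable.
fn gate(manifest: Option<&[Release]>, installed: &str) -> Gate {
    let newest = manifest.and_then(|rs| rs.iter().max_by_key(|r| r.build_time));
    match newest {
        // Skip only on an exact match with the newest published release.
        Some(r) if r.version_release == installed => Gate::Skip,
        // Different release, empty manifest, or no manifest: fail open.
        _ => Gate::Deploy,
    }
}
```

Every uncertain case lands on `Deploy`, so a broken manifest can cost an unnecessary restart but can never hide a real update.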
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The 3 CUDA flavour builds (10-14 min each, the critical path of every
full run) and build-cortex compiled entirely uncached. With the
gongfoo-side sccache hardening in place, wire them up:
- build-cortex: full sccache env (rust image ships it) + the standard
escalation loop (retry -> server restart -> uncached final attempt).
- build-neuron: probe for sccache before enabling the wrapper — the
CUDA image may not ship it, and a missing binary must degrade to an
uncached build, not fail cargo at `sccache rustc -vV` (the original
reason the wrapper was cleared here). rustc compilations are shared
across all three flavours; candle-kernels' nvcc output stays
uncached (build-script artifact).
- ci.yml cuda-check: same probe pattern replaces the blanket env
clear; also pins CUDA_COMPUTE_CAP=86 since the image no longer
ships nvidia-smi for candle-kernels' fallback detection (mirrors
9bb9678 on the #20 branch).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
candle-kernels' build script shells out to nvidia-smi for compute-cap
detection when CUDA_COMPUTE_CAP is unset; the current GPU-less builder
image doesn't ship it, so the type-check died in the build script
before borrow-checking anything. Pin an arbitrary valid cap — the
check is feature-gate compilation only; real caps live in
build-prerelease.yml's flavour matrix.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
During the #17 auto-recovery window (unload → reload, minutes for a
large TP model) the model's registry slot is absent, so it vanished
from neuron's /models — and cortex, routing by /models presence,
answered "model not found on any node" while a direct request to
neuron would have correctly said "recovering, retry shortly".
neuron: the recovery set becomes a map carrying a devices/capabilities
snapshot taken at trigger time (while the registry slot still exists).
list_models reports `recovering` for models in the set — both while
the poisoned slot is still present and during the reload gap, where
the snapshot keeps the model listed.
gateway: ModelStatus grows a Recovering variant (parsed from the
wire); the router holds the route — new RouteError::ModelRecovering
mapped to 503 instead of 404 — and deliberately does not fall through
to the catalogue cold-load, which would race a second placement
against the in-flight recovery. The evictor already ignores
non-Loaded entries.
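A compact sketch of the routing rule, with reduced hypothetical types: the real `ModelStatus` and `RouteError` have more variants, and `route` and `status_code` are illustrative names. A model that is recovering somewhere holds its route and answers 503 instead of falling through to 404 or a cold-load.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ModelStatus {
    Loaded,
    Recovering,
}

#[derive(Debug, PartialEq)]
enum RouteError {
    ModelNotFound,
    ModelRecovering,
}

impl RouteError {
    fn status_code(&self) -> u16 {
        match self {
            RouteError::ModelNotFound => 404,
            RouteError::ModelRecovering => 503,
        }
    }
}

/// `nodes` lists (node name, status) for every node currently reporting the model.
fn route<'a>(nodes: &[(&'a str, ModelStatus)]) -> Result<&'a str, RouteError> {
    if let Some((node, _)) = nodes.iter().find(|(_, s)| *s == ModelStatus::Loaded) {
        return Ok(*node);
    }
    if nodes.iter().any(|(_, s)| *s == ModelStatus::Recovering) {
        // Hold the route: no cold-load that would race the in-flight recovery.
        return Err(RouteError::ModelRecovering);
    }
    Err(RouteError::ModelNotFound)
}
```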
Tests: neuron unit test (recovering model stays listed with snapshot),
gateway integration tests (poller parses `recovering`; request gets
503 retry-shortly and the model stays on /v1/models).
Closes #20
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Run 361's Test job failed all 3 attempts with the sccache
dead-server signature (sccache fatal error, ENOENT on its own tmp
files under target/debug/deps). Retrying the same invocation only
helps for transient races; against a wedged server every same-VM
retry fails identically — and under the new pipeline that blocks
publish and the deploy behind it.
Escalate instead: attempt 1 plain, attempt 2 after an sccache server
restart, attempt 3 with RUSTC_WRAPPER unset (uncached). A sick cache
now costs build minutes, never the deploy. Applied to the lint/test
jobs in build-prerelease.yml and ci.yml alike.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Push-to-testable was ~20.5 min for every commit (measured on the
2026-06-08 green chain) plus a ~5 min 27B cold-load, regardless of
what changed. Three structural fixes, plus a new manual fast path:
- build-prerelease: a change-detection step in `prepare` diffs HEAD
against the git sha embedded in the last *published* unstable RPM
(per package, from packages.json) and skips builds whose inputs
didn't change. Docs-only commits build nothing; gateway-only
commits skip the 3 CUDA flavour builds. Detection failures fall
open to a full build.
- ci.yml no longer runs on pushes to main; fmt/clippy/test live in
build-prerelease as parallel jobs gating publish. The two workflows
previously queued against each other on the same runner labels,
delaying the cortex build ~12 min. Branches, PRs, and tags keep the
full ci.yml gate.
- deploy: each host self-gates with `dnf check-update` and leaves the
service untouched when the installed package is already current —
no more neuron restarts (and 27B cold-loads) for commits that
didn't change neuron.
- deploy-dev (new): manual single-host fast path — build one CUDA
flavour, scp the binary, restart the service. Skips packaging,
signing, publish, and dnf entirely. Backed by a new exact-form
sudoers rule in asset/sudoers.d/neuron-host.conf (already applied
to all three hosts).
Expected loop times when runners behave: docs ≈ 1 min (nothing
deploys), gateway-only ≈ 6-8 min, single-neuron dev ≈ 8-10 min,
full fleet ≈ 13-15 min.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Lead with what helexa is for — near-frontier open-weight models on
consumer hardware you own — instead of a feature list. Adds the scope
section (intentional divergence from vLLM/SGLang; CUDA-only today as a
test-coverage constraint, not a principle), an engine section covering
the per-device worker threads and consumer-GPU tensor parallelism, the
previously-missing helexa-acp crate, and a status section pointing at
git.lair.cafe as the source of truth with GitHub as read-only mirror.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
helexa is the project; cortex (per-operator control plane / LLM proxy)
and neuron (per-host LLM harness) are its components. The Gitea repo
is now helexa/helexa. Update repository URLs in Cargo metadata, RPM
specs, and docs; make the CI changelog push URL rename-proof via the
github.repository context; reframe README.md and CLAUDE.md around the
project name. Binary, package, service, and config-path names are
unchanged.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The fork's new commit makes `Comm: Send + Sync` (asserting NCCL's
thread-safety invariant upstream) and makes `Comm::abort` idempotent via
an `aborted` flag (so abort-then-Drop can't double-free) — strictly
better than the previous Drop-no-panic workaround, and the `abort()`
signature is unchanged so the watchdog call site is unaffected.
Because `Comm` is now `Send + Sync`, `Arc<Comm>` and the `SendComm` /
`NcclState` wrappers auto-derive `Send`/`Sync`, which conflicts (E0119)
with neuron's manual `unsafe impl`s. Remove the four now-redundant impls
— the safety assertion lives upstream in cudarc where it belongs. The
conflict is in cuda-gated code, so only the CUDA type-check catches it
(non-cuda build + clippy + tests stay green).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No code change. On each deploy run, the degraded CI runner quickly kills a
different single arch build (blackwell, then ada), so the all-arch-gated
packaging is skipped and nothing publishes. Every arch HAS built green across runs
(blackwell ✅ in 342, ampere ✅, ada ✅ in 339) and the gate + CUDA
type-check pass. Re-running to catch all three green in one run so the
Stage-2 RPMs publish. Runner FS/cache health is the real fix (separate
infra work).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No code change. The c94a2ae deploy's neuron-blackwell build died ~12min
into the Blackwell kernel compile on the degraded runner, while
neuron-ampere + neuron-ada built the identical Rust + patched cudarc
cleanly and the CUDA type-check passed. Transient infra; re-running to
get a healthy blackwell build so the RPMs publish and beast (Blackwell)
picks it up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`super::nccl_state` from tp/mod.rs resolves to `crate::harness::nccl_state`
(nonexistent); the module is the child `nccl_state` (cf. the existing
`nccl_state::generate_comm_id_hex` call). The field is cuda-gated so the
non-cuda build couldn't catch it; the branch CUDA type-check flaked on the
runner before compiling. Self-audited fix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make a hung NCCL collective recoverable instead of a permanent brick.
Today a wedged collective hangs the in-process leader thread forever, and
even Stage 1's recovery can't help — its unload's DropTp queues behind the
stuck thread and hangs too.
- Cache the leader's NCCL Comm handle async-side at init (new cuda-gated
Job::GetLeaderComm → DeviceWorkerHandle::get_leader_comm → stored on
WorkerPool.leader_comm). Fetched while the thread is responsive — a
wedged thread can't service the fetch, which is why it's cached up front.
- Wrap the leader forward in both generate_step and
generate_step_with_images in tokio::time::timeout (default 120s,
NEURON_TP_STEP_TIMEOUT_S). On expiry the watchdog calls
Comm::abort() (ncclCommAbort) on the cached handle from the async
thread — the one NCCL op sanctioned concurrently with an in-flight
collective — which unblocks the leader thread, then fails the step
WITHOUT draining (workers are wedged too; recovery's unload kills them).
The error is a device fault → poison → Stage 1 auto-recovery, which now
completes because the leader thread is responsive again.
- Bumps the cudarc patch to dbc425a (adds the Drop-must-not-panic fix so
the post-abort comm teardown during recovery doesn't double-abort-panic).
Logs the whole sequence at ERROR with greppable `tp watchdog:` /
`ncclCommAbort` markers so a real-world hang leaves a forensic trail —
verification is by inspecting journals after real hangs, not a synthetic
harness. cuda-gated → validated by the blackwell build.
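The watchdog shape can be sketched with the standard library alone. This is not the actual implementation: it uses a thread and `recv_timeout` where the commit uses `tokio::time::timeout`, `step_with_watchdog` and `StepError` are hypothetical names, and the `abort` closure merely stands in for the call to `Comm::abort()` on the cached handle.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum StepError {
    /// The step did not finish in time (or its thread died); abort was called.
    TimedOut,
}

/// Runs `step` on its own thread and waits up to `timeout` for the result.
/// On expiry, `abort` is called from the waiting thread and the step fails
/// immediately, without waiting for the stuck thread to drain.
fn step_with_watchdog<T, F, A>(step: F, abort: A, timeout: Duration) -> Result<T, StepError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
    A: FnOnce(),
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that.
        let _ = tx.send(step());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(_) => {
            abort();
            Err(StepError::TimedOut)
        }
    }
}
```

The important property is that `abort` runs on a thread that is not the wedged one, which is why the handle has to be obtained while the worker is still responsive.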
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#17 Stage 2 (TP hang-recovery) needs to call ncclCommAbort on a LIVE
communicator from another thread — to unblock a collective wedged on a
dead/hung peer so the ranks can resync. No cudarc release (incl. main)
exposes this: the safe Comm only aborts in Drop, which can't fire while a
stuck thread holds an Arc<Comm> clone.
Pin neuron's cudarc 0.19.7 to a fork (grenade/cudarc @ nccl-comm-abort,
rev 4dff0be) adding three thin methods — Comm::abort, get_async_error,
and a raw comm() accessor — to be submitted upstream. The patch targets
0.19.x only; candle's transitive cudarc 0.17.8 stays on crates.io.
Foundation only; the watchdog + abort + comm-rebuild that consume these
land in follow-up commits (cuda-gated → validated by the blackwell build).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No code change. The abc6e60 deploy's neuron-ada build died on the
degraded CI runner (container dropped mid-checkout), skipping the
gated publish — even though neuron-blackwell + neuron-ampere compiled
the Stage-1 fault-recovery code cleanly. Re-running to get a healthy
ada build so the RPMs publish and beast picks up the build.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
One-shot, env-gated fault injector for beast verification: when
NEURON_DEBUG_POISON names a model, the first request for it triggers the
auto-recovery path as if a device fault had occurred — exercising
unload→reload→healthy without corrupting the GPU. Latched so it fires
exactly once (no recovery loop). No-op unless the env var is set; wired
into both the single-GPU and TP chat poison gates.
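A one-shot latch of this kind fits in a few lines. The sketch below uses hypothetical names (`PoisonInjector`, `from_env_value`, `should_inject`) and takes the variable's value as an argument so it can be exercised without touching the process environment.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct PoisonInjector {
    target: Option<String>,
    fired: AtomicBool,
}

impl PoisonInjector {
    /// `value` is the content of the env var, or `None` when it is unset.
    fn from_env_value(value: Option<String>) -> Self {
        Self { target: value, fired: AtomicBool::new(false) }
    }

    /// True exactly once, for the first request naming the target model.
    fn should_inject(&self, model: &str) -> bool {
        match &self.target {
            // swap returns the previous value, so only the first caller sees false.
            Some(t) if t.as_str() == model => !self.fired.swap(true, Ordering::SeqCst),
            _ => false,
        }
    }
}
```

A caller would build it with something like `PoisonInjector::from_env_value(std::env::var("NEURON_DEBUG_POISON").ok())`; the atomic swap keeps the latch single-fire even when requests race.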
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When an inference hit a device fault, the model was flagged poisoned and
every subsequent request rejected with "unload and reload the model to
recover" — until a *human* did exactly that. Now the harness rebuilds the
context automatically.
- Retain the loading `ModelSpec` on `LoadedModel`/`TpLoadedModel` (+
`LoadedHandle::spec()`) so a poisoned model can be reloaded without an
operator reconstructing the spec.
- A background recovery task (held via `Weak<CandleHarness>`, spawned in
`new()` when a runtime is present) drains poisoned model ids and runs
`unload_model` → `load_model(spec)`. Unload drops the model → cudarc
`Comm::drop` aborts NCCL + releases the context; reload re-runs NCCL
init + sanity inside the load path, so a successful reload yields a
fresh, healthy model. A failed reload leaves it unloaded (next load
retries) — never poisoned forever.
- The request-entry poison gates now `trigger_recovery` (single-flight
per model via a `recovering` set) and return a transient "recovering,
retry shortly" error instead of the manual-reload message. Requests
that arrive during the brief reload gap (model absent from the registry)
also get "recovering" rather than a misleading "not loaded".
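The single-flight part of the bullets above can be sketched like this. Names other than `trigger_recovery` and the `recovering` set are hypothetical, and the sketch only covers the bookkeeping, not the background task that performs unload and reload.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

#[derive(Default)]
struct Recovery {
    recovering: Mutex<HashSet<String>>,
}

impl Recovery {
    /// Returns true only for the caller that should enqueue the recovery;
    /// every later caller for the same model gets false until it finishes.
    fn trigger_recovery(&self, model: &str) -> bool {
        self.recovering.lock().unwrap().insert(model.to_string())
    }

    /// Lets request paths answer "recovering, retry shortly" during the reload gap.
    fn is_recovering(&self, model: &str) -> bool {
        self.recovering.lock().unwrap().contains(model)
    }

    /// Called by the background task once reload succeeded or failed.
    fn finish_recovery(&self, model: &str) {
        self.recovering.lock().unwrap().remove(model);
    }
}
```

`HashSet::insert` already reports whether the value was new, so the check and the registration happen under one lock acquisition.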
`new()` now returns `Arc<Self>`. Recovery runs only on the background
task — never inline on the request path, which holds `inference_lock`
and would deadlock on the `models` write lock.
Stage 1c of the #17 plan (verified-healthy auto-recovery). Watchdog
(1b) + a fault-injection hook for beast verification follow. The
in-process rank-0 leader's own context fault still needs a reload that
can't rebind it (Stage 3); comm-desync + worker faults recover here.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Beast testing surfaced a real regression in the dynamic-resolution
default: a tall 808×1600 image resized (within the 1024² max_pixels) to a
90×44 patch grid = 3960 patches, exceeding the vision tower's hard
`num_position_embeddings = 2304` pos-embed budget. The per-rank
`patch count 3960 exceeds pos_embed budget 2304` error fired mid-TP-forward
and poisoned the device context, bricking the model until reload.
Hard-cap `max_pixels` to `2304 × 16² = 589_824` px (≤ 2304 patches →
≤ 576 LM tokens), clamping even the operator env override. `smart_resize`
floors the pixel count under the cap, so no resized image can ever exceed
the budget — the tower check never fires, no poison. The pos-embed grid
(48×48) is the resolution Qwen3.6 was trained at, so the cap is
principled, not just defensive. Still ~3× the old fixed 196 tokens, and
the book-cover OCR test (1176 patches) already reads full title+subtitle.
Test: a huge/tall/wide/extreme image battery stays within the 2304 patch
budget. (Per-rank-error poison robustness itself remains issue #17.)
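The arithmetic can be checked with a sketch of the resize rule. This follows the publicly known Qwen `smart_resize` shape (round to the factor, then scale down with floor or up with ceil), but it is an approximation for illustration: the aspect-ratio rejection is omitted, and the constant names and `patch_count` are hypothetical.

```rust
const PATCH: u32 = 16;
const POS_EMBED_BUDGET: u32 = 2304;
/// 2304 patches of 16x16 pixels.
const MAX_PIXELS_CAP: u32 = POS_EMBED_BUDGET * PATCH * PATCH;

/// Returns (height, width), both multiples of `factor`, with the pixel count
/// pushed inside [min_pixels, max_pixels] while roughly keeping the aspect ratio.
fn smart_resize(h: u32, w: u32, factor: u32, min_pixels: u32, max_pixels: u32) -> (u32, u32) {
    let f = factor as f64;
    let (hf, wf) = (h as f64, w as f64);
    let mut h_bar = ((hf / f).round() * f).max(f);
    let mut w_bar = ((wf / f).round() * f).max(f);
    if h_bar * w_bar > max_pixels as f64 {
        // Too large: shrink and floor, so the result stays under the cap.
        let beta = (hf * wf / max_pixels as f64).sqrt();
        h_bar = (hf / beta / f).floor() * f;
        w_bar = (wf / beta / f).floor() * f;
    } else if h_bar * w_bar < min_pixels as f64 {
        // Too small: grow and ceil, so the result reaches the minimum.
        let beta = (min_pixels as f64 / (hf * wf)).sqrt();
        h_bar = (hf * beta / f).ceil() * f;
        w_bar = (wf * beta / f).ceil() * f;
    }
    (h_bar as u32, w_bar as u32)
}

fn patch_count(h: u32, w: u32) -> u32 {
    (h / PATCH) * (w / PATCH)
}
```

Under the old 1024² limit the 808×1600 image resizes to 1440×704 pixels, a 90×44 grid of 3960 patches; with `MAX_PIXELS_CAP` it resizes to 1056×544, a 66×34 grid of 2244 patches, inside the 2304 budget.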
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- PreprocessProfile::qwen3_6() reads NEURON_VISION_MIN_PIXELS /
NEURON_VISION_MAX_PIXELS (clamped to factor² ≤ min ≤ max), matching the
NEURON_VISION_LEGACY_* / NEURON_MROPE knob convention. Defaults remain
256²…1024² (64…1024 LM tokens/image).
- Test: a max-resolution source caps within the token budget (can't blow
NEURON_MAX_PROMPT_TOKENS).
- Strip stale fixed-resolution / "MRoPE gap (#15)" / 14×14 language from
the preprocess, mod, and rope doc-comments now that resolution is
dynamic and M-RoPE is implemented.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the fixed 448×448-square preprocess with native-aspect
`smart_resize`, and thread the resulting per-image grid through the LM
so spatial structure survives non-square images (documents, screenshots,
charts, panoramas, OCR) instead of being squished into a square.
- preprocess.rs: port Qwen `smart_resize` (factor = patch×merge = 32;
pixel budget [min,max], default 256²–1024² → 64–1024 LM tokens).
`PreprocessProfile` drops the fixed target dims for `factor`/`min_pixels`/
`max_pixels`; `preprocess`/`preprocess_data_uri` now return the resized
`(h, w)`; add `resized_dims_for_uri` (decode + resize, no normalize) for
the TP leader's token count.
- rope.rs: `compute_mrope_index`/`get_rope_index` take per-image
`grids: &[(lm_gh, lm_gw)]` instead of assuming a square `isqrt(run)`.
Walk image runs in order, validate `run == gh*gw`, emit row-major
positions, resume the shared counter at `base + max(gh,gw)`. Correct
for multiple images of differing grids interleaved with text.
- candle.rs: `VisionMeta`/`LoadedModel`/`TpLoadedModel` carry the
`image_grid_factor` (patch×merge) instead of the constant 196; all four
prompt-build sites compute per-image counts from each image's resized
grid (single-GPU from the extracted `ImageInput.h/w`, TP from
`resized_dims_for_uri`). `ModelArch` gains `vision_grid_factor`.
- single-GPU (`mod.rs`, `dispatch.rs`) and TP
(`tp_qwen3_5.rs::prefill_with_images_chunked`, `dispatch.rs`,
`tp/worker.rs`) thread the grids into `get_rope_index`. Each TP rank
recomputes grids from its own deterministic preprocess — no rpc.rs
change, single source of truth.
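The rope.rs walk described in the list above might look roughly like the sketch below. It reuses the name `compute_mrope_index` but the signature is an assumption: image placeholder tokens are marked by a boolean mask, positions come back as (t, h, w) triples, and adjacent image runs are assumed to be separated by at least one non-image token.

```rust
/// One (t, h, w) position triple per token. `is_image[i]` marks image
/// placeholder tokens; `grids` lists the (gh, gw) LM grid of each image run
/// in prompt order.
fn compute_mrope_index(
    is_image: &[bool],
    grids: &[(usize, usize)],
) -> Result<Vec<(usize, usize, usize)>, String> {
    let mut out = Vec::with_capacity(is_image.len());
    let mut next = 0usize; // shared position counter
    let mut grid_iter = grids.iter();
    let mut i = 0;
    while i < is_image.len() {
        if !is_image[i] {
            // Text tokens advance all three axes together.
            out.push((next, next, next));
            next += 1;
            i += 1;
            continue;
        }
        let run = is_image[i..].iter().take_while(|&&b| b).count();
        let &(gh, gw) = grid_iter.next().ok_or("more image runs than grids")?;
        if run != gh * gw {
            return Err(format!("image run of {run} tokens does not match grid {gh}x{gw}"));
        }
        let base = next;
        for row in 0..gh {
            for col in 0..gw {
                // Row-major: t stays at the base, h and w follow the grid.
                out.push((base, base + row, base + col));
            }
        }
        next = base + gh.max(gw);
        i += run;
    }
    Ok(out)
}
```

For one text token followed by a 2×3 image and another text token, the image tokens span h in 1..=2 and w in 1..=3, and the trailing text token resumes at position 4, which is `base + max(gh, gw)`.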
The vision tower itself was already grid-general (recent pos-embed
interpolation + 2D rotary fix). No patch-count cap: pos-embed is
interpolated to any grid; `max_pixels` bounds cost (O(patches²) ViT
attention + prefill) instead.
Tests: smart_resize (aspect/cap/floor/reject), `compute_mrope_index`
non-square + two-image + mismatch cases, square-grid regression guard.
Non-cuda build + clippy + full workspace tests green; TP load/dispatch
paths are cuda-gated → Gitea CUDA type-check. Operator pixel-budget
config + remaining doc cleanup follow in C5.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two fixes to the spatial handling of images, validated against the HF
transformers 4.57.1 qwen3_vl reference on beast.
**Vision tower (the real cause of poor spatial vision).** The Stage-A
tower encoded position two ways wrong, so the model saw image *content*
but not *layout* (a row of 5 people read as "a line of 23", sky
inverted), regardless of the LM-side rope:
- Learned pos-embed was a naive sequential lookup of the first
`n_patches` rows of the 48×48 (`num_position_embeddings=2304`) grid —
wrong stride for a 28×28 patch grid. Now bilinearly interpolates the
grid to `gh×gw` (port of HF `fast_pos_embed_interpolate`), row-major.
- The 2D vision rotary was absent entirely. Added
`VisionRotaryEmbedding` (θ=10000, dim=head_dim/2) applying per-patch
`(row, col)` rotary to q/k in every ViT block via rope_slow, matching
HF `apply_rotary_pos_emb_vision`.
Both default on; `NEURON_VISION_LEGACY_POS=1` / `NEURON_VISION_LEGACY_ROPE=1`
revert each for A/B (no rebuild). New unit tests: interpolation reduces
to the sequential lookup at the native grid; rotary row/col structure.
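A minimal sketch of the pos-embed interpolation described above, on
plain slices rather than candle tensors. The function name, the flat
row-major layout and the endpoint-inclusive sample spacing are
assumptions made for illustration; the real port follows HF
`fast_pos_embed_interpolate`.

```rust
/// Bilinearly resample a learned `side x side` position-embedding table
/// (row-major, `dim` floats per cell) onto a `gh x gw` patch grid.
/// Hypothetical helper: illustrates the technique, not the neuron code.
fn interpolate_pos_embed(table: &[f32], side: usize, dim: usize, gh: usize, gw: usize) -> Vec<f32> {
    assert_eq!(table.len(), side * side * dim, "table must hold side*side rows of dim floats");
    // Evenly spaced sample coordinates over [0, side - 1], endpoints included.
    let coords = |n: usize| -> Vec<f32> {
        (0..n)
            .map(|i| if n > 1 { i as f32 * (side - 1) as f32 / (n - 1) as f32 } else { 0.0 })
            .collect()
    };
    let (ys, xs) = (coords(gh), coords(gw));
    let mut out = Vec::with_capacity(gh * gw * dim);
    for &y in &ys {
        let y0 = y.floor() as usize;
        let y1 = (y0 + 1).min(side - 1); // clamp at the bottom edge
        let dy = y - y0 as f32;
        for &x in &xs {
            let x0 = x.floor() as usize;
            let x1 = (x0 + 1).min(side - 1); // clamp at the right edge
            let dx = x - x0 as f32;
            for d in 0..dim {
                let at = |r: usize, c: usize| table[(r * side + c) * dim + d];
                // Weighted sum of the four surrounding table cells.
                out.push(
                    (1.0 - dy) * (1.0 - dx) * at(y0, x0)
                        + (1.0 - dy) * dx * at(y0, x1)
                        + dy * (1.0 - dx) * at(y1, x0)
                        + dy * dx * at(y1, x1),
                );
            }
        }
    }
    out
}
```

At the native grid every fractional weight is zero, so the output is
the table itself, which is the invariant the new unit test pins down.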
**M-RoPE default on.** The interleaved M-RoPE matches HF
apply_interleaved_mrope / get_rope_index exactly, and in A/B runs it was
never worse than plain RoPE. `NEURON_MROPE` is now a kill switch (`=0`
for plain), not opt-in: defaults should encode the model's trained
behaviour, not freeze the broken state.
Vision tower is plain candle (CPU-testable): built, clippy-clean, full
workspace tests green locally.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On beast the interleaved M-RoPE degraded image understanding rather than
fixing it: the model misread spatial layout (a horizontal row of people
described as a "diagonal receding line"), got attributes wrong, and
rambled: a "how many people" follow-up generated 4459 tokens over 3.5
minutes, past agent-0's HTTP timeout (the reported "fails to respond
without an error" symptom). The interleave is evidently not numerically
correct, and it can't be validated remotely without a transformers
reference.
Gate it: `get_rope_index` now returns plain sequential identity
positions unless NEURON_MROPE is truthy, so mrope_cos_sin reduces to
plain RoPE and image tokens behave exactly as pre-M-RoPE (content
recognition works; spatial layout approximate; no rambling). The real
computation moves to `compute_mrope_index` (still unit-tested). Default
off restores the working vision and unblocks agent-0; the M-RoPE code
stays in place to debug + validate before flipping the default on.
Pure non-cuda change (rope.rs); both single-GPU and TP forwards call
the gated get_rope_index unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mirror Stage 3 into the tensor-parallel Qwen3.6 model:
- TpQwen3_5Attention / DecoderLayer take (cos, sin) instead of a scalar
offset and apply via apply_cos_sin.
- TpQwen3_5Model gains the replicated rotary + rope_delta (reset in
clear_kv_cache, settable). forward_inner builds the cos/sin once —
interleaved M-RoPE from explicit position_ids (vision) or plain at
offset+rope_delta (text/decode). forward() and forward_with_positions()
delegate; the old single-shot forward_with_vision is gone.
- prefill_with_images_chunked now computes get_rope_index over the whole
prompt once, stores rope_delta on the base model, and slices the
(3, prompt_len) position tensor per chunk — so every rank assigns image
tokens their 14×14 grid coordinates and steps in lockstep (every chunk,
text or image, carries the M-RoPE slice because the image shifts the
surrounding text positions).
Also build the position-id tensor as f32 directly (positions are small
integers, exact in f32) to avoid an i64→f32 cast on the GPU.
The TP forward is cuda-gated — CI CUDA type-check is the compile gate.
Non-cuda build + clippy + full workspace tests green; rope math + the
plain-RoPE-reduction invariant covered by unit tests.
Completes the interleaved-M-RoPE work for the vision spatial misread.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Qwen3_5Model now builds the rotary cos/sin once per forward and threads
(cos, sin) through the decoder → full-attention → rope, replacing the
scalar offset that reached RotaryEmbedding:
- vision forward computes get_rope_index over the (single-shot) prompt,
sets rope_delta, and builds interleaved-M-RoPE cos/sin so image tokens
carry their 14×14 grid (height/width) positions;
- text / decode take plain_cos_sin at offset + rope_delta — with
rope_delta == 0 (no image) this is bit-for-bit the old plain RoPE, and
the device→host id copy is skipped on the text decode hot path.
rope_delta is stored on the model and reset in clear_kv_cache, so decode
after a vision prefill resumes text positions from the image-compressed
counter. decoder.rs / full_attn.rs take (cos, sin) instead of offset;
linear-attention layers are unchanged (no RoPE). The TP path still uses
the retained apply(offset) — wired in Stage 4.
Full workspace tests green; the load-bearing invariant (M-RoPE == plain
for equal axes) keeps text unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pure function computing the interleaved-M-RoPE 3D position ids for a
prompt with image-placeholder runs, plus the decode rope_delta:
- text tokens advance a single counter (all axes equal);
- each image run gets [base+t, base+h, base+w], row-major over a square
grid with grid_t=1 and grid_h=grid_w=isqrt(run) (196 → 14×14);
- the counter then resumes from base + max(grid).
rope_delta = final_counter - seq_len lets decode resume text positions
after the position-compressed image blocks.
Plus mrope_position_tensor to build the (3, seq) tensor.
Unit tests: text-only is sequential (delta 0); text+image+text matches
hand-computed grid ids + resume + delta; 196 → 14×14; non-square run
rejected; end-to-end through mrope_cos_sin tracks the height axis.
#[allow(dead_code)] until Stage 3/4 wire it into the forward.
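The position rules above can be sketched as follows. This is an
illustration under assumptions, not the committed function: the name
`mrope_index_sketch`, the `Result<_, String>` error type and the
treatment of any contiguous run of placeholder ids as one image are all
hypothetical.

```rust
/// Returns the three position rows ([t, h, w], each `seq_len` long) and
/// the decode `rope_delta`. Hypothetical sketch of the rules in this
/// commit message.
fn mrope_index_sketch(
    input_ids: &[u32],
    image_token_id: u32,
) -> Result<([Vec<i64>; 3], i64), String> {
    let mut pos: [Vec<i64>; 3] = [Vec::new(), Vec::new(), Vec::new()];
    let mut counter: i64 = 0;
    let mut i = 0;
    while i < input_ids.len() {
        if input_ids[i] != image_token_id {
            // Text token: all three axes share the running counter.
            for axis in pos.iter_mut() {
                axis.push(counter);
            }
            counter += 1;
            i += 1;
            continue;
        }
        // Image run: must be a perfect square (grid_t = 1, grid_h = grid_w).
        let run = input_ids[i..].iter().take_while(|&&t| t == image_token_id).count();
        let side = (run as f64).sqrt().round() as usize;
        if side * side != run {
            return Err(format!("image run of {run} tokens is not a square grid"));
        }
        let base = counter;
        for h in 0..side {
            for w in 0..side {
                pos[0].push(base); // t index is always 0 for a still image
                pos[1].push(base + h as i64);
                pos[2].push(base + w as i64);
            }
        }
        // Resume text from base + max(grid_t, grid_h, grid_w) = base + side.
        counter = base + side as i64;
        i += run;
    }
    let rope_delta = counter - input_ids.len() as i64;
    Ok((pos, rope_delta))
}
```

For a lone 196-token image the counter ends at 14, so rope_delta is
14 - 196 = -182: decode continues from position 14 rather than 196.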
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Parse + store mrope_section / mrope_interleaved in RopeParameters
(previously accepted-but-ignored). RotaryEmbedding gains:
- inv_freq + per-axis column masks (mask_t/h/w) built from mrope_section;
- plain_cos_sin(pos, seq_len): narrow the precomputed tables (text/decode);
- mrope_cos_sin(position_ids (3,seq)): per-axis freqs blended at the
interleave columns (vision);
- apply_cos_sin(q,k,cos,sin): the rope_slow application, factored out.
The existing apply(q,k,offset) is retained (delegates to
plain_cos_sin + apply_cos_sin) so current callers are unchanged; Stages
3–4 move cos/sin construction into the model forward and thread the 3D
position ids for image tokens.
Tests: masks partition the half-dim; interleave drives the right axis
per column; and the load-bearing invariant — mrope_cos_sin reduces
bit-for-bit to plain_cos_sin when the three axes are equal (so text
inference is unchanged).
Refs the MRoPE-gap diagnosis (vision spatial misread). Pure non-cuda;
no behaviour change until wired.
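A sketch of the per-axis column assignment and the equal-axes invariant
described above. The interleave rule (column c belongs to h when
c % 3 == 1, to w when c % 3 == 2, each bounded by 3 × its section
length, otherwise to t) is my reading of HF's interleaved layout, and
the `[24, 20, 20]` section used in the usage note is an illustrative
value, not one read from the checkpoint. `Axis`,
`interleaved_axis_per_column` and `mrope_angles` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Axis {
    T,
    H,
    W,
}

/// Assign each rotary frequency column (half of head_dim) to one
/// position axis using the interleaved layout. Assumed rule, see lead-in.
fn interleaved_axis_per_column(mrope_section: [usize; 3]) -> Vec<Axis> {
    let half_dim: usize = mrope_section.iter().sum();
    (0..half_dim)
        .map(|c| {
            if c % 3 == 1 && c < mrope_section[1] * 3 {
                Axis::H
            } else if c % 3 == 2 && c < mrope_section[2] * 3 {
                Axis::W
            } else {
                Axis::T
            }
        })
        .collect()
}

/// Rotary angles for one token: each column multiplies its inverse
/// frequency by the position of the axis that owns the column.
fn mrope_angles(pos: [f32; 3], inv_freq: &[f32], axes: &[Axis]) -> Vec<f32> {
    inv_freq
        .iter()
        .zip(axes)
        .map(|(f, axis)| {
            let p = match axis {
                Axis::T => pos[0],
                Axis::H => pos[1],
                Axis::W => pos[2],
            };
            p * *f
        })
        .collect()
}
```

With `[24, 20, 20]` the 64 columns split 24/20/20, and when all three
positions are equal every column sees the same `p * inv_freq`, which is
why text tokens are unaffected.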
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
agent-0 sent a ~13k-token prompt + image; the TP vision prefill was
single-shot, so it tried to materialise activations for all 12,960
positions at once and OOM'd rank 1 mid-forward. Rank 1 died before
issuing its row-parallel AllReduce, stranding rank 0 on the collective
(it hung holding the pool lock). The text path survives the same size
because it chunks the prefill.
Chunk the vision prefill the same way:
- TpQwen3_5ForCausalLM::prefill_with_images_chunked encodes the image(s)
once, then walks the pre-expanded prompt in prefill_chunk_tokens()
windows, splicing the patch-embedding rows into whichever chunk(s)
carry <|image_pad|> positions (pure-text chunks take the plain
forward). Activation is bounded by the chunk, not the prompt.
- Every rank runs the identical chunk sequence (chunk_size threaded
through GenerateStepWithImages / TpForwardLogitsWithImages /
generate_step_with_images), so the per-chunk AllReduces stay paired
across ranks with no extra sync — the KV cache accumulates via the
growing offset, only the last chunk's logits are kept.
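The chunk walk above can be sketched as a pure planning step: for each
window, which token range it covers and which patch-embedding rows it
must splice. `ChunkPlan` and `plan_chunks` are hypothetical names, and
the real code operates on tensors rather than returning a plan.

```rust
use std::ops::Range;

/// One prefill window: the prompt slice it covers and the slice of the
/// concatenated patch-embedding rows it consumes (empty for pure text).
#[derive(Debug, PartialEq)]
struct ChunkPlan {
    tokens: Range<usize>,
    embed_rows: Range<usize>,
}

/// Walk the pre-expanded prompt in `chunk_size` windows. Every rank that
/// runs this on the same inputs gets the same sequence of chunks.
fn plan_chunks(input_ids: &[u32], image_token_id: u32, chunk_size: usize) -> Vec<ChunkPlan> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut plans = Vec::new();
    let mut next_row = 0; // next unused patch-embedding row
    let mut start = 0;
    while start < input_ids.len() {
        let end = (start + chunk_size).min(input_ids.len());
        // Each placeholder in this window consumes exactly one embedding row.
        let n = input_ids[start..end]
            .iter()
            .filter(|&&t| t == image_token_id)
            .count();
        plans.push(ChunkPlan {
            tokens: start..end,
            embed_rows: next_row..next_row + n,
        });
        next_row += n;
        start = end;
    }
    plans
}
```

Because the plan depends only on the token ids and the chunk size, it
needs no cross-rank communication, which matches the "no extra sync"
property claimed above.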
Pre-flight guard (validate_vision_prefill): even chunked, a long
prompt's KV cache can exhaust VRAM mid-forward, and on TP that hangs
the collective. Reject up front with a clean InsufficientVram when the
estimated footprint exceeds free VRAM, so a doomed request fails fast
instead of hanging the daemon. Heuristic + tunable
(NEURON_VISION_PREFILL_MB_PER_1K_TOKENS / _BASE_MB); default permissive
so the now-working 12,960-token case still passes. Applied to every
vision path (single-GPU + TP); single-GPU vision stays single-shot for
now, so the guard is its protection until it's chunked too.
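A sketch of what such a pre-flight check could look like, assuming a
linear footprint model (a fixed base plus a per-1k-token cost). The
linear form, the rounding up to whole 1k blocks, the struct name and
the numbers in the usage note are assumptions; the sketch takes the
tunables as plain arguments rather than reading the environment.

```rust
/// Hypothetical error payload for a request that cannot fit.
#[derive(Debug, PartialEq)]
struct InsufficientVram {
    needed_mb: u64,
    free_mb: u64,
}

/// Reject a vision prefill up front when the estimated footprint exceeds
/// the free VRAM. Assumed heuristic: base + per-1k cost, rounded up.
fn validate_vision_prefill_sketch(
    prompt_tokens: u64,
    free_mb: u64,
    base_mb: u64,
    mb_per_1k_tokens: u64,
) -> Result<(), InsufficientVram> {
    let blocks = (prompt_tokens + 999) / 1000; // whole 1k-token blocks, rounded up
    let needed_mb = base_mb + mb_per_1k_tokens * blocks;
    if needed_mb > free_mb {
        Err(InsufficientVram { needed_mb, free_mb })
    } else {
        Ok(())
    }
}
```

With illustrative tunables of 512 MB base and 64 MB per 1k tokens, the
12,960-token prompt is estimated at 512 + 64 × 13 = 1344 MB, so it
passes with 2000 MB free and fails fast with 1000 MB free.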
Tests: pre-flight guard behaviour; RPC round-trip carries chunk_size.
The chunked forward is cuda-gated — CI CUDA type-check validates it.
Refs #16 / TP-vision. Operational note: a TP rank OOM still hangs the
daemon (needs restart); making a worker failure abort the leader's
collective is separate, broader TP hardening.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Qwen3.6 chat_template.jinja (now loaded after the precedence fix)
failed to render in minijinja: it uses Python str methods
(content.startswith/endswith/split/rstrip/lstrip) and the raise_exception
global that HF transformers patches into its Jinja env but minijinja
doesn't provide. The render error tripped the text-only fallback, so
image requests still produced zero <|image_pad|> tokens.
Wire the standard bridge into render_chat_template:
- minijinja-contrib `pycompat::unknown_method_callback` supplies the
Python string/list/dict methods;
- a `raise_exception` global maps to a render error (so malformed inputs
— e.g. an image in a system message — surface cleanly).
Add the real Qwen3.6-27B chat_template.jinja (verbatim from beast's HF
cache) as a test fixture and assert it renders one <|image_pad|> for a
text+image turn — the end-to-end check that would have caught this
before deploy.
Refs #16 / TP-vision.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The chat-template loader only read the `chat_template` field from
tokenizer_config.json. Qwen3.6-27B ships its vision-aware template
*only* in a standalone `chat_template.jinja` (and has no
tokenizer_config.json at all), so the loader returned None and image
requests fell back to the text-only format_qwen3_prompt — rendering
zero `<|image_pad|>` tokens and tripping
"expand_image_pad_tokens: prompt has 0 image_token_id occurrences".
load_chat_template_alongside now follows HF transformers precedence:
standalone chat_template.jinja → chat_template.json → the
chat_template field in tokenizer_config.json. Tests cover the
precedence, the text-only fallback, and that an OpenAI image_url
content part renders `<|image_pad|>` through the real template
condition (`'image_url' in item`).
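The precedence order can be sketched as a first-match chain over the
three candidate sources. The enum and function names are hypothetical,
and the sketch takes already-read template strings instead of touching
the filesystem or parsing JSON.

```rust
/// Where a chat template was found, highest precedence first.
#[derive(Debug, PartialEq)]
enum TemplateSource {
    StandaloneJinja,      // chat_template.jinja
    ChatTemplateJson,     // chat_template.json
    TokenizerConfigField, // chat_template field in tokenizer_config.json
}

/// Pick the first available template following the precedence above.
/// Returns None when no source provides one (text-only fallback).
fn pick_chat_template<'a>(
    jinja: Option<&'a str>,
    json: Option<&'a str>,
    tokenizer_field: Option<&'a str>,
) -> Option<(TemplateSource, &'a str)> {
    jinja
        .map(|t| (TemplateSource::StandaloneJinja, t))
        .or_else(|| json.map(|t| (TemplateSource::ChatTemplateJson, t)))
        .or_else(|| tokenizer_field.map(|t| (TemplateSource::TokenizerConfigField, t)))
}
```

The Qwen3.6-27B case is the first arm: only the standalone file exists,
and before this change it was never consulted.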
Refs #16 / TP-vision.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CUDA type-check caught a non-exhaustive match: drain_poisoned()
must reply an error to every Job variant's reply channel, including the
new cuda-gated TpForwardLogitsWithImages. The non-cuda build couldn't
see it — the variant is #[cfg(feature = "cuda")], so the match is
exhaustive without it on CPU.
Refs TP-vision plan Stage 2.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
End-to-end TP-vision: an image request to a TP-loaded Qwen3.6-27B now
conditions on the image across both ranks.
- TpLoadedModel carries has_vision / image_token_id / lm_tokens_per_image,
populated at load via the shared VisionMeta::from_config_path (same
config.json the shards loaded from; Stage 1 materialises the replicated
tower on every rank).
- LoadedHandle::capabilities() now advertises "vision" for TP loads with
a tower (cortex-gateway already unions this into /v1/models via C3).
- The TP rejection guards (chat_completion_tp + inference_tp_stream) are
now conditional on !has_vision — text-only TP models still 400 cleanly,
vision-capable ones fall through.
- chat_completion_tp_inner and the streaming orchestration task detect
images (request_has_images), expand <|image_pad|> to the per-image
patch count, and run a single-shot generate_step_with_images prefill
(every rank encodes + splices its replicated tower) before the
unchanged decode loop. Text requests keep chunked_prefill_tp.
- extract_image_data_uris ships the source data URIs to every rank for
identical per-rank preprocessing.
prompt_tokens now reflects the patch expansion, so usage accounting and
KV offsets match the single-GPU baseline.
TP entry points are cuda-gated (validated by CI's CUDA type-check);
capabilities() + extract_image_data_uris + VisionMeta reuse compile on
the non-cuda build. Full workspace test green.
Refs TP-vision plan Stage 3. Implements #12.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Carry image content through the TP forward path so every rank encodes
and splices locally (replicated tower, no embedding broadcast).
- rpc.rs: new WorkerRequest::GenerateStepWithImages carrying the source
image data URIs + image_token_id for the single-shot vision prefill;
worker still replies GenerateStepOk. Round-trip test added.
- tp_qwen3_5.rs: TpQwen3_5ForCausalLM::forward_with_images — encode each
preprocessed image through the rank's replicated tower, cat, splice,
forward. Shared by leader and worker so every rank runs identical work.
- tp/mod.rs: TpLeaderModel::forward_with_images and
WorkerPool::generate_step_with_images (mirrors generate_step: fan out
GenerateStepWithImages to subprocess ranks, run the leader's image
forward on its device worker thread, drain, combine).
- worker.rs: WorkerModel::forward_with_images + handle_generate_step_with_images
— each subprocess rank preprocesses the same data URIs via the shared
deterministic preprocess_data_uri, encodes, splices, forwards.
- device_worker: Job::TpForwardLogitsWithImages + tp_forward_logits_with_images
dispatch handler + DeviceWorkerHandle::tp_forward_logits_with_images.
Determinism: every rank runs the same preprocess on the same source
URIs through the same replicated tower, so the spliced hidden state
matches across ranks — preserving the replicated-hidden-state invariant
the row-parallel AllReduce relies on, with no NCCL broadcast.
No caller yet — Stage 3 wires the TP chat/stream entry points to invoke
generate_step_with_images for image prefill. cuda-gated plumbing covered
by CI's CUDA type-check; rpc/route/forward_with_images compile on the
non-cuda build.
Refs TP-vision plan Stage 2.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Load the full, unsharded model.visual.* vision tower on every TP rank
(leader + each subprocess worker mmaps the same local safetensors) when
config.vision_config is present. VisionTower::load already takes a
ShardedVarBuilder whose plain .get() returns the full replicated tensor,
so the tower loads identically regardless of world_size — no sharding,
no NCCL broadcast.
- TpQwen3_5ForCausalLM gains vision: Option<VisionTower> + image_token_id,
plus has_vision/image_token_id/encode_image/forward_with_vision,
mirroring the single-GPU Qwen3_5ForCausalLM wrapper.
- TpQwen3_5Model::forward_with_vision mirrors the single-GPU
forward_inner splice: embed locally, replace rows at image_token_id
positions, run the sharded decoder stack. Because every rank encodes
the same pixels through its replicated tower, the spliced input
embeddings are identical across ranks — preserving the TP
replicated-hidden-state invariant the row-parallel AllReduce relies on.
- splice_runs is now pub(crate) and shared with the TP model.
No caller yet — Stage 2 wires the RPC/worker path that invokes
encode_image + forward_with_vision per rank. Most of this compiles on
the non-cuda build (only the cuda load variant's tower line is gated);
CI's CUDA type-check covers the rest.
Refs TP-vision plan Stage 1.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A runtime scheduler lock was accidentally swept into the previous
commit by `git add -A`. Remove it from tracking (file stays on disk)
and ignore the whole `.claude/` dir so local agent runtime state never
lands in the repo again.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The TP inference path has no vision tower, and the TP dispatch in
chat_completion / inference_stream returns before the VisionUnsupported
guard runs — so an image request to a TP-loaded model (e.g. beast's
tp=2 Qwen3.6-27B) was silently dropped and answered from text alone,
the exact issue-#3 confident-hallucination pattern Stage C killed for
single-GPU.
Add the request_has_images → VisionUnsupported guard to both
chat_completion_tp and inference_tp_stream, before prefill / before the
SSE stream opens, so beast returns a clean 400 vision_unsupported. The
guard is unconditional for now (TP has no tower); Stage 3 makes it
conditional on the TP model's has_vision once real TP-vision lands.
Detection is covered by the existing request_has_images unit test; the
guard itself is cuda-gated (validated by CI's CUDA type-check).
Refs TP-vision plan Stage 0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Responses request translator already emits the chat `image_url`
Parts array Stage B5's vision path consumes, and the non-streaming
(`chat_completion`) and streaming (`responses_stream` → `inference_stream`,
Stage C1) Responses paths both route image content to the vision-aware
prefill — so vision works end-to-end through `/v1/responses` with no
translator change required.
Add a multi-image test asserting order preservation and that the
`detail` hint is tolerated (and dropped, since chat image_url has no
analogue), locking the translator's output to the exact
`image_url.url` shape `extract_images_from_request` walks.
Closes part of #16 (Stage C2).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The streaming worker path now splices image embeddings on prefill,
closing the silent text-only degrade for `stream=true` image requests.
`inference_stream` gains the same vision-routing block as the
non-streaming `chat_completion`: detect `image_url` content, reject it
against text-only models with `VisionUnsupported` (before any SSE frame
is sent), preprocess each image and expand its `<|image_pad|>` sentinel
to the per-image patch count, then carry the payload through dispatch.
Rather than duplicate the 75-line `route_token!` reasoning/tool-call
state machine into a sibling streamer, `stream_inference_via_worker`
takes an `Option<(Vec<ImageInput>, u32)>`: when `Some`, prefill is a
single-shot `forward_logits_with_images` splice; when `None`, the
original chunked text-only prefill. Image embeddings are prefill-only,
so every decode step stays on the plain `forward_logits` path and the
shared decode loop is untouched. This keeps exactly one copy of the
tool-call/reasoning logic to maintain.
The Responses API streaming path (`responses_stream`) inherits vision
for free since it drives the same `inference_stream`.
Unit test covers `request_has_images` (the shared routing gate); the
real-weights SSE smoke is the manual curl on beast (cuda-integration).
Closes part of #16 (Stage C1).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ModelEntry and CortexModelEntry gain a `capabilities: Vec<String>`
field (serde-default for back-compat). The poller copies it verbatim
from each neuron's ModelInfo.capabilities; list_models computes the
union across every node where a model is loaded so a checkpoint loaded
text-only on one neuron and text+vision on another reports both to the
fleet. Catalogue-only and mid-prewarm entries default to empty until
the catalogue gains a capabilities declaration.
Aliases inherit their target's capability union. New gateway test mocks
two nodes with differing capability arrays and asserts the unioned
/v1/models response.
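A minimal sketch of the capability union, assuming one capability list
per node on which the model is loaded. The function name and the sorted
output order are assumptions; the commit only specifies that the result
is the union.

```rust
use std::collections::BTreeSet;

/// Union the capability arrays reported by every node that has the
/// model loaded. Output is deduplicated and sorted (assumed ordering).
fn union_capabilities(per_node: &[Vec<String>]) -> Vec<String> {
    let set: BTreeSet<&str> = per_node
        .iter()
        .flatten()
        .map(String::as_str)
        .collect();
    set.into_iter().map(str::to_owned).collect()
}
```

A checkpoint loaded as `["text"]` on one neuron and `["vision", "text"]`
on another therefore reports both capabilities once each, and a model
with no loaded nodes reports an empty list.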
Closes part of #16 (Stage C3).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After both `Start cortex.service` and `Start neuron.service`, sleep 10s
and run `journalctl --unit <unit> -I --no-pager` to record the latest
invocation's log in the workflow output. Step is guarded by
`if: always()` so a failed start still leaves a usable trace.
infra-setup.sh now adds gitea_ci to the systemd-journal group during
user provisioning, so `journalctl` works without a sudoers entry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
First end-to-end run of the deploy workflow succeeded (gitea run #289),
so the operator-run rolling-deploy script and its YAML manifest are no
longer the source of truth — fleet topology lives in
.gitea/workflows/deploy.yml and per-host config in script/infra-setup.sh.
Per-host neuron config comments updated to point at the new sync path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CUDA type-check in CI failed on commit 24968e9 with E0308:
    error[E0308]: mismatched types
      --> crates/neuron/src/harness/candle.rs:1707:33
    1707 | images.clone(),
         | ^^^^^^^^^^^^^^ expected `Vec<ImageInput>`, found `&Vec<ImageInput>`
In Stage B5 the cuda branch of `chat_completion` matches
`&vision_route` to keep the `vision_route: Option<...>` alive for
both arms, which makes `images` bind as `&Vec<ImageInput>`. The
subsequent `images.clone()` call doesn't deep-clone because
`ImageInput` doesn't derive `Clone` — rustc falls back to cloning
the `&Vec` reference, which has the wrong type for the worker job.
The CPU build (non-cuda) compiled fine because that branch is
behind `#[cfg(feature = "cuda")]`; the cuda-check job is what
catches the regression.
Fix: derive `Clone` on `ImageInput`. The clone cost is one
pixel-buffer memcpy per image (~2.3 MiB for 3×448×448 f32 at the fixed
resolution), which is fine on the chat-completion hot path because
vision requests are far less frequent than text-only decode steps.
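A reduced reproduction of the shape of the fix, with a hypothetical
`ImageInput` layout and hypothetical `submit` / `route` helpers standing
in for the worker job and the cuda branch.

```rust
/// Stand-in for the real type; the field layout here is an assumption.
/// Without `Clone` in this derive, `images.clone()` below resolves to
/// `<&Vec<ImageInput> as Clone>::clone` and yields `&Vec<ImageInput>`,
/// which is the E0308 mismatch quoted above.
#[derive(Clone, Debug, PartialEq)]
struct ImageInput {
    pixels: Vec<f32>,
    h: usize,
    w: usize,
}

/// Stand-in for the worker job: takes ownership of the images.
fn submit(images: Vec<ImageInput>) -> usize {
    images.len()
}

/// Matching on a reference keeps `vision_route` alive for the caller,
/// so `images` binds as `&Vec<ImageInput>` and must be deep-cloned.
fn route(vision_route: &Option<(Vec<ImageInput>, u32)>) -> usize {
    match vision_route {
        Some((images, _image_token_id)) => submit(images.clone()),
        None => 0,
    }
}
```

With the derive in place the same call site compiles unchanged, and the
caller's `vision_route` is still usable after `route` returns.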
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage B of the vision plan (doc/vision-qwen3_6-spec.md). Wires
the vision tower from Stage A through to a complete non-streaming
chat completion: extract images from the request, preprocess,
encode on the worker thread, splice embeddings into the LM input
at `<|image_pad|>` positions, and return a coherent text response
with `prompt_tokens` reflecting patch tokens.
Closes the silent-drop class of failures from issue #3 — vision
requests against Qwen3.6 now condition the model on the image
instead of producing confident text-only hallucinations.
Streaming for vision is Stage C. Deferred items tracked under
#12 (TP-vision), #13 (27B production), #14 (dynamic resolution),
#15 (numerical validation).
What landed:
- **B1 — `Qwen3_5Model::forward_with_vision`**: text-only `forward`
unchanged; new method takes `(input_ids, offset, image_embeds,
image_token_id)`, embeds tokens, locates `image_token_id`
positions, splices via the new `splice_runs` helper. MRoPE
applies text-positions to image tokens for Stage B (spatial
MRoPE is the issue #15 numerical-validation follow-up). 2 unit
tests for `splice_runs` covering contiguous + non-contiguous
runs.
- **B2 — `ModelArch::forward_with_vision` dispatch**: routes
Qwen3_5Dense to the new method; other arches return an error.
Defence-in-depth — the HTTP layer (B6) already rejects image
content for non-vision models.
- **B3 — `Job::ForwardLogitsWithImages`**: new worker variant
carrying tokens + per-image `(pixels, c, h, w)` payloads. The
dispatcher encodes each image (device-resident), concatenates
the resulting embeddings, calls `arch.forward_with_vision`, and
returns CPU logits. Image embeddings never copy back to CPU —
the "tensors don't escape the worker" invariant from the
per-device worker refactor still holds. Poisoned-worker drain
path handles the new variant.
- **B4 — Prompt builder**:
- `request_has_images` detects image content cheaply.
- `extract_images_from_request(request, profile)` walks
`MessageContent::Parts`, decodes data URIs, runs
`harness::preprocess::preprocess` per image, returns
`Vec<ImageInput>` in request order.
- `expand_image_pad_tokens(input_ids, image_token_id,
patches_per_image)` walks the tokenized prompt and replaces
each `<|image_pad|>` (id 248056 for Qwen3.6) with N copies
matching the per-image patch count. 4 unit tests.
- `VisionMeta::from_config_path` peeks `config.json` at load
time for `image_token_id`, vision_config patch/merge sizes,
and derives `lm_tokens_per_image` for the Stage B fixed
resolution.
- **B5 — `chat_completion` vision routing**: detects image
content, validates the loaded model has vision, expands the
prompt, and calls a new `run_inference_with_images_via_worker`
helper that does single-shot prefill + standard decode loop
(KV cache holds the post-splice hidden states from prefill, so
decode steps don't re-splice). Stage B skips chunked prefill
for vision — at 448×448 fixed resolution the budget stays well
under the activation-memory threshold. Long-vision chunking is
Stage D follow-up.
- **B6 — `InferenceError::VisionUnsupported`**: structured 400
with `code=vision_unsupported, model_id, suggestion` when an
image request hits a non-vision model. Closes the agent0
failure mode where vision requests degraded silently.
- **B7 — `ModelInfo.capabilities`**: per-model array (`["text"]`
vs `["text", "vision"]`) in `/v1/models` and forwarded verbatim
by cortex-gateway. Lets clients (litellm, agent0) gate
image_url submission on the declared capability set. Optional
in the wire format; defaults to empty for older clients.
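The B4 placeholder expansion can be sketched as below. The `_sketch`
suffix, the per-image slice of patch counts and the `String` error are
assumptions for illustration; only the behaviour (each placeholder
becomes N copies) comes from the description above.

```rust
/// Replace each image placeholder with `patches_per_image[k]` copies of
/// itself, where k is the image's position in request order.
/// Hypothetical sketch, not the committed function.
fn expand_image_pad_tokens_sketch(
    input_ids: &[u32],
    image_token_id: u32,
    patches_per_image: &[usize],
) -> Result<Vec<u32>, String> {
    let found = input_ids.iter().filter(|&&t| t == image_token_id).count();
    if found != patches_per_image.len() {
        return Err(format!(
            "prompt has {found} image_token_id occurrences, expected {}",
            patches_per_image.len()
        ));
    }
    let extra: usize = patches_per_image.iter().sum();
    let mut out = Vec::with_capacity(input_ids.len() - found + extra);
    let mut counts = patches_per_image.iter();
    for &tok in input_ids {
        if tok == image_token_id {
            // Count was validated above, so there is always a next entry.
            let n = *counts.next().expect("placeholder count checked above");
            out.extend(std::iter::repeat(tok).take(n));
        } else {
            out.push(tok);
        }
    }
    Ok(out)
}
```

For one image at the fixed resolution, a three-token prompt with a
single placeholder grows to 2 + 196 = 198 tokens, which is why
`prompt_tokens` rises above 196 for any image request.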
CI gate: cargo fmt --check, cargo clippy --workspace --all-targets
-- -D warnings, cargo test --workspace (all 28 test groups ok,
124 lib tests). New unit-test counts: +2 splice_runs, +4
expand_image_pad.
Manual verification (after RPMs deploy on beast):
    curl http://hanzalova.internal:31313/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d "{\"model\":\"Qwen/Qwen3.6-27B\", \"messages\":[{\"role\":\"user\",\"content\":[
        {\"type\":\"text\",\"text\":\"What's in this image?\"},
        {\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64,...\"}}
      ]}], \"max_tokens\":120}" | jq
Expect prompt_tokens > 196 (text + 196 patch tokens) and a
response that references actual image content.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage A of the vision implementation plan
(doc/vision-qwen3_6-spec.md). Builds the vision tower scaffolding
needed to fix today's silent-drop failure mode (issue #3): the
Qwen3.6 ViT loads from `model.visual.*`, runs a forward pass producing
post-merger LM-side image embeddings, and routes through the
device worker via a new `Job::EncodeImage`. No LM splice yet;
that's Stage B.
Refs #3 (umbrella). Deferred sub-stages tracked as #12 (TP-vision),
#13 (27B production deploy), #14 (dynamic resolution), #15
(numerical validation).
What landed:
- **A0 — investigation**: pulled config.json, preprocessor_config.json,
chat_template.jinja, and safetensors index from beast's local
Qwen3.6-27B cache. Documented in doc/vision-qwen3_6-spec.md with
exact tensor shapes for every `model.visual.*` weight. Confirms
27-block ViT with `hidden_size=1152`, `patch_size=16`,
`spatial_merge_size=2`, `out_hidden_size=5120`. Vision tower lives
in 2 of the 15 safetensors shards.
- **A1 — deps + scaffolding**: added `image = "0.25"` (default-
features off, PNG/JPEG/WebP/BMP/GIF) and `base64 = "0.22"` to
crates/neuron/Cargo.toml. Created `harness::preprocess` and
`harness::arch::qwen3_5::vision` modules.
- **A2 — preprocess.rs**: `decode_data_uri` strips
`data:image/...;base64,...` → image bytes → `image::DynamicImage`
(rejecting `http(s)://` URLs to avoid SSRF/recursion); `preprocess`
resizes to a fixed `PreprocessProfile::qwen3_6()` (448×448),
normalises to `[-1, 1]` per the model's mean/std=0.5, emits
row-major `(3, H, W)` f32. 9 unit tests covering data URI parse,
decode failure paths, grayscale-to-RGB promotion, and the
exact-value normalisation contract.
- **A3 — vision.rs**: `VisionTower` struct with `patch_embed: Conv2d`,
learned `pos_embed: Embedding`, 27 `VisionBlock`s (pre-LN +
multi-head self-attention with fused QKV + GELU-tanh MLP +
residuals), and `VisionMerger` (LayerNorm → 2×2 spatial concat →
linear_fc1 → GELU-tanh → linear_fc2 to LM hidden_size).
Includes the Conv3d→Conv2d fold trick documented at the top of
the file — the published patch_embed.proj.weight is 5D
`(1152, 3, 2, 16, 16)` but candle 0.10 has no Conv3d; for static
images we sum-collapse the temporal axis. Video would need real
Conv3d. 5 unit tests including the exact `gelu_pytorch_tanh`
reference values from PyTorch.
- **A4 — wire vision into Qwen3_5ForCausalLM**: extended `Config`
with optional `vision_config: Option<VisionConfig>` and
`image_token_id`; `Qwen3_5ForCausalLM::new` now loads the vision
tower when present, exposes `has_vision()` and `vision()` so the
HTTP layer can advertise capability and so the encode path can
reach it.
- **A5 — device worker `Job::EncodeImage`**: new job variant carrying
CPU-side `(C, H, W)` pixels. Dispatch handler reconstructs the
tensor on the worker's device, calls `arch.encode_image(image)`,
copies the result back to CPU as flat `Vec<f32>`. Keeps the
"tensors don't escape the worker" invariant. Poisoned-worker
drain path handles the new variant.
- **A6 — dispatch round-trip test**: `encode_image_routes_to_dispatch_
and_errors_on_unknown_handle` proves the channel/dispatch wiring
works end-to-end via the CPU device worker (errors on unknown
ArchHandle, which is the expected behaviour without a loaded
model — real-weights validation happens in Stage B when the LM
splice path exists).
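The A3 Conv3d→Conv2d fold can be illustrated on flat slices. The sketch
assumes the weight is stored row-major as `(out, in, t, kh, kw)` and
that a still image is presented as the same frame repeated `t` times;
`fold_temporal` is a hypothetical name.

```rust
/// Collapse a row-major `(out_c, in_c, t, kh, kw)` kernel into
/// `(out_c, in_c, kh, kw)` by summing over the temporal axis.
/// Hypothetical sketch of the fold described in A3.
fn fold_temporal(
    weight: &[f32],
    out_c: usize,
    in_c: usize,
    t: usize,
    kh: usize,
    kw: usize,
) -> Vec<f32> {
    assert_eq!(weight.len(), out_c * in_c * t * kh * kw, "unexpected weight length");
    let plane = kh * kw;
    let mut folded = vec![0.0f32; out_c * in_c * plane];
    for oi in 0..out_c * in_c {
        for ti in 0..t {
            for p in 0..plane {
                // Source index: (((o * in_c + i) * t + ti) * kh + y) * kw + x
                folded[oi * plane + p] += weight[(oi * t + ti) * plane + p];
            }
        }
    }
    folded
}
```

When every temporal slice of the input is the same frame x, the 3D
response is the sum over t of w_t · x, which equals (sum over t of
w_t) · x, so the folded 2D kernel gives the same output. This does not
hold for real video, as noted above.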
CI gate: cargo fmt --check, cargo clippy --workspace --all-targets
-- -D warnings, cargo test --workspace (all 28 test groups ok,
zero failures). New test counts: +9 in preprocess, +5 in vision,
+1 in device_worker.
Out of scope (deferred):
- LM-side splice of image embeddings at `<|image_pad|>` positions
→ Stage B.
- Streaming SSE for vision-bearing chat completions → Stage C.
- Reject `image_url` with HTTP 400 for non-vision models /
advertise `capabilities` in /v1/models → Stage C.
- TP-vision (#12), 27B production deploy (#13), dynamic resolution
(#14), numerical validation (#15).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace operator-run script/deploy.sh with a CI-driven rolling deploy:
- .gitea/workflows/deploy.yml fires on build-prerelease success (and is
re-runnable via workflow_dispatch). Cortex upgrades first on
hanzalova.internal; the three neuron hosts upgrade in parallel under
fail-fast: false so one failing host doesn't sink the rest.
Concurrency-grouped to serialize overlapping deploys, never cancelling
in-flight runs (a half-applied dnf transaction is worse than a stale
deploy).
- asset/sudoers.d/{cortex,neuron}-host.conf are the canonical source for
the scoped privileges gitea_ci needs on each host kind, installed as
/etc/sudoers.d/helexa_gitea_ci. URLs and = signs are backslash-escaped
per sudoers reserved-character rules.
- script/infra-setup.sh idempotently provisions the gitea_ci user,
installs the runner pubkey, drops in the appropriate sudoers fragment
with visudo verification, and syncs cortex.toml / models.toml /
per-host asset/neuron/<short>.toml — config still ships from operator
workstations rather than CI because the first two are gitignored.
The CI-only secret is RSYNC_SSH_KEY (already configured for the repo);
the matching pubkey is ~/.ssh/id_gitea_ci.pub on the operator's box.
script/deploy.sh and asset/manifest.yml are left in place until the
first end-to-end deploy workflow run succeeds, then removed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 3 of plan-source-aware-loader-preflight. Adds an optional
`source` field to `ModelProfile` and threads it through the
router's cold-load path so a profile pointing at the helexa
registry forwards `helexa:<id>` to neuron's `/models/load`
instead of leaving neuron to substitute its `default_source`
(typically `huggingface`).
Without this, an operator who declares
`source = "helexa"` in models.toml would still see neuron fetch
from HuggingFace — the catalogue → ModelSpec translation in
`profile_to_spec` was dropping the scheme on the floor.
What lands:
- `cortex-core::catalogue::ModelProfile.source: Option<String>`.
None is the default and preserves pre-Phase-3 behaviour.
- `cortex-gateway::router::qualified_model_id(profile)`: a small
pure helper, extracted from `profile_to_spec` so it can be
unit-tested. An empty-string `source` is treated as None so
operators who blank out a previously-set value don't trip neuron's
empty-scheme parse error.
- `models.example.toml` documents the new field with a
commented-out helexa-scheme example pointing back at
neuron.example.toml's matching sources block.
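A sketch of the helper's behaviour as described in the list above. The
reduced `ModelProfile` with only `id` and `source` fields is an
assumption (the real struct carries more), and the `id` field name is
hypothetical.

```rust
/// Reduced stand-in for the catalogue profile; field names assumed.
struct ModelProfile {
    id: String,
    source: Option<String>,
}

/// Prefix the model id with its source scheme when one is set.
/// None and the empty string both pass the bare id through unchanged.
fn qualified_model_id(profile: &ModelProfile) -> String {
    match profile.source.as_deref() {
        Some(scheme) if !scheme.is_empty() => format!("{scheme}:{}", profile.id),
        _ => profile.id.clone(),
    }
}
```

So a profile with `source = "helexa"` and id `org/name` is forwarded as
`helexa:org/name`, while an unset or blank source leaves neuron free to
apply its `default_source`.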
Tests:
- 2 new unit tests in `cortex-core::catalogue`: source-absent
round-trip and source-present round-trip through TOML.
- 3 new unit tests in `cortex-gateway::router`: pass-through
when None, prefix when Some, pass-through on empty-string
source.
- ModelProfile literal in catalogue's existing test updated to
carry `source: None`.
CI gate: cargo fmt --check, cargo clippy --workspace
--all-targets -- -D warnings, cargo test --workspace
(24 test groups ok, zero failures).
Completes Phase 3. With Phases 1+2+3 landed:
- neuron parses `scheme:org/name`, routes per-source hf-hub
Api with disambiguated cache.
- preflight returns structured errors before any device
allocation.
- cortex catalogue declares per-model source jurisdiction
and forwards it to neuron.
The registry itself (registry.helexa.ai service, MinIO,
nginx, mirror fabric) is the next moving piece — landing
under a separate project per the design discussion.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 1 of plan-source-aware-loader-preflight. Makes neuron's
loader treat `huggingface:org/name` and `helexa:org/name` as
first-class distinct sources with per-source endpoint + cache,
while staying backwards-compatible with bare `org/name` ids.
Zero behavior change for existing operator configs.
Motivation: helexa is adding an EU-hosted registry
(`registry.helexa.ai`) alongside HF. Both speak HF-compatible
wire format, but the bytes, jurisdiction, trust root, and cache
namespace are distinct. The loader needs to disambiguate which
registry serves a given model id, and to keep their caches from
colliding on disk when both happen to host the same `org/name`.
What lands:
- `cortex-core::source` — new module. `ModelSourceId { scheme,
org, name }` with `FromStr` accepting both `scheme:org/name`
and bare `org/name`. `Display` round-trips. `repo_path()`
emits the `org/name` half for the hf-hub `Api::model(...)`
call regardless of which scheme/endpoint we're hitting.
Rejects malformed input with typed `ParseError` variants
(empty scheme, missing slash, scheme with `/`, name with
`:`, etc.).
- `neuron::config::CandleHarnessConfig` gains
`default_source: Option<String>` and
`sources: HashMap<String, SourceConfig>`. `SourceConfig`
mirrors what `hf_hub::ApiBuilder` consumes: endpoint URL,
optional `auth_env` (env var name read at startup so secrets
stay out of TOML), and optional cache_dir. Defaults
synthesise a `huggingface` entry pointing at
`https://huggingface.co` with the legacy `hf_cache` field as
its cache_dir — so existing configs that only set `hf_cache`
keep working unchanged.
- `CandleHarness::new(bind_url, &CandleHarnessConfig)` replaces
`CandleHarness::new(bind_url, hf_cache)`. Resolves every
configured source's auth env var and cache dir up front so
`hf_api_for(scheme)` is a pure HashMap lookup on the hot
load path. Only the `huggingface` scheme gets the legacy
`HF_HUB_CACHE`/`HF_HOME` env-var fallback chain; other
schemes resolve to whatever the operator typed.
- `hf_api()` -> `hf_api_for(scheme)`. Builds an
`hf_hub::Api` with the source's endpoint, cache_dir, and
auth token. Errors with a useful message naming the
configured schemes when an unknown scheme is requested.
- `CandleHarness::load_model` parses `spec.model_id` into a
`ModelSourceId`, substitutes `default_source` for bare ids,
and threads the parsed source through `preflight`,
`resolve_files`, `resolve_dense_files`, `load_arch_gguf`,
`load_arch_dense`, and `load_tp`. The hf-hub `Api::model()`
call now uses `source_id.repo_path()` so registry calls hit
the right URL shape regardless of scheme.
- `preflight()` signature gains a `&ModelSourceId` parameter
(it's the canonical id for log lines and error display);
`RepoFetchFailed.model_id` etc. now carry the
scheme-qualified form so operator-visible errors echo
exactly what was configured.
- `neuron.example.toml` documents the new
`[harness.candle.sources.*]` table with commented-out
examples for `huggingface` (explicit override) and `helexa`.
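The parsing rules for `ModelSourceId` can be sketched roughly as follows. This is an illustration only: modelling `scheme` as an `Option<String>`, the exact `ParseError` variant names, and the `or_default_scheme` helper are assumptions, not the shipped definitions.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ModelSourceId {
    pub scheme: Option<String>, // None for a bare `org/name` id (assumed shape)
    pub org: String,
    pub name: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    EmptyScheme,
    SchemeContainsSlash,
    MissingSlash,
    EmptyOrg,
    EmptyName,
    NameContainsColon,
}

impl FromStr for ModelSourceId {
    type Err = ParseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Split off an optional `scheme:` prefix at the first colon.
        let (scheme, rest) = match s.split_once(':') {
            Some((scheme, rest)) => {
                if scheme.is_empty() {
                    return Err(ParseError::EmptyScheme);
                }
                if scheme.contains('/') {
                    return Err(ParseError::SchemeContainsSlash);
                }
                (Some(scheme.to_string()), rest)
            }
            None => (None, s),
        };
        // The remainder must be `org/name`.
        let (org, name) = rest.split_once('/').ok_or(ParseError::MissingSlash)?;
        if org.is_empty() {
            return Err(ParseError::EmptyOrg);
        }
        if name.is_empty() {
            return Err(ParseError::EmptyName);
        }
        if name.contains(':') {
            return Err(ParseError::NameContainsColon);
        }
        Ok(Self { scheme, org: org.to_string(), name: name.to_string() })
    }
}

impl fmt::Display for ModelSourceId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if let Some(scheme) = &self.scheme {
            write!(f, "{scheme}:")?;
        }
        write!(f, "{}/{}", self.org, self.name)
    }
}

impl ModelSourceId {
    /// The `org/name` half, independent of scheme.
    pub fn repo_path(&self) -> String {
        format!("{}/{}", self.org, self.name)
    }

    /// Fills in a scheme for bare ids; explicit schemes are left alone.
    pub fn or_default_scheme(mut self, default: &str) -> Self {
        if self.scheme.is_none() {
            self.scheme = Some(default.to_string());
        }
        self
    }
}
```

With this shape, `"org/name".parse::<ModelSourceId>()` followed by `or_default_scheme("huggingface")` displays as `huggingface:org/name`, while `repo_path()` stays `org/name` for the registry call.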
Tests:
- 13 new unit tests in `cortex-core::source` covering parse /
display round-trip, default-scheme substitution semantics,
and every `ParseError` variant.
- 6 new unit tests in `neuron::config` covering the
`effective_sources` synth (legacy `hf_cache` carry-through,
explicit override preservation, helexa-alongside-huggingface)
and `effective_default_source` fallback.
- 2 new unit tests in `harness::candle::tests` covering
multi-scheme `hf_api_for` routing, including the
"unknown scheme" error path naming configured schemes.
- Preflight integration tests updated to construct
`ModelSourceId` and assert against the scheme-qualified
error form.
CI gate: cargo fmt --check, cargo clippy --workspace
--all-targets -- -D warnings, cargo test --workspace (all 24
test groups ok, zero failures).
Out of scope (Phase 3):
- Cortex catalogue `source` field — independent of Phase 1+2,
ships when the registry comes online.
- `helexa` source endpoint itself — separate project; this
PR adds the client-side rails only.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 2 of plan-source-aware-loader-preflight. Adds a one-RTT
placement feasibility check that runs before any device allocation,
NCCL handshake, or weight fetch. Replaces today's opaque
"fetch config.json … 404" failure mode (when an operator points
`tensor_parallel = 2` at a GGUF-only repo) with a structured
error that names the failure class and points at the fix.
What lands:
- `crates/neuron/src/harness/preflight.rs` — new module. Classifies
a repo's siblings listing into `SourceFormat` (Gguf | DenseSafetensors
| Mixed | Empty), applies the tp/quant feasibility table, returns a
`PlacementPlan` on success or a typed `PreflightError` on rejection.
`PreflightError` is `serde::Serialize` so the HTTP layer can emit
the structured shape verbatim; it's `thiserror::Error` so log lines
get a single-line Display when downcasting from anyhow. Includes
best-effort Levenshtein-nearest suggestion for malformed quant names
(the second sharp edge the HauhauCS scenario surfaced — operator
writes `q6k` against filenames containing `Q6_K_P`, and today's
matcher just says "no GGUF file matching quant").
- `CandleHarness::load_model` — calls `preflight(...)` first thing
after the "already loaded" guard, before any `ensure_device_worker`
or `resolve_*`. Failure wraps the typed error in `anyhow::Error` so
the existing trait surface is unchanged; the HTTP handler and the
startup logger downcast to recover the structured form.
- `crates/neuron/src/api.rs::load_model` handler — maps `PreflightError`
to 422 Unprocessable Entity with `{"error": {"kind": "...",
"model_id": "...", "suggestion": "..." }}`. Other failures keep
the existing 400 + free-form `format!("{e:#}")` shape.
- `crates/neuron/src/startup.rs::load_default_models` — when the
failure is a preflight rejection, log as `reason=<kind> detail=<msg>`
instead of the opaque `error=<chain>`, so journalctl on beast will
now show `reason=tp_requires_safetensors detail="repo is GGUF-only
(8 .gguf files); TP requires dense safetensors..."` instead of
`error=fetch config.json from HauhauCS/...: 404 Not Found`.
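As an illustration of the classifier and feasibility table, here is a simplified sketch. It assumes classification by file extension only and treats `tensor_parallel > 1` as "TP requested"; `EmptyRepo`, `classify`, and `check_placement` are hypothetical names, and the real `PlacementPlan` is reduced to the chosen `SourceFormat`.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SourceFormat {
    Gguf,
    DenseSafetensors,
    Mixed,
    Empty,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PreflightError {
    TpRequiresSafetensors { gguf_files: usize },
    EmptyRepo, // hypothetical variant name
}

/// Classifies a repo's siblings listing by file extension.
pub fn classify(siblings: &[&str]) -> SourceFormat {
    let gguf = siblings.iter().any(|f| f.ends_with(".gguf"));
    let dense = siblings.iter().any(|f| f.ends_with(".safetensors"));
    match (gguf, dense) {
        (true, true) => SourceFormat::Mixed,
        (true, false) => SourceFormat::Gguf,
        (false, true) => SourceFormat::DenseSafetensors,
        (false, false) => SourceFormat::Empty,
    }
}

/// Applies the tp feasibility rules and returns the format to load.
pub fn check_placement(
    siblings: &[&str],
    tensor_parallel: usize,
) -> Result<SourceFormat, PreflightError> {
    match classify(siblings) {
        SourceFormat::Empty => Err(PreflightError::EmptyRepo),
        // GGUF-only repos cannot be loaded tensor-parallel.
        SourceFormat::Gguf if tensor_parallel > 1 => {
            let gguf_files = siblings.iter().filter(|f| f.ends_with(".gguf")).count();
            Err(PreflightError::TpRequiresSafetensors { gguf_files })
        }
        // A mixed repo prefers the safetensors weights.
        SourceFormat::Mixed => Ok(SourceFormat::DenseSafetensors),
        other => Ok(other),
    }
}
```

Because the check only needs the siblings listing, it can run after a single metadata request and before any device is touched.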
Tests:
- 18 unit tests in `harness/preflight.rs` covering classifier,
quant matching, Levenshtein, error serialization, and the full
feasibility table (gguf+tp rejected, gguf+bad-quant suggests
nearest, gguf+good-quant ok, dense+tp ok, empty rejected, mixed
prefers safetensors).
- 7 integration tests in `tests/preflight.rs` exercising the
network path through an axum mock that serves hf-hub-compatible
`/api/models/{org}/{name}/revision/main` payloads. Adds `tempfile`
as a dev-dependency for per-test cache dirs.
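The nearest-quant suggestion could look something like the sketch below. The normalisation step (lowercasing and dropping separators) and the function names are assumptions for illustration; the real matcher may compare names differently.

```rust
/// Classic two-row Levenshtein edit distance over chars.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = (prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1);
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Lowercases and strips separators so `Q6_K_P` compares as `q6kp`.
fn normalise(quant: &str) -> String {
    quant
        .chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .map(|c| c.to_ascii_lowercase())
        .collect()
}

/// Returns the available quant name closest to what the operator typed.
pub fn nearest_quant<'a>(requested: &str, available: &[&'a str]) -> Option<&'a str> {
    let want = normalise(requested);
    available
        .iter()
        .copied()
        .min_by_key(|cand| levenshtein(&want, &normalise(cand)))
}
```

For the scenario in the commit message, `nearest_quant("q6k", &["Q4_K_M", "Q6_K_P", "Q8_0"])` picks `Q6_K_P`, which is what the structured error would offer as the suggestion.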
Out of scope (deferred to subsequent phases):
- Phase 1 (source-aware loader plumbing — `scheme:org/name` parsing,
per-scheme `SourceConfig`, cache disambiguation). Preflight runs
against the single configured HuggingFace source today; the scheme
threading lands cleanly when Phase 1 ships.
- Phase 3 (cortex catalogue source field).
- GGUF tensor-parallel loading. Preflight rejects this combination
with `TpRequiresSafetensors`; the underlying loader gap is the
separate `Helexa` curated-registry / heretic-rs conversation.
Refs #4-#9 architectural follow-up; no specific issue closed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Prerelease build (run 270) failed on commit cb30383 with:
error[E0107]: struct takes 5 generic arguments but 0 generic
arguments were supplied
--> crates/neuron/src/harness/candle.rs:3554:41
|
3554 | decode_stream: &mut tokenizers::DecodeStream<'_>,
| ^^^^^^^^^^^^
The Step-2-era refactor for #6's tool-call extraction added a
nested `async fn route_token` inside `stream_inference_via_worker`
that named `tokenizers::DecodeStream<'_>` as a parameter type.
`DecodeStream` actually takes a lifetime plus five generic type
parameters (`'tok, M, N, PT, PP, D`), which makes naming it explicitly
painful — the working approach the CPU path uses is a macro,
where the body expands inline at the call site and the
decoder type stays inferred.
This commit replicates the CPU-side macro for the CUDA worker
path. Same shape, just with `.await` calls inside (macros tolerate
that since they expand inline into the enclosing async context).
Control flow uses a labelled-block + `consumer_alive` flag rather
than `return` so the macro stays generic over the surrounding
return type.
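The labelled-block pattern can be shown in a synchronous, much-reduced form. Everything here (`Sink`, `Disconnected`, `route_token!`, `drive`) is a hypothetical stand-in; the real macro wraps the decoder and `.await`s its sends.

```rust
/// Hypothetical consumer: accepts `capacity` tokens, then hangs up.
pub struct Sink {
    pub capacity: usize,
    pub received: Vec<u32>,
}

#[derive(Debug)]
pub struct Disconnected;

impl Sink {
    pub fn send(&mut self, token: u32) -> Result<(), Disconnected> {
        if self.received.len() >= self.capacity {
            return Err(Disconnected);
        }
        self.received.push(token);
        Ok(())
    }
}

// The body expands inline at the call site, so no parameter types have to
// be named. It never uses `return`: early exit is `break 'route`, and the
// caller learns about a dead consumer through the `$alive` flag.
macro_rules! route_token {
    ($sink:expr, $alive:ident, $token:expr) => {
        'route: {
            if !$alive {
                break 'route;
            }
            if $sink.send($token).is_err() {
                $alive = false;
                break 'route;
            }
        }
    };
}

/// Feeds tokens until the consumer goes away; returns how many were tried.
pub fn drive(tokens: &[u32], sink: &mut Sink) -> usize {
    let mut consumer_alive = true;
    let mut attempted = 0;
    for &t in tokens {
        if !consumer_alive {
            break;
        }
        attempted += 1;
        route_token!(sink, consumer_alive, t);
    }
    attempted
}
```

Passing the flag in as an identifier matters because `macro_rules!` hygiene would otherwise hide a local declared at the call site from the macro body.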
The CPU build (default-feature workspace, what `clippy` and `test`
jobs exercise) doesn't compile this `#[cfg(feature = "cuda")]`
branch, which is why local CI green-lit it. The cuda-check job
should catch this category of breakage now that #cb30383+CI-fix
landed; this commit just resolves the actual breakage on the
prerelease workflow.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes #9.
Replaces the hardcoded `format_qwen3_prompt` ChatML glue with
`minijinja`-driven rendering of the model's own `chat_template`
from `tokenizer_config.json`. The request's `chat_template_kwargs`
flow into the Jinja context so model-specific levers
(Qwen3's `enable_thinking: false`, etc.) actually take effect.
## Implementation
- New `harness::chat_template` module with three entry points:
- `load_chat_template_alongside(tokenizer_json_path)` — probes
`tokenizer_config.json` in the same hf-hub snapshot directory.
Supports both the canonical string-form `chat_template` and
the array-form some tokenizers ship (multi-template models).
- `render_chat_template(template, messages, tools, kwargs)` —
renders via `minijinja`. Messages flatten into the
`[{role, content}]` shape HF templates iterate, with
per-message extras (`tool_calls`, `tool_call_id`) preserved.
`tools` and `kwargs` are added to the Jinja context so templates
that reference them work without us interpreting their shape.
- `chat_templates_enabled()` reads `NEURON_USE_CHAT_TEMPLATE`
(default true). Falsy values force the fallback path
everywhere — a kill switch for emergency rollback without a
rebuild.
- `LoadedModel.chat_template: Option<String>` and the TP
equivalent are populated once at load time. `None` (no
tokenizer_config.json, parse error, missing field) routes the
fallback path silently; logs go through `tracing::debug`/`warn`
per condition.
- New `build_prompt_for_request(chat_template, request)` wraps
the decision: when both the template is present AND the kill
switch is off, render with kwargs from `request.extra` (looks
up `chat_template_kwargs` and `tools` lazily). On render error
→ warn + fallback to `format_qwen3_prompt`. Wired into all four
current prompt-build sites (single-GPU stream + non-stream, TP
stream + non-stream).
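The kill switch and the render-or-fallback decision can be sketched with the standard library alone. The set of falsy strings and the closure-based `build_prompt` signature are assumptions for illustration; the real code renders through `minijinja` and falls back to `format_qwen3_prompt`.

```rust
/// Interprets the raw env value: unset means enabled, falsy means disabled.
/// The accepted falsy spellings here are an assumption.
pub fn kill_switch_allows(raw: Option<&str>) -> bool {
    match raw.map(|v| v.trim().to_ascii_lowercase()) {
        None => true,
        Some(v) => !matches!(v.as_str(), "0" | "false" | "no" | "off"),
    }
}

/// Reads `NEURON_USE_CHAT_TEMPLATE` from the process environment.
pub fn chat_templates_enabled() -> bool {
    kill_switch_allows(std::env::var("NEURON_USE_CHAT_TEMPLATE").ok().as_deref())
}

/// Renders with the template when one is present and templates are enabled;
/// otherwise, or when rendering fails, uses the fallback formatter.
pub fn build_prompt<R, F>(
    template: Option<&str>,
    enabled: bool,
    render: R,
    fallback: F,
) -> String
where
    R: FnOnce(&str) -> Result<String, String>,
    F: FnOnce() -> String,
{
    match template {
        Some(t) if enabled => render(t).unwrap_or_else(|_err| fallback()),
        _ => fallback(),
    }
}
```

Splitting the env read from the string interpretation keeps the truthy/falsy/unset matrix testable without mutating process state.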
## Dependency
`minijinja = "2"` with the `builtins`, `json`, and `serde`
features. Pure-Rust Jinja2 implementation, ~80KB compiled. Used
internally by HF's `tokenizers-rs` for its own chat templating;
the API surface we touch (`Environment::add_template` +
`Template::render(serde_value)`) is stable.
## Validation strategy
I can't byte-compare the new path's output against
`format_qwen3_prompt` for live models without GPU (CI doesn't
have one). The fallback path and kill switch are the mitigations
— a deploy can flip `NEURON_USE_CHAT_TEMPLATE=false` in the
neuron service env if the chat template renders surprisingly on
Qwen3-8B in production. The legacy formatter stays in place as
the fallback whenever rendering is disabled or fails.
## Scope cuts (documented in module header)
- Tool-definition lifting from helexa-acp's system-prompt
injection into the chat_template's native tools block is
deferred. Today the request's `tools` array threads into the
Jinja context, but helexa-acp continues to inject Hermes-format
tool descriptions into the system prompt for backwards-compat
with non-cortex endpoints.
## Tests
9 unit tests in `chat_template`: kill-switch matrix (truthy /
falsy / unset), template loading (string form, array form,
missing file, unparseable JSON, missing field), rendering
(basic conversation threading, kwargs forwarding, message-extras
threading for tool_calls).
215 workspace tests pass; clippy + fmt clean across all workspace
features (default).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refs #7.
OpenAI's Responses API spec emits `response.in_progress` between
`response.created` and the first output-item event to mark
"request validated, model is generating". Some Responses-API
clients distinguish loading-spinner vs streaming-spinner UI based
on which event arrived last; emitting both keeps the wire shape
matched.
Carries the same shell as `response.created` (status=in_progress,
empty output, no usage yet) — both events are payload-light
bookkeeping, distinguished only by the event name.
The hosted-tool event families remaining in #7 (web_search_call,
code_interpreter_call, file_search_call, image_generation_call)
stay deferred until the underlying tools exist in neuron.
Updated `full_stream_emits_expected_event_sequence` to assert the
new event lands in position 1; downstream indexing shifted by one
across the existing test assertions. CI green, fmt + clippy clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
act launches step shells without sourcing /etc/profile, so the
gitea_runner user's PATH lacks /usr/local/cuda-13.0/bin. cudarc's
build.rs panics with ENOENT on `nvcc --version` under the neuron
crate's cuda-version-from-build-system feature. build-prerelease.yml
already does this export — mirror it here.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes #6.
Same model-agnostic seam as #8 but for tool-call markers
(`<tool_call>` / `</tool_call>` on Qwen3-Coder, Hermes-format,
DeepSeek-Coder, gpt-oss, …). Lets Zed's tool-use feature and any
other vanilla OpenAI chat client get structured `tool_calls` deltas
out of cortex without having to parse markers themselves.
## Implementation
1. **Tokenizer probe at load time** (`detect_tool_call_token_pair`
in `wire::event`) — same shape as the reasoning-marker probe
from #8. Both open AND close must resolve to single token ids;
non-tool-use models get `None` and pass through unchanged.
Stored on `LoadedModel.tool_call_tokens` and the TP analogue.
2. **New `InferenceEvent::ToolCall` variant** — carries `index`
(call slot, per-turn counter), generated `id` (`call_<hex>_<idx>`),
`name`, and the complete `arguments` JSON string. One event per
parsed call.
3. **Token-level state machine** in all three streaming paths
(CPU `run_inference_streaming`, CUDA single-GPU
`stream_inference_via_worker`, CUDA TP `chat_completion_tp_stream`)
layered on top of #8's reasoning routing:
- `<tool_call>` token → enter buffering state, clear buffer.
- Tokens while buffering → accumulate into `tool_call_buf`
via the decoder (so multi-byte UTF-8 still buffers correctly)
without emitting anything visible.
- `</tool_call>` token → take the buffer, parse with
`parse_tool_call_body` (extract `name` + `arguments`),
emit a structured `ToolCall` event with a fresh `call_<hex>`
id and the parsed fields.
- On parse failure → fall back to re-emitting the original
`<tool_call>{buf}</tool_call>` block as plain text content
so helexa-acp's existing `ToolCallParser` repair passes still
have a chance to recover the call.
4. **OpenAI chat projector** emits the OpenAI streaming
`tool_calls` delta shape on `InferenceEvent::ToolCall` —
`{tool_calls: [{index, id, type:"function",
function:{name, arguments}}]}`. One chunk per call slot.
5. **OpenAI Responses projector** drops `ToolCall` events for
now (Responses-side function_call event family routing tracked
under #7); the chat path is what unblocks Zed's tool use today.
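The buffering state machine from step 3 can be sketched as below. This is a simplified illustration: `ToolCallRouter` and `Event` are hypothetical names, the body parser is injected as a closure standing in for `parse_tool_call_body`, and the generated `call_<hex>` id is omitted.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Event {
    Text(String),
    ToolCall { index: usize, name: String, arguments: String },
}

pub struct ToolCallRouter {
    open_id: u32,
    close_id: u32,
    buffering: bool,
    buf: String,
    next_index: usize,
}

impl ToolCallRouter {
    pub fn new(open_id: u32, close_id: u32) -> Self {
        Self { open_id, close_id, buffering: false, buf: String::new(), next_index: 0 }
    }

    /// `piece` is the decoded text for `token_id`; `parse` extracts
    /// `(name, arguments)` from a buffered body or returns None.
    pub fn push(
        &mut self,
        token_id: u32,
        piece: &str,
        parse: impl Fn(&str) -> Option<(String, String)>,
    ) -> Option<Event> {
        if token_id == self.open_id {
            // Open marker: start buffering, emit nothing.
            self.buffering = true;
            self.buf.clear();
            return None;
        }
        if token_id == self.close_id && self.buffering {
            self.buffering = false;
            let body = std::mem::take(&mut self.buf);
            return Some(match parse(&body) {
                Some((name, arguments)) => {
                    let index = self.next_index;
                    self.next_index += 1;
                    Event::ToolCall { index, name, arguments }
                }
                // Parse failure: re-emit the original block as plain text.
                None => Event::Text(format!("<tool_call>{body}</tool_call>")),
            });
        }
        if self.buffering {
            self.buf.push_str(piece);
            None
        } else {
            Some(Event::Text(piece.to_string()))
        }
    }
}
```

The fallback arm is what lets a downstream repair pass still see the raw `<tool_call>` block when the model emits a malformed body.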
## Acceptance
- Vanilla OpenAI chat clients (Zed's tool-use feature, any other
OpenAI-compatible tool-call consumer) get structured tool_calls
deltas against cortex+neuron without having to parse `<tool_call>`
markers in content.
- helexa-acp continues to work — when neuron parses cleanly, it
consumes the structured deltas through its existing decoder.
When the model emits malformed JSON, neuron falls back to text
pass-through and helexa-acp's `ToolCallParser` recovers via the
same path it always did.
- Models without tool-call markers in their tokenizer pass through
unchanged.
- No hardcoded model knowledge — entirely driven by tokenizer
metadata.
## Tests
2 new detection tests in `wire::event` (Qwen3-style marker
detection, no-marker case). The streaming paths themselves stay
covered by the existing chat-completions integration tests; full
end-to-end exercise of the new path requires GPU-loaded models
and lives outside the CI test surface.
215 workspace tests pass; clippy + fmt clean across the
workspace.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes #8.
Reasoning-capable models (Qwen3, DeepSeek-R1, gpt-oss, Mistral
Magistral, …) emit `<think>...</think>` blocks inline in their
content stream. The chat-completions wire format has no slot for
reasoning, so until this change every consumer either parsed the
markers themselves (helexa-acp) or wrote the raw scratchpad
content into their UI (Zed's commit-message generator — visible
as the leaked reasoning block on every generated commit message
against benjy's Qwen3-8B).
## Implementation, model-agnostic by design
The neuron side now does token-level routing without any
hardcoded model knowledge:
1. **At load time** (`detect_reasoning_token_pair` in
`wire::event`), probe the tokenizer's vocabulary for a known
reasoning-marker pair: `<think>` / `</think>` (Qwen3,
DeepSeek-R1, gpt-oss), `[THINK]` / `[/THINK]` (Mistral
Magistral), and a couple of derivatives. Each marker must
resolve to a single token id; if both open and close resolve,
stash on `LoadedModel.reasoning_tokens` (similarly
`TpLoadedModel`). Non-reasoning models get `None` and pass
through unchanged.
2. **At inference time**, the three streaming paths
(`run_inference_streaming` CPU, `stream_inference_via_worker`
CUDA single-GPU, `chat_completion_tp_stream` CUDA TP) now
check each sampled token against the pair via the new
`handle_reasoning_marker` helper before feeding it to the
detokeniser. Open marker → set `in_reasoning = true`, drop
the marker. Close marker → unset, drop. Other tokens go
through `emit_delta(_blocking)` which now picks
`ReasoningDelta` or `TextDelta` based on state. Markers
never appear in the streamed output.
3. **In `wire::openai_chat`**, the projector splits into:
- `project_chat_stream` (unchanged signature; default
behaviour — drops `ReasoningDelta`)
- `project_chat_stream_with(rx, …, ChatProjectionConfig)` —
when `include_thinking: true` and `reasoning_markers:
Some(_)`, re-wraps reasoning content with the literal
open/close marker text and emits as content deltas.
Preserves the on-the-wire shape that helexa-acp's
`ThinkParser` expects.
4. **HTTP handler** reads `x-include-thinking: true` (case-
insensitive `1`/`true`/`yes`) from the request headers and
threads it into the projection config. cortex-gateway already
forwards arbitrary headers verbatim, so the opt-in works
end-to-end without gateway changes.
5. **helexa-acp's `openai_chat` provider** sets
`x-include-thinking: true` on every request so its existing
`ThinkParser` keeps receiving the marked content stream.
`ThinkParser` itself is unchanged — needed for endpoints that
aren't reasoning-aware (OpenRouter, OpenAI directly, etc.).
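A reduced sketch of the inference-time routing and the header check follows. `ReasoningRouter`, `Delta`, and `wants_thinking` are hypothetical names, and token text is passed in pre-decoded rather than going through the detokeniser.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Delta {
    Reasoning(String),
    Text(String),
}

pub struct ReasoningRouter {
    /// `(open_id, close_id)` detected at load time; None for other models.
    pair: Option<(u32, u32)>,
    in_reasoning: bool,
}

impl ReasoningRouter {
    pub fn new(pair: Option<(u32, u32)>) -> Self {
        Self { pair, in_reasoning: false }
    }

    /// Returns true when the token was a marker and has been consumed.
    pub fn handle_marker(&mut self, token_id: u32) -> bool {
        match self.pair {
            Some((open, _)) if token_id == open => {
                self.in_reasoning = true;
                true
            }
            Some((_, close)) if token_id == close => {
                self.in_reasoning = false;
                true
            }
            _ => false,
        }
    }

    /// Markers are dropped; other tokens are tagged by the current state.
    pub fn route(&mut self, token_id: u32, piece: &str) -> Option<Delta> {
        if self.handle_marker(token_id) {
            return None;
        }
        Some(if self.in_reasoning {
            Delta::Reasoning(piece.to_string())
        } else {
            Delta::Text(piece.to_string())
        })
    }
}

/// Interprets an `x-include-thinking` header value, case-insensitively.
pub fn wants_thinking(header: Option<&str>) -> bool {
    matches!(
        header.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("1" | "true" | "yes")
    )
}
```

A model whose tokenizer has no marker pair gets `pair: None`, so every token falls through to `Delta::Text` and the stream is unchanged.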
## Acceptance
- Zed's commit-message generator (vanilla chat-completions
client, no `x-include-thinking`) gets clean commit messages
with no `<think>` block.
- helexa-acp sessions continue to render thinking in Zed's
thought UI via the opt-in path.
- Models without reasoning tokens declared in their tokenizer
pass through unchanged.
- Implementation contains zero references to "qwen3" or any
specific model — entirely driven by tokenizer metadata.
## Tests
9 new tests in `wire::event` (token-pair detection across 4
marker conventions, edge cases) and `wire::openai_chat` (default
drop, opt-in re-wrap with multi-chunk reasoning, close-marker on
Finish, fallback when markers absent, off-switch with markers
present). All 213 workspace tests pass; fmt + clippy clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 7. Walks a new user from "never heard of helexa-acp" to
"chatting via Zed against helexa or a public API in 10 minutes":
- crates/helexa-acp/README.md — install (from source / COPR),
quick-start env-var path, multi-endpoint TOML, full Zed setup,
endpoint cookbook (cortex/neuron, OpenAI, Anthropic, OpenRouter,
LM Studio, multi-cortex), three session modes (Default / Bypass /
Plan) with their tool tables, tool surface + path-handling rules,
session resume, context compaction, troubleshooting for the
five failure modes a new user is likely to hit, and architecture
reference for contributors.
- helexa-acp.example.toml — copy-paste-and-edit starter config at
the repo root, mirroring the existing cortex.example.toml /
neuron.example.toml pattern.
No code changes. fmt + clippy clean as a sanity check.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 6b. Third provider impl, completing the wire-format trio
(openai-chat, openai-responses, anthropic-messages). Lets a
helexa-acp endpoint configured with `wire_api = "anthropic-messages"`
drive Claude models — either against Anthropic directly or via
cortex's /v1/messages translation surface.
## Encoder (CompletionRequest → Anthropic body)
- System messages flatten to the top-level `system` field
(concatenated with blank lines when there are multiple).
- User text → `{role:"user", content:"..."}`.
- User MultiPart (text + images) → `content` array with Anthropic's
distinct image shape: `{type:"image", source:{type:"base64",
media_type, data}}` — structurally different from OpenAI's
`image_url` data URI.
- Assistant text → `{role:"assistant", content:"..."}`.
- Assistant tool_calls → `content` array with optional `{type:"text"}`
block plus one `{type:"tool_use", id, name, input:<parsed json>}`
per call. The internal arguments JSON string is parsed back to a
Value before encoding (Anthropic requires the parsed form);
malformed JSON falls back to a String input so the request body
still serialises.
- Tool result → `{role:"user", content:[{type:"tool_result",
tool_use_id, content}]}` per Anthropic's convention (no separate
`tool` role).
- `max_tokens` is required by Anthropic; defaults to 8192 when the
request doesn't specify.
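The system-message flattening and the `max_tokens` default can be illustrated with a text-only sketch. The `Message` and `Encoded` types are hypothetical simplifications (no images or tool calls), and joining with `"\n\n"` is an assumed reading of "concatenated with blank lines".

```rust
pub const DEFAULT_MAX_TOKENS: u32 = 8192;

/// Hypothetical text-only message type.
pub enum Message {
    System(String),
    User(String),
    Assistant(String),
}

/// Simplified view of the encoded request body.
pub struct Encoded<'a> {
    pub system: Option<String>,
    pub turns: Vec<(&'static str, &'a str)>, // (role, content)
    pub max_tokens: u32,
}

pub fn encode<'a>(messages: &'a [Message], max_tokens: Option<u32>) -> Encoded<'a> {
    let mut system_parts: Vec<&str> = Vec::new();
    let mut turns = Vec::new();
    for message in messages {
        match message {
            // System messages leave the turn list and go to the top level.
            Message::System(text) => system_parts.push(text.as_str()),
            Message::User(text) => turns.push(("user", text.as_str())),
            Message::Assistant(text) => turns.push(("assistant", text.as_str())),
        }
    }
    let system = if system_parts.is_empty() {
        None
    } else {
        Some(system_parts.join("\n\n"))
    };
    Encoded {
        system,
        turns,
        // Anthropic requires `max_tokens`, so fall back to the default.
        max_tokens: max_tokens.unwrap_or(DEFAULT_MAX_TOKENS),
    }
}
```

Collecting system parts in order before joining preserves the relative ordering of multiple system messages.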
## Decoder (Anthropic SSE → CompletionEvent)
Named SSE events:
- `message_start` → captures input_tokens from `usage` for the
eventual UsageStats.
- `content_block_start` (type=text) → TextDelta (initial text, if any).
- `content_block_start` (type=tool_use) → ToolCallStart; if a
pre-buffered `input` is present, also emits a single
ToolCallArgsDelta.
- `content_block_start` (type=thinking, for extended-thinking
models) → ReasoningDelta.
- `content_block_delta` (text_delta) → TextDelta.
- `content_block_delta` (input_json_delta) → ToolCallArgsDelta,
correlated by block index.
- `content_block_delta` (thinking_delta) → ReasoningDelta.
- `message_delta` → Usage (final output_tokens) + Finish with
stop_reason mapped: end_turn/stop_sequence → "stop", max_tokens
→ "length", tool_use → "tool_calls".
- `message_stop` → stream terminates.
- `ping` ignored (Anthropic's keep-alive).
- `error` → yields Err and ends the stream.
## Wiring
- Authentication: `x-api-key` + `anthropic-version: 2023-06-01`
headers (not Bearer). Both ship when api_key is configured;
servers that don't care (cortex) ignore them.
- `WireApi::AnthropicMessages` in build_provider now constructs
the provider instead of erroring "reserved for future".
- `provider::mod.rs` registers the new module.
18 new unit tests: encoder (system collapse, multi-system concat,
default max_tokens, multipart with image, tool_use blocks, tool
results, malformed JSON arg fallback), decoder (text streaming,
tool_use lifecycle, max_tokens→length mapping, empty deltas, ping
events, error events, cancellation, malformed payload skip,
thinking blocks).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI run 255 job 3 (CUDA type-check) fails with:
error: could not execute process `*** rustc -vV` (never executed)
Caused by: No such file or directory (os error 2)
The redacted `***` is `sccache`. The ci.yml workflow-level env block
sets `RUSTC_WRAPPER: sccache` because the generic `rust` runner has
sccache installed and routes the cache to caveman.kosherinata.internal.
The new `cuda-check` job runs on `cuda-13.0` (where nvcc lives), and
that runner doesn't carry sccache on PATH — so cargo's first action
(`sccache rustc -vV` to probe the compiler version) fails before
borrow-check even starts.
`build-prerelease.yml`, which uses the same `cuda-13.0` runner for
the actual release neuron builds, deliberately does NOT set
RUSTC_WRAPPER. That's the pattern this commit applies.
Fix: override `RUSTC_WRAPPER` (plus the SCCACHE_* and AWS_* env vars)
locally on the job. We lose caching on the cuda-check job (it's
borrow-check-only and finishes in a couple minutes anyway), but
the gate runs.
The job's purpose — fail fast on `#[cfg(feature = "cuda")]`
borrowck errors that the default-feature gate misses — is what
matters, and that purpose was undermined by the env inheritance.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 6a. Implements the `Provider` trait for OpenAI's Responses
API surface, parallel to the existing `OpenAIChatProvider`. Lets a
helexa-acp endpoint configured with `wire_api = "openai-responses"`
drive a `/v1/responses` server (today: neuron through cortex; later:
OpenAI directly) using the same agent-loop machinery the chat
provider already supports.
## Encoder (CompletionRequest → Responses body)
- System messages collapse into a single top-level `instructions`
string. Multiple system messages concatenate with blank lines so
ordering is preserved.
- User messages become `{type:"message", role:"user", content:…}`
input items. Text content stays a bare string; MultiPart content
(text + images, post-Stage 5) becomes a
`[{type:"input_text"}, {type:"input_image"}]` array with images
encoded as `data:{mime};base64,{data}` URIs — exactly the shape
neuron's `wire::openai_responses::request_to_chat` accepts.
- Assistant text turns become an `output_text` content part inside
a `message` item.
- Assistant tool-call turns become `function_call` input items.
- Tool result turns become `function_call_output` input items.
- `max_tokens` translates to `max_output_tokens`.
## Decoder (Responses SSE → CompletionEvent)
Reads named events on the SSE `event:` line:
- `response.output_text.delta` → `CompletionEvent::TextDelta`
- `response.output_item.added` with `type:"function_call"` →
`CompletionEvent::ToolCallStart` (and, when the upstream
pre-buffers fully, a single `ToolCallArgsDelta`)
- `response.function_call_arguments.delta` →
`CompletionEvent::ToolCallArgsDelta`, correlated back to the
tool-call slot by output_index.
- `response.completed` → `CompletionEvent::Usage` (if present) +
`CompletionEvent::Finish` with reason mapped from `status`:
`"completed"` → `"stop"`, `"incomplete"` → `"length"`.
- Bookkeeping events (`response.created`, `response.in_progress`,
`*.content_part.*`, `*.output_text.done`, `*.output_item.done`,
`*.function_call_arguments.done`, reasoning_*) are skipped.
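The decoder's dispatch on event names reduces to a small match. The sketch below only classifies events and maps the final status; `Action`, `classify_event`, and `finish_reason` are hypothetical names, and payload parsing is left out.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    TextDelta,
    ToolCallStart,
    ToolCallArgsDelta,
    Finish,
    Skip,
}

/// Chooses what to do with a named SSE event. `item_type` is the `type`
/// of the output item, when the event carries one.
pub fn classify_event(event: &str, item_type: Option<&str>) -> Action {
    match (event, item_type) {
        ("response.output_text.delta", _) => Action::TextDelta,
        ("response.output_item.added", Some("function_call")) => Action::ToolCallStart,
        ("response.function_call_arguments.delta", _) => Action::ToolCallArgsDelta,
        ("response.completed", _) => Action::Finish,
        // Bookkeeping and reasoning events are skipped.
        _ => Action::Skip,
    }
}

/// Maps the Responses `status` onto a chat-style finish reason.
/// Returning None for other statuses is an assumption of this sketch.
pub fn finish_reason(status: &str) -> Option<&'static str> {
    match status {
        "completed" => Some("stop"),
        "incomplete" => Some("length"),
        _ => None,
    }
}
```

Defaulting unknown events to `Skip` keeps the decoder tolerant of event families the upstream adds later.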
## Wiring
- `EndpointConfig::responses_url()` joins `{base_url}/responses`.
- `WireApi::OpenAiResponses` in `build_provider` constructs the new
provider (was previously a "reserved for future" error).
- `provider::mod.rs` registers the new module.
## Cuts (carried over from neuron-side issues)
- The decoder's `ToolCall*` handling fires correctly when the
upstream emits `function_call` items, but the neuron candle
harness doesn't yet (Refs #6). Real tool-call testing against
cortex+neuron stays on the chat path until #6 lands.
- Reasoning events (`response.reasoning_*`) are deliberately
dropped today; once neuron emits `InferenceEvent::ReasoningDelta`
(Refs #5) the projector on the neuron side will start firing the
reasoning event family and this decoder will need a matching
case to route them to `CompletionEvent::ReasoningDelta`.
13 new unit tests cover encoder (system collapse, multipart user
input, assistant output_text encoding, tool-call round-trip via
function_call items) and decoder (text streaming, empty deltas
dropped, length finish, function_call lifecycle, inline-arguments
shape, cancellation, malformed payload skip).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Step 3 of the Responses rollout: plain proxy route on the gateway,
no translation. Neuron speaks the Responses API natively after Step
2 (commit 957f704), so the gateway just needs the same routing
shape it uses for /v1/chat/completions — extract `model`, resolve
via router::resolve, forward verbatim.
- New `POST /v1/responses` handler in handlers.rs::responses.
- Mock neuron under tests/common/mod.rs gains a `/v1/responses`
endpoint that mirrors the ResponsesResponse shape neuron emits.
- New integration test file `tests/responses.rs` exercises:
- Happy path (200, body round-trips, ResponsesUsage shape).
- Unknown model → 404 (matches chat-completions error shape).
- Missing `model` field → 400 (same extract_model helper).
Streaming proxy works through the same path as chat completions —
the upstream Content-Type (`text/event-stream` for stream:true,
`application/json` otherwise) propagates through proxy_with_metrics
unchanged. Live-stream integration tests against a streaming mock
are deferred until we exercise the path against a real neuron, since
the chat-completions streaming test already covers the proxy's
SSE forwarding mechanics.
Three new tests; clippy + fmt clean across the workspace.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Step 2 of the Responses rollout: native `/v1/responses` endpoint on
neuron that consumes the same InferenceEvent stream as
`/v1/chat/completions` but emits it as the Responses API's named
SSE event family. No gateway-side translation.
## Surface
- `cortex-core::responses` envelope types: `ResponsesRequest`,
`ResponsesInput` (text | items), `ResponsesInputItem` (message |
function_call | function_call_output | reasoning),
`ResponsesContentPart` (input_text | input_image | output_text),
`ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. Plus
a `events::*` constant module so the projector and the wire shape
stay in sync without string-typos.
- `neuron::wire::openai_responses`:
- `request_to_chat(req)` flattens Responses input + instructions
into a `ChatCompletionRequest` the candle harness already
understands. Text-only Parts collapse to a string; mixed
text+image Parts go to chat's content-array shape; reasoning
items drop; function_call / function_call_output round-trip
via tool_calls / tool_call_id metadata so the surface is
consistent for the day the harness emits tool calls.
- `project_responses_stream(rx, meta)` reads InferenceEvents
and emits the eight named events that compose a Responses
stream: response.created → output_item.added → content_part.added
→ output_text.delta×N → output_text.done → content_part.done
→ output_item.done → response.completed. Synthesises start
frames if the producer skips Start (poisoned model, early
disconnect) so the stream stays coherent.
- `build_response(meta, text, reason, usage)` for the
non-streaming path.
- `CandleHarness::inference_stream(req)` extracted from
`chat_completion_stream`, returning a typed `InferenceStream`
(event receiver + id/created/model_id metadata). Both
`chat_completion_stream` and the new `responses_stream` are now
thin wrappers that pick their wire projection. TP path got the
same treatment (`chat_completion_tp_stream` → `inference_tp_stream`).
- `POST /v1/responses` route on neuron. Non-streaming returns one
buffered `ResponsesResponse`; streaming returns axum SSE with
both event names and JSON data per frame (Responses, unlike
chat completions, uses named `event:` lines). The
`inference_error_response` helper is hoisted out so the chat and
responses handlers share the InferenceError → HTTP mapping.
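A minimal sketch of the event ordering the projector is described as producing. Assumptions: the full `response.`-prefixed event names follow the public Responses API (the list above abbreviates them), `project_event_names` is a hypothetical stand-in for `project_responses_stream` that returns names only, and closing frames are emitted even when `Finish` never arrives.

```rust
/// Format-agnostic events, mirroring the four variants named in this log.
#[derive(Debug, Clone, PartialEq)]
pub enum InferenceEvent {
    Start,
    TextDelta(String),
    ReasoningDelta(String),
    Finish,
}

/// Opening frames, emitted once per stream.
const OPENING: [&str; 3] = [
    "response.created",
    "response.output_item.added",
    "response.content_part.added",
];

/// Returns the ordered SSE event names for one stream of inference events.
pub fn project_event_names(events: &[InferenceEvent]) -> Vec<&'static str> {
    let mut out: Vec<&'static str> = Vec::new();
    for ev in events {
        // Synthesise the opening frames if the producer skipped Start.
        if out.is_empty() {
            out.extend(OPENING);
        }
        match ev {
            InferenceEvent::Start => {}
            InferenceEvent::TextDelta(_) => out.push("response.output_text.delta"),
            // Reasoning items are dropped in this step of the rollout.
            InferenceEvent::ReasoningDelta(_) => {}
            InferenceEvent::Finish => break,
        }
    }
    // An empty stream still gets a coherent open/close pair (assumption).
    if out.is_empty() {
        out.extend(OPENING);
    }
    out.extend([
        "response.output_text.done",
        "response.content_part.done",
        "response.output_item.done",
        "response.completed",
    ]);
    out
}
```

A stream of Start, two TextDeltas, and Finish yields nine names: three opening frames, two deltas, four closing frames.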
## CI
Also bundles the `cuda-check` runner-label fix from feedback on
commit 1859777: `runs-on: rpm` doesn't ship the CUDA toolkit so
cudarc's nvcc-version build script blew up. Switched to
`runs-on: cuda-13.0` per the existing labels.
## Scope cuts (documented in the modules)
- `previous_response_id` rejected at translate time with 400
(`code: chained_conversation_not_supported`) — stateful chained
conversations need a persistence layer we haven't built.
- Reasoning items dropped (no Qwen3 `<think>` routing yet).
- Single output item per response (one `"message"` carrying text);
`function_call` items reserved but not synthesised.
- Streaming events cover the core set; `response.in_progress`
and the web_search / image_generation event families are
out-of-scope.
22 new tests: 5 in cortex-core (envelope round-trips), 13 in
neuron::wire (request translator + projector + non-streaming
builder), 4 in neuron's tests/api.rs (route surface — 503 when no
candle, 400 on previous_response_id, 404 on missing model for
both stream and non-stream).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Run 244 caught a use-of-moved-value in a `#[cfg(feature = "cuda")]`
block that the default-feature workspace clippy/test gate had no
chance of seeing. The error appeared only when the RPM build
workflow compiled with `--features cuda` — 30+ minutes after push.
Add a `cuda-check` job to ci.yml that runs `cargo check -p neuron
--features cuda --all-targets` on the rpm runner (where nvcc /
cudarc build deps live; the generic `rust` runner doesn't have
them). Borrow-check only — we never run tests here, the runner
has no GPU. Same retry pattern as clippy/test.
Both SRPM jobs (`srpm-cortex`, `srpm-neuron`) now gate on
`cuda-check` so a CUDA build break can't reach the release pipeline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Step 1 refactor moved the InferenceEvent receiver wrap to *after*
the orchestration spawn in chat_completion_tp_stream, but the spawn
moves both `id` and `model_id` into its async closure (used heavily
by acquire_pool_lock, NCCL ops, and tracing). Result: borrowck
error E0382 use-of-moved-value on the wire_chat::project_chat_stream
call.
The non-CUDA build doesn't exercise this branch (it lives behind
`#[cfg(feature = "cuda")]`) which is why the workspace clippy/test
gate passed locally and on the regular CI workflow. The RPM build
workflow, which compiles with --features cuda, caught it (run 244
jobs 2/3/4 against beast / ampere / ada respectively, all the same
error).
Fix: snapshot `id` and `model_id` into `projector_id` /
`projector_model_id` before the spawn, use those at the projector
call site. The originals stay free to be moved into the closure.
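A minimal sketch of the snapshot-before-spawn pattern, using `std::thread::spawn` in place of the real `tokio::spawn` so it stays dependency-free. The `run` function and the label format are hypothetical; only the `projector_id` / `projector_model_id` names come from the fix.

```rust
use std::thread;

/// Returns (label built inside the spawned closure, label built by the caller).
pub fn run(id: String, model_id: String) -> (String, String) {
    // Snapshot before the spawn: the closure below takes ownership of the originals.
    let projector_id = id.clone();
    let projector_model_id = model_id.clone();

    let worker = thread::spawn(move || {
        // `id` and `model_id` are moved into this closure.
        format!("{id}@{model_id}")
    });

    let worker_label = worker.join().expect("worker panicked");
    // Touching `id` here would be E0382 (use of moved value);
    // the snapshots are still owned by this scope.
    let projector_label = format!("{projector_id}@{projector_model_id}");
    (worker_label, projector_label)
}
```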
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Step 1 of the OpenAI Responses API rollout. Pure refactor — no new
endpoints, no behaviour change on the wire. Lays the seam for
emitting Responses-shaped streaming events from the same harness
output as chat completions in Step 2.
- New `neuron::wire` module tree:
- `wire::event::InferenceEvent` — format-agnostic enum
(Start, TextDelta, ReasoningDelta, Finish) the candle harness
now emits as its native streaming currency.
- `wire::event::FinishReason` — typed reason that maps cleanly
onto OpenAI `finish_reason`, OpenAI Responses `status`, and
Anthropic `stop_reason` strings.
- `wire::openai_chat::project_chat_stream` — async task that
consumes an InferenceEvent receiver and produces a
ChatCompletionChunk receiver, stamping per-request metadata
(id, created, model_id) onto every chunk. Output matches the
pre-refactor wire shape bit-for-bit.
- candle.rs refactored to emit InferenceEvent on its internal
channel through all three streaming paths (CPU
run_inference_streaming, CUDA single-GPU stream_inference_via_worker,
CUDA TP chat_completion_tp_stream). The streaming functions lost
their id/created/model_id parameters since wire-format metadata
now lives in the projector.
- emit_delta + emit_delta_blocking simplified to single-purpose
TextDelta emitters with no wire-format coupling.
- chat_completion_stream wraps the InferenceEvent receiver in
wire_chat::project_chat_stream before returning so the
/v1/chat/completions HTTP handler keeps consuming
ChatCompletionChunks unchanged. External signature preserved.
Also fixes a pre-existing helexa-acp test race (three modules each
declared their own static LOCK for HOME mutation, so cross-module
parallelism flaked tests that read HOME at runtime). Consolidated
onto a single crate-wide path_util::ENV_LOCK.
122 helexa-acp tests + 44 neuron tests pass (5 new wire projection
tests). fmt + clippy --workspace -- -D warnings clean. Ran helexa-acp
suite 3x to confirm the env race is closed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 5. Zed clipboard/DnD images get forwarded as OpenAI
content-array messages on user turns.
- New MessageContent::MultiPart variant + MessagePart (Text|Image)
+ ImageData struct (mime_type, base64 data, optional uri).
- flatten_prompt now produces structured content: collapses to
Text when every block is text (some upstreams treat array-form
as vision-only and refuse on text-only models), otherwise
produces MultiPart preserving block order.
- OpenAI encoder emits `[{type:"text",text:…}, {type:"image_url",
image_url:{url:"data:{mime};base64,{data}"}}]` for MultiPart user
messages. Data URIs are used over remote `uri` because they
round-trip through every upstream we care about.
- prompt_capabilities.image = true at initialize so Zed actually
sends image blocks.
- compaction estimates ~512 tokens per image (the middle of the
Qwen3-VL / OpenAI detail range) so the budget tracker doesn't
pretend images are free.
- session/load replays image-bearing user turns by surfacing the
text parts verbatim and rendering each image as a "[image: {mime}
({n} bytes)]" placeholder chunk — Zed can show the prior text
context even though re-uploading the bytes through ACP isn't
meaningful for resume.
- 4 new tests: flatten produces MultiPart in block order, image-only
prompts still flatten to MultiPart, encoder emits the correct
array shape, text-only encoding stays as the string form.
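A simplified sketch of the collapse rule and the data-URI shape. Assumptions: the `Block` input type, the newline join for text-only prompts, and the `data_url` helper are illustrative; the real types carry more fields (for example the optional `uri`).

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Block {
    Text(String),
    Image { mime_type: String, data: String },
}

#[derive(Debug, Clone, PartialEq)]
pub enum MessagePart {
    Text(String),
    Image { mime_type: String, data: String },
}

#[derive(Debug, Clone, PartialEq)]
pub enum MessageContent {
    Text { text: String },
    MultiPart(Vec<MessagePart>),
}

/// Collapses to Text when every block is text, otherwise keeps block order.
pub fn flatten_prompt(blocks: &[Block]) -> MessageContent {
    if blocks.iter().all(|b| matches!(b, Block::Text(_))) {
        let texts: Vec<&str> = blocks
            .iter()
            .filter_map(|b| match b {
                Block::Text(t) => Some(t.as_str()),
                Block::Image { .. } => None,
            })
            .collect();
        // Joining with a newline is an assumption of this sketch.
        return MessageContent::Text { text: texts.join("\n") };
    }
    let parts = blocks
        .iter()
        .map(|b| match b {
            Block::Text(t) => MessagePart::Text(t.clone()),
            Block::Image { mime_type, data } => MessagePart::Image {
                mime_type: mime_type.clone(),
                data: data.clone(),
            },
        })
        .collect();
    MessageContent::MultiPart(parts)
}

/// Builds the `image_url.url` value from already-base64-encoded bytes.
pub fn data_url(mime_type: &str, base64_data: &str) -> String {
    format!("data:{mime_type};base64,{base64_data}")
}
```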
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related polish fixes for daily use:
- New `path_util` module expands `~`, `~/…`, `$HOME`, and `$HOME/…`
prefixes in every tool that takes a path (read_file, write_file,
edit_file, list_dir, bash cwd). The expansion is also applied to
the plan-mode write gate so `~/.local/share/helexa-acp/plans/…`
comparisons behave correctly regardless of which form the model
emits.
- `read_file` now falls back to `std::fs::read_to_string` when ACP's
`fs/read_text_file` errors out. Zed's workspace-scoped read was
the source of "model can't see ~/git/architecture/generic.md"
when the session cwd is a different project; the fallback lets
the agent pull in shared material that lives outside the active
workspace, the same way `list_dir` already does via local
`std::fs::read_dir`. Local fallback honours line/limit args.
The fallback also produces a combined error message when both ACP
and local-fs reads fail, so the model sees what actually broke
rather than just the ACP-side error.
14 new unit tests cover path_util's prefix matrix, fallback
success/failure paths, and the line/limit slicing in fallback.
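A minimal sketch of the prefix expansion, assuming the home directory is passed in explicitly so the function needs no environment access. `expand_home` is a hypothetical name; the real `path_util` API may differ.

```rust
use std::path::{Path, PathBuf};

/// Expands a leading `~`, `~/…`, `$HOME`, or `$HOME/…`; other inputs pass through.
pub fn expand_home(input: &str, home: &Path) -> PathBuf {
    for prefix in ["~", "$HOME"] {
        if input == prefix {
            return home.to_path_buf();
        }
        // Only expand when the prefix is followed by a separator, so
        // `~user/x` and `$HOMEDIR/x` are left untouched.
        if let Some(rest) = input.strip_prefix(prefix).and_then(|r| r.strip_prefix('/')) {
            // Trim extra slashes: joining an absolute path would replace `home`.
            return home.join(rest.trim_start_matches('/'));
        }
    }
    PathBuf::from(input)
}
```

Taking `home` as a parameter also keeps tests away from HOME mutation, which is the kind of cross-test race noted elsewhere in this log.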
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 4. Zed's model dropdown now lists every model from every
configured endpoint, and switching it routes the next prompt to a
new endpoint+model.
- Enable `unstable_session_model` on the agent-client-protocol dep
so SessionModelState / SetSessionModelRequest / ModelInfo are
available.
- Agent::new becomes async and calls Provider::list_models on every
provider at startup; per-endpoint failures warn-and-skip instead
of aborting the agent.
- With a single endpoint configured, model ids appear bare; with
multiple endpoints every id carries the `endpoint:` prefix so the
picker is unambiguous and parse_model_selector routes correctly.
- NewSessionResponse and LoadSessionResponse attach SessionModelState
with the session's current model id + the aggregated catalogue.
- session/set_model: validates the requested model id against
resolve_provider, mutates session.model_id, and persists so the
on-disk transcript reflects the new model.
Three new aggregate_models tests cover the prefixing rule (bare vs
multi-endpoint) and warn-and-skip on a failing endpoint.
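A sketch of the prefixing rule under simplifying assumptions: each endpoint is a `(name, listing result)` pair with plain string errors, and the decision to prefix is keyed off the number of configured endpoints rather than the number that answered. The signature is illustrative, not the crate's.

```rust
/// Aggregates model ids across endpoints; bare ids for a single endpoint,
/// `endpoint:model` ids when more than one endpoint is configured.
pub fn aggregate_models(endpoints: &[(&str, Result<Vec<String>, String>)]) -> Vec<String> {
    let multi = endpoints.len() > 1;
    let mut out = Vec::new();
    for (name, listing) in endpoints {
        match listing {
            Ok(models) => {
                for model in models {
                    out.push(if multi {
                        format!("{name}:{model}")
                    } else {
                        model.clone()
                    });
                }
            }
            // Warn and skip: one failing endpoint must not abort startup.
            Err(err) => eprintln!("warn: skipping endpoint {name}: {err}"),
        }
    }
    out
}
```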
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A new src/compaction.rs module projects rolling conversation history
into a token budget before each completion. Older tool results and
assistant prose get elided to one-line markers; system prompts, user
turns, and the last KEEP_TAIL=4 messages stay verbatim. tool_call_id
pairing is preserved so OpenAI strict-schema providers keep working.
Driven by a new per-endpoint `context_window` config field (also
HELEXA_ACP_CONTEXT_WINDOW for the env-only single-endpoint case).
When set, prompt budget = context_window - max_tokens - 512 (safety margin);
when unset, behaviour is unchanged.
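The budget arithmetic as a small sketch. The saturating subtraction (clamping to zero when max_tokens exceeds the window) is an assumption of this example, as are the names.

```rust
/// Safety margin reserved on top of the completion budget, in tokens.
pub const SAFETY_MARGIN_TOKENS: u32 = 512;

/// Returns the prompt budget, or None when no context window is configured
/// (in which case compaction is skipped and behaviour is unchanged).
pub fn prompt_budget(context_window: Option<u32>, max_tokens: u32) -> Option<u32> {
    context_window.map(|window| {
        window
            .saturating_sub(max_tokens)
            .saturating_sub(SAFETY_MARGIN_TOKENS)
    })
}
```

For a 32768-token window and max_tokens of 8192 this leaves 24064 tokens for the prompt.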
Without this, a 32 K Qwen3 dies with `prompt_too_long` after the
first few read_file results pile up in history — the symptom seen
in plan-mode dogfooding on beat.
10 new unit tests cover the compaction strategy and the prompt
budget arithmetic.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Plan mode is the most restrictive of the three session modes: bash is
disabled outright, writes are confined to a per-project plan directory
under $XDG_DATA_HOME/helexa-acp/plans/<basename>-<8hex>/, and reads /
list_dir are unrestricted. The system prompt is rebuilt at the top of
every round so a mid-turn switch into (or out of) plan mode takes
effect on the next streaming round, and plan mode appends a 3-option
menu instructing the model to stop and let the user pick how to
proceed once the plan is complete.
The project id is basename + FNV-1a-32 of the cwd so it stays stable
across runs (std's RandomState reseeds SipHash per process, and
DefaultHasher's algorithm is not guaranteed stable across Rust
releases), while still disambiguating multiple checkouts that share a
final path component.
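A sketch of the id derivation using the standard 32-bit FNV-1a constants. The `project_id` helper, the lossy path-to-bytes conversion, and the `root` fallback for paths with no final component are assumptions of this example.

```rust
use std::path::Path;

/// 32-bit FNV-1a: xor each byte into the hash, then multiply by the FNV prime.
pub fn fnv1a_32(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
    for &byte in bytes {
        hash ^= u32::from(byte);
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}

/// Builds `<basename>-<8hex>` from a working directory.
pub fn project_id(cwd: &Path) -> String {
    let basename = cwd
        .file_name()
        .map(|name| name.to_string_lossy().into_owned())
        .unwrap_or_else(|| "root".to_string()); // hypothetical fallback
    let hash = fnv1a_32(cwd.to_string_lossy().as_bytes());
    format!("{basename}-{hash:08x}")
}
```

Two checkouts such as `/x/app` and `/y/app` share the `app` basename but hash differently, so their plan directories do not collide.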
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three changes addressing "session stops mid-turn and disk store
doesn't update":
1. Per-round persistence. drive_prompt previously called
store::save() once at the very end of the turn. If the loop
stalled in a later round (long-running bash, upstream SSE that
never finished, wedged ACP roundtrip), earlier successful
rounds lived only in the spawned task's `new_turns` and never
reached disk. Move the extend-history + save into a helper
(extend_and_persist) and call it at the end of every loop
iteration. The post-loop save catches whatever the break paths
leave behind. Failure is logged, not propagated.
2. Cancel previous in-flight prompt on new session/prompt. The
handler used to overwrite SessionState.cancel with a fresh
token *without firing the old one*. A wedged prior prompt would
then live forever, holding session-state references and never
persisting. Now we fire the existing cancel under the lock
before installing the new token — the old task observes
is_cancelled() on its next .await and unwinds.
3. Per-round and per-tool log lines. drive_prompt now emits:
- INFO prompt round: streaming { round, of, history_turns }
- INFO dispatch tool { tool, tool_call_id }
- INFO dispatch tool complete { tool_call_id, is_error }
- INFO prompt round complete; persisting { round, turns }
- INFO prompt complete { stop_reason }
so the next hang shows up by line number in /tmp/helexa-acp.log
instead of as silence.
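Change 2 above can be sketched as follows. This is a std-only illustration: `CancelToken` is a hypothetical stand-in for the real cancellation token type, and `begin_prompt` is an invented name for the handler step that swaps tokens.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Minimal shared cancellation flag (stand-in for the real token type).
#[derive(Clone, Default)]
pub struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Default)]
pub struct SessionState {
    pub cancel: CancelToken,
}

/// Fires the previous prompt's token under the lock, then installs a fresh one.
pub fn begin_prompt(state: &Mutex<SessionState>) -> CancelToken {
    let mut guard = state.lock().expect("session lock poisoned");
    // Without this call the old task would never observe cancellation.
    guard.cancel.cancel();
    let fresh = CancelToken::default();
    guard.cancel = fresh.clone();
    fresh
}
```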
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
session/list and session/load were both implemented but clicking
a session in Zed's thread picker still left the agent panel
empty. Zed (and ACP clients in general) doesn't cache the
transcript for custom agent_servers entries — it only owns
conversation state for first-party agents. For custom agents the
expectation is that session/load returns successfully and the
agent then re-emits the conversation as a stream of session/update
notifications so the client can rebuild its view.
Implement that replay path:
- handle_load_session now returns (LoadSessionResponse, Vec<Message>)
so the caller has the history available after the in-memory
hydration finishes.
- The session/load closure responds to the request *first*, then
spawns a task that calls replay_history off the dispatch loop.
- replay_history walks the persisted history and emits one
session/update per turn:
Role::User → UserMessageChunk(text)
Role::Assistant text → AgentMessageChunk(text)
Role::Assistant tool → AgentMessageChunk for any accompanying
text + one ToolCall card per call (with
kind/title/raw_input rendered the same
way as the live dispatch path)
Role::Tool result → ToolCallUpdate matching the assistant's
call id, status: Completed, content set
to the result text
Role::System → skipped (system prompts aren't shown)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 3b only implemented the trailing half of resume: write
sessions to disk + handle session/load. But Zed (and any ACP
client) needs `session/list` to discover *which* session belongs
to the workspace it's reopening — without it, the client only
knows how to mint new sessions and resume never fires even
though the JSON sits ready on disk.
Add the missing pieces:
- store::list / list_in_dir — enumerate {id}.json under
sessions_dir(), optionally filter by cwd, sort recent-first.
Skips unparseable files with a warn rather than aborting.
- store::unix_to_iso8601 — RFC 3339 formatter for
SessionInfo.updated_at; pulls chrono in directly (already in
the dep tree transitively).
- agent::handle_list_sessions — wires the request to the store,
builds SessionInfo entries with derived titles (first user
turn, truncated to 60 chars).
- agent::initialize_response — advertise
session_capabilities.list = {} alongside the existing
load_session: true.
Verified end-to-end against the user's real hxa-1.json
(60-turn beat conversation): `session/list` returns the entry
with cwd, derived title, and ISO 8601 timestamp.
4 new store unit tests for list filtering, missing-dir
handling, unparseable-file skipping, and ISO 8601 formatting.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Zed restarts (frequent during helexa-acp dogfooding) used to lose
every conversation because we'd ignore the load_session capability
and treat every project-reopen as a fresh session/new. Persist
sessions to disk and honour session/load so the agent panel comes
back where it left off.
Storage layout:
$XDG_DATA_HOME/helexa-acp/sessions/{session_id}.json
Each file holds session_id, cwd, model_id, mode_id, full Message
history, plus created/updated timestamps. Atomic save via
tempfile+rename so a crash mid-write can't corrupt the store.
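A minimal sketch of the write-then-rename save using only std. Assumptions: the temp-file naming, the `save_atomic` name, and the explicit `sync_all` are illustrative; the real store also sanitises the session id before building the filename, which this sketch omits.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Writes `json` to `<dir>/<session_id>.json` via a temp file and rename,
/// so readers see either the old file or the new one, never a partial write.
pub fn save_atomic(dir: &Path, session_id: &str, json: &str) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let final_path = dir.join(format!("{session_id}.json"));
    // Same directory as the target so the rename stays on one filesystem.
    let tmp_path = dir.join(format!(".{session_id}.json.tmp"));
    {
        let mut tmp = fs::File::create(&tmp_path)?;
        tmp.write_all(json.as_bytes())?;
        tmp.sync_all()?; // flush to disk before the rename publishes it
    }
    fs::rename(&tmp_path, &final_path)
}
```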
Touch points:
- src/store.rs (new) — sessions_dir() resolution, save/load via
default and explicit-dir entry points (so unit tests don't have
to race on XDG_DATA_HOME). 5 unit tests cover round-trip,
not-found errors, atomic overwrite, tool-call/result preservation,
and the filename sanitiser's path-traversal handling.
- src/provider/mod.rs — Serialize/Deserialize on Role, Message,
MessageContent, ToolCall. MessageContent::Text turned into a
struct variant ({text: ...}) so internally-tagged JSON works.
- src/agent.rs — initialize_response advertises load_session: true;
handle_load_session reads the file, snapshots in-memory state,
returns LoadSessionResponse with the persisted mode preselected;
drive_prompt persists at the end of every prompt round, snapshotting
state under the session lock and doing the I/O outside it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.6-27B occasionally emits a <tool_call> body with the right
arguments but no top-level `name` field — observed in the field as
mkdir-style bash calls like
{"arguments":{"command":"mkdir -p .../doc/plan/{01-discovery,...}"}}
with no `name`. The agent had no tool to dispatch and surfaced a
Failed card; the model would then hang or retry the same shape.
Add a shape-based inference layer:
- tools::infer_tool_name(arguments) — given an `arguments` object
alone, return Some(name) when the key set uniquely identifies one
tool: `{command}` or `{command,cwd}` → bash, `{path,content}` →
write_file, `{path,old_text,new_text}` → edit_file. Ambiguous
shapes (`{path}` alone — could be read_file or list_dir) return
None so the agent still emits a Failed card rather than guessing.
- agent::try_repair_missing_name(raw) — parses a malformed body,
applies infer_tool_name, returns (name, args_json) on success.
- drive_prompt sweeps malformed_calls through this repair before
the Failed-card path. Recovered calls go into tool_buckets at
the next free index and dispatch through the normal tool loop.
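The inference table above can be sketched as below. To stay dependency-free this version takes the key names of the `arguments` object as a slice rather than a parsed JSON value; that signature is an assumption, not the crate's.

```rust
use std::collections::BTreeSet;

/// True when `actual` holds exactly the keys in `expected` (order-insensitive).
fn same_keys(actual: &BTreeSet<&str>, expected: &[&str]) -> bool {
    actual.len() == expected.len() && expected.iter().all(|key| actual.contains(key))
}

/// Infers the tool from the argument key set, or None when ambiguous.
pub fn infer_tool_name(keys: &[&str]) -> Option<&'static str> {
    let actual: BTreeSet<&str> = keys.iter().copied().collect();
    if same_keys(&actual, &["command"]) || same_keys(&actual, &["command", "cwd"]) {
        Some("bash")
    } else if same_keys(&actual, &["path", "content"]) {
        Some("write_file")
    } else if same_keys(&actual, &["path", "old_text", "new_text"]) {
        Some("edit_file")
    } else {
        // `{path}` alone could be read_file or list_dir: refuse to guess.
        None
    }
}
```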
10 new unit tests in tools::tests cover the inference table plus
the verbatim mkdir failure from the field log.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related fixes for cases where Qwen3 sometimes emits slightly-off
JSON inside <tool_call> blocks:
1. JSON repair pass in qwen3::parse_tool_call_body — strip up to
three trailing extra `}` characters (model overshoots its closing
braces), and hoist `name` out of `arguments` when it lands
nested instead of as a sibling. Both observed in the field; both
trivially repairable; both now dispatch as normal tool calls
instead of falling back to the malformed path.
2. New CompletionEvent::MalformedToolCall variant for the cases
repair can't fix. decode_stream now emits it instead of wrapping
the raw body in a TextDelta, and agent.rs surfaces each one as
a Failed SessionUpdate::ToolCall card (so Zed renders it as a
structured failure UI element rather than dumping the body
inline) plus a synthetic tool-call/tool-result history pair so
the model gets clear feedback for self-correction on the next
round.
Empty <tool_call></tool_call> blocks are now a no-op too (no
Malformed event), matching the existing empty-<think> behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
512 is too low for any modern coding model — clients that don't
explicitly set max_tokens get clipped responses with no diagnostic.
Bump the fallback at all four inference call sites (single-GPU
streaming + non-streaming, TP leader + non-leader) to 8192, which
fits comfortably within Qwen3-class context windows after a
typical agent prompt and lines up with what helexa-acp / a0 / curl
clients reasonably expect.
Clients that explicitly set max_tokens (now including helexa-acp
via HELEXA_ACP_MAX_TOKENS / per-endpoint TOML) override this.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The agent was sending max_tokens: None, letting cortex/neuron pick
its own default — which trips Zed's "Output Limit Reached" on long
turns. Add a per-endpoint max_tokens option in EndpointConfig
(TOML key and HELEXA_ACP_MAX_TOKENS env var for the single-endpoint
fallback) that the agent threads into every CompletionRequest by
endpoint name.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3 emits chain-of-thought as literal <think>...</think> tags
inside delta.content rather than via the separate reasoning_content
field — so without parsing the markers, the thinking shows up in
the message pane as ordinary text. Add a small ThinkParser in
qwen3.rs (same chunk-boundary discipline as ToolCallParser) and
stage it after the tool-call parser in decode_stream: text events
from the tool-call parser are fed in and split into TextDelta /
ReasoningDelta. Zed now renders thinking in its dedicated thought
UI; visible answer text stays in the message pane.
The parking-lot entry from the plan is now closed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The catch-all on_receive_dispatch handler was applying
respond_with_error to *every* Dispatch variant, including Response.
For Response variants, that call routes the error to the
ResponseRouter for the *outgoing* request — silently overwriting
the real reply from Zed with "Internal error: not implemented yet".
Every ACP roundtrip we issue (fs/read_text_file, fs/write_text_file,
session/request_permission, terminal/*) was therefore returning an
error to the tool runner regardless of what Zed actually responded.
The model saw uniformly-failing tools, gave up, and confabulated
plausible explanations.
Fix: pattern-match the Dispatch. Response → forward to its router
via respond_with_result. Request / Notification → keep the
"not implemented yet" error response as before.
Found via debug logs showing
WARN helexa_acp::agent: unhandled ACP message method="fs/read_text_file"
right before every tool failure.
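The shape of the fix, as a sketch. The `Dispatch` and `Action` enums here are hypothetical stand-ins with string payloads; the real ACP types and the `respond_with_result` / `respond_with_error` calls are not reproduced.

```rust
/// Simplified stand-in for the three kinds of inbound ACP message.
#[derive(Debug, Clone, PartialEq)]
pub enum Dispatch {
    Request { method: String },
    Notification { method: String },
    Response { result: String },
}

/// What the catch-all handler decides to do with a message.
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    /// Hand the client's reply to the router of our outgoing request.
    ForwardToRouter(String),
    /// Reject an inbound message we do not handle.
    NotImplemented(String),
}

pub fn handle_unmatched(message: Dispatch) -> Action {
    match message {
        // Replies to requests we issued must pass through untouched.
        Dispatch::Response { result } => Action::ForwardToRouter(result),
        Dispatch::Request { method } | Dispatch::Notification { method } => {
            Action::NotImplemented(method)
        }
    }
}
```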
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Editors that launch ACP agents (Zed today) don't reliably surface
the child's stderr — and `args` in an `agent_servers` config is
exec-args, not shell, so the usual `&>>` redirect trick doesn't
work. Add a HELEXA_ACP_LOG_FILE env var that, when set to an
absolute path, routes the tracing subscriber to append-write that
file (ANSI off) instead of stderr. RUST_LOG still controls levels.
Unopenable paths fall back to stderr with a warning so a typo
doesn't silence the agent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Diagnostic for "the tool ran but the model thinks it failed" cases.
Logs at debug level:
- exec_bash: terminal/create command + cwd, terminal/exit code/signal,
terminal/output bytes + truncated flag + 200-char snippet.
- dispatch_tool_call: 200-char snippet of every successful result
before it's folded back into history.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The OpenAI `tools` API field isn't load-bearing in this stack —
neuron's chat template renders only message.content, so tool
definitions sent that way never reach the model. Move both sides
of the tool conversation into the Qwen3 Hermes wire format the
model is actually trained on:
- Append a `# Tools` block to the system prompt describing every
available function (qwen3::render_tool_block).
- Parse `<tool_call>{json}</tool_call>` markers out of the streamed
content via a chunk-boundary-safe state machine (qwen3::ToolCallParser),
surfacing them as the existing CompletionEvent::ToolCall* events
so the agent loop doesn't change.
- Re-serialise assistant turns that called tools with inline
`<tool_call>` blocks and tool results as user turns wrapped in
`<tool_response>` (qwen3::render_assistant_with_tool_calls,
render_tool_response).
Verified against cortex+Qwen3.6-27B: the model produces a
well-formed `<tool_call>{"name":"list_dir","arguments":{"path":"/tmp"}}</tool_call>`
in response to a Hermes-formatted prompt.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Useful for diagnosing "the model isn't using tools" — confirming
that helexa-acp is in fact sending the `tools` array (and what
messages, system prompt, etc. accompany it) without having to
attach a packet capture upstream of cortex.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Stage 2 prompt told the model it had no tools, which models
trained for caution then dutifully repeat back ("Stage 2 build: no
tools available — I can't read files…"). Stage 3 ships tools in the
CompletionRequest.tools array, but the system message was still
overriding that. Update the default prompt to list the five tools
and instruct the model to use them rather than asking the user to
paste contents.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 3 introduces five tools (read_file, write_file, edit_file,
list_dir, bash) backed by ACP fs/* and terminal/* calls, a
ClientOps trait so the runner is mock-testable, two session modes
(default + bypassPermissions) with session/set_mode honouring them,
and a tool-call loop in the agent that streams the model, dispatches
each call, feeds results back into history, and re-enters until the
model finishes or MAX_TOOL_ROUNDS is hit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 2 lands the agent loop on top of the Stage 1 scaffold: session
state with per-session cancellation, a system-prompt builder honouring
HELEXA_ACP_SYSTEM_PROMPT_PATH / system_prompt_path TOML, and handlers
for initialize / session/new / session/prompt / session/cancel that
stream provider output back as session/update notifications. Verified
end-to-end against cortex from Zed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One assert! call grew past the line limit after the previous commits;
cargo fmt --all picked it up. No behaviour change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new workspace crate `helexa-acp` (binary, Apache-2.0) — the
start of "the missing ACP binary" for multi-endpoint LLM setups
mixing public APIs, private LAN deployments, and various wire
formats. Today it speaks OpenAI /v1/chat/completions; the
Provider trait is the seam that lets OpenAI Responses, Anthropic
/v1/messages, and other wire formats slot in later without touching
the agent loop.
The crate is intentionally self-contained — no dependencies on the
other workspace crates (cortex-core, cortex-gateway, neuron) — so a
future migration to a dedicated GitHub repo is a Cargo.toml-only
change. All deps come from crates.io.
This commit lands:
* `config.rs` — TOML config at $XDG_CONFIG_HOME/helexa-acp/config.toml
with multi-endpoint support (each `[[endpoints]]` declares its
name, base_url, wire_api, default_model, optional API key /
api_key_env). Falls back to env-only single-endpoint config when
no TOML exists (HELEXA_ACP_BASE_URL, HELEXA_ACP_MODEL, etc.). The
`endpoint:model` selector syntax is validated and tested.
* `provider/mod.rs` — `Provider` trait + provider-agnostic types
(`CompletionRequest`, `CompletionEvent`, `Message`, `ToolCall`,
`ToolSpec`, `Role`, `UsageStats`). Agent loop consumes these
without knowing the wire format on the other side.
* `provider/openai_chat.rs` — `OpenAIChatProvider` impl. Compatible
with cortex, LM Studio, Ollama (compat mode), OpenRouter, OpenAI
itself. Streams via reqwest + eventsource-stream + async-stream.
Surfaces text deltas, reasoning deltas (for models that emit
`reasoning_content`), tool-call lifecycle (start, args-delta,
completion), usage, finish reason. Cancellation-token aware.
* `main.rs` — tokio + stderr-only tracing-subscriber + Stdio
transport. Builds a provider per configured endpoint at startup,
surfacing config mistakes before the editor even initializes.
Currently responds to `initialize`; everything else stubs to
`not implemented yet` until the agent loop lands in the next
commit.
12 unit tests pass — encoder shape, decoder shape (text-only,
tool-call progressive, cancellation, malformed-chunk recovery),
config parsing (multi-endpoint TOML, env fallback, validation).
The `#![allow(dead_code)]` on `provider/mod.rs` is temporary — the
agent loop in the next commit reads every field. It's noted in the
module-level docstring so the next reader knows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously every inference Err — shape mismatch, NaN logits, tokenizer
error, missing handle — marked the model poisoned and rejected every
subsequent request until an operator unloaded and reloaded it. The benjy
incident on 2026-05-27 showed how this misfires: a concurrency bug
produced a `broadcast_add: shape mismatch` error that had nothing to
do with CUDA, but the model was taken down anyway.
Add `is_device_fault(err_chain: &str)` — a conservative classifier
that returns false only for errors we know are pre-kernel / CPU-side
(shape mismatches, NaN logits, tokenize/detokenize, missing handle,
DecodeStream, empty prompt). Everything else defaults to true so a
genuine driver fault still poisons.
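A sketch of the default-to-true classifier. The substring patterns are hypothetical, chosen to match the categories listed above; the real pattern list is not shown in this log.

```rust
/// Returns false only for error chains known to be CPU-side; anything
/// unrecognised is treated as a device fault so real driver errors still poison.
pub fn is_device_fault(err_chain: &str) -> bool {
    // Hypothetical patterns, one per category named in the commit message.
    const NON_DEVICE_PATTERNS: [&str; 7] = [
        "shape mismatch",
        "nan logits",
        "tokeniz",
        "detokeniz",
        "missing handle",
        "decodestream",
        "empty prompt",
    ];
    let lowered = err_chain.to_lowercase();
    !NON_DEVICE_PATTERNS
        .iter()
        .any(|pattern| lowered.contains(*pattern))
}
```

Defaulting to true keeps the classifier conservative: a new, unrecognised error message poisons the model rather than leaving a possibly broken device context in service.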
Applied at all six poisoning sites:
- chat_completion CUDA worker path
- chat_completion CPU spawn_blocking path
- chat_completion_stream CUDA worker path
- chat_completion_stream CPU spawn_blocking path
- chat_completion_tp non-streaming wrapper
- chat_completion_tp_stream spawned task
Each site now logs either "model marked poisoned" (device fault) or
"model NOT marked poisoned" (non-device) so the journal makes the
classification visible. Tests cover the known non-device patterns and
a couple of real CUDA driver messages.
Pairs with the inference_lock commit (c59da83): together they
eliminate both the cause of the spurious-poisoning we just observed
(the shape mismatch) AND the over-reaction to it (the unconditional
poison). Each fix is independently useful but the combination is
what makes the system actually robust to concurrent agent workloads.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two concurrent chat_completion requests against the same single-GPU
model could interleave their `clear_kv_cache → forward(chunk0) →
forward(chunk1) → ...` sequences. The device-worker channel serialises
individual jobs but not the sequence boundary, so the cache could end
up holding tokens from one request while another's mask was sized for
its own prompt — producing a shape mismatch mid-prefill.
Observed on benjy 2026-05-27 18:41:05: agent-zero's `memorize memories`
and `memorize solutions` extensions fired 4ms apart against
Qwen/Qwen3-8B (a0's utility model). Both prefilled into the same KV
cache, and request a08b4a's chunk 0 forward produced scores of shape
[1, 32, 512, 1024] against a mask of [1, 1, 512, 512] — broadcast_add
failed, both requests bubbled the error up, both flipped the model to
poisoned.
Add `LoadedModel.inference_lock: tokio::sync::Mutex<()>`, mirroring
the TpLoadedModel.pool lock that the TP path already held. Acquire
it at the start of `chat_completion` and inside the spawned task of
`chat_completion_stream` (so the role chunk goes out immediately and
only the inference work queues behind the lock).
The CPU branch uses `blocking_lock` from inside spawn_blocking; the
CUDA branch uses async `.lock().await` inside tokio::spawn.
Throughput impact: zero. The GPU was already serialised at the
device-worker channel — multiple requests just produced corrupt KV
cache state instead of clean serial throughput. The lock makes the
existing serialisation honest.
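A std-only sketch of why the lock has to cover the whole sequence. Assumptions: `std::sync::Mutex` and threads stand in for `tokio::sync::Mutex` and tasks, the per-operation `kv_cache` mutex stands in for the device-worker channel, and the cache is just a list of request ids.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

pub struct LoadedModel {
    /// Each cache operation is serialised on its own, like the worker channel.
    kv_cache: Mutex<Vec<u32>>,
    /// Held across clear + every forward of one request.
    inference_lock: Mutex<()>,
}

impl LoadedModel {
    pub fn new() -> Self {
        Self { kv_cache: Mutex::new(Vec::new()), inference_lock: Mutex::new(()) }
    }

    /// Runs clear_kv_cache followed by `chunks` forwards; returns the final cache.
    pub fn run_request(&self, request_id: u32, chunks: usize) -> Vec<u32> {
        // Without this guard, another request could clear or append mid-sequence.
        let _sequence = self.inference_lock.lock().expect("inference lock poisoned");
        self.kv_cache.lock().expect("cache lock poisoned").clear();
        for _ in 0..chunks {
            self.kv_cache.lock().expect("cache lock poisoned").push(request_id);
            thread::yield_now(); // invite interleaving between chunks
        }
        self.kv_cache.lock().expect("cache lock poisoned").clone()
    }
}

/// Fires `requests` concurrent requests and returns each one's view of the cache.
pub fn run_concurrently(requests: u32, chunks: usize) -> Vec<Vec<u32>> {
    let model = Arc::new(LoadedModel::new());
    let handles: Vec<_> = (0..requests)
        .map(|id| {
            let model = Arc::clone(&model);
            thread::spawn(move || model.run_request(id, chunks))
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("request thread panicked"))
        .collect()
}
```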
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CUDA driver failures propagate as Err through `?` and become
`Ok(Err(InferenceError::Other(_)))` from the spawned task — those are
real device faults and still poison the model. Tokio JoinError is
different: it fires on Rust-level panic (tokenizer bug, sampler bug,
serialisation, the UTF-8 slice that landed in commit bd04d7f before
the fix) or task cancellation. Those don't touch the device context,
so failing the one request without tearing down the model is correct.
Two sites changed:
- chat_completion's CPU spawn_blocking handler — JoinError no longer
sets loaded.poisoned.
- chat_completion_tp's tokio::spawn wrapper — JoinError no longer
sets tp_for_marker.poisoned. The inner-Err case still does.
Each path logs the cause (panicked / was cancelled / ended abnormally)
explicitly so the journal makes the new behaviour obvious — search for
"model NOT marked poisoned" to find these events.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When BPE byte-fallback splits a multi-byte UTF-8 char (e.g. an emoji)
across multiple tokens, the previous "decode the cumulative token list,
byte-slice the delta against a stored prefix" pattern would panic with
'start byte index N is not a char boundary; it is inside <emoji>'.
The cause: at step N the tokenizer renders the partial bytes as U+FFFD
(3 bytes); at step N+1 it can decode the complete codepoint (e.g. 4
bytes for 🌫). `decoded_prefix.len()` from step N then lands inside the
codepoint in step N+1's `full` string, and `&str[start..]` panics.
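
The byte arithmetic can be reproduced with the standard library alone. In this sketch `delta_since` is a hypothetical helper that uses the checked `str::get` instead of `&full[start..]`, so the off-boundary case shows up as `None` rather than a panic.

```rust
/// Returns the text appended since the previous step, or `None` when
/// `prefix_len` does not fall on a char boundary of `full`.
pub fn delta_since(full: &str, prefix_len: usize) -> Option<&str> {
    // `str::get` is the non-panicking form of `&full[prefix_len..]`.
    full.get(prefix_len..)
}

/// Step N: one ASCII byte plus U+FFFD (3 bytes) gives a prefix length of 4.
pub const STEP_N: &str = "a\u{FFFD}";
/// Step N+1: the same ASCII byte plus the complete 4-byte emoji, 5 bytes total.
pub const STEP_N_PLUS_1: &str = "a🌫";
```

Byte index 4 of `STEP_N_PLUS_1` lies inside the emoji (bytes 1 to 4), so `delta_since(STEP_N_PLUS_1, STEP_N.len())` yields `None` where the unchecked slice would panic.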
Replace with tokenizers' `DecodeStream::step(id)` which maintains an
internal byte buffer across token boundaries and only emits when a
clean codepoint completes. Applied at all three SSE emission sites:
- stream_inference_via_worker (single-GPU CUDA stream)
- chat_completion_tp_stream's spawned task (TP stream)
- run_inference_streaming (CPU stream)
The shared emit helper splits into emit_delta (async, mpsc::send) and
emit_delta_blocking (sync, mpsc::blocking_send) so each path keeps its
existing send semantics. The old emit_chunk helper that did the
unsafe full-decode-and-slice is removed entirely.
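
The buffering idea can be illustrated without the tokenizers crate. The sketch below is not how `DecodeStream` is implemented (that works on token ids); it is a hypothetical `Utf8Emitter` over raw bytes that shows the same contract: hold back incomplete bytes, emit only complete codepoints.

```rust
/// Hypothetical helper: accumulates raw bytes and releases only complete
/// UTF-8 text. Bytes that can never become valid UTF-8 would stall it, which
/// a real implementation has to handle; this sketch does not.
pub struct Utf8Emitter {
    pending: Vec<u8>,
}

impl Utf8Emitter {
    pub fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Appends `bytes` and returns the longest valid prefix, if any.
    pub fn push(&mut self, bytes: &[u8]) -> Option<String> {
        self.pending.extend_from_slice(bytes);
        // Length of the leading run that is already valid UTF-8.
        let valid_len = match std::str::from_utf8(&self.pending) {
            Ok(text) => text.len(),
            Err(err) => err.valid_up_to(),
        };
        if valid_len == 0 {
            return None;
        }
        // Keep the incomplete tail buffered for the next call.
        let tail = self.pending.split_off(valid_len);
        let ready = std::mem::replace(&mut self.pending, tail);
        Some(String::from_utf8(ready).expect("prefix validated above"))
    }
}
```

Feeding the emoji's four bytes in two halves yields nothing on the first call and the whole character on the second, so no caller ever slices a partial codepoint.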
Observed on beast 2026-05-27 17:49:55 — model emitted 🌫 in a tool-call
response after a long agent-zero session; the spawned TP stream task
panicked at candle.rs:2648. The model itself stayed healthy (no CUDA
fault), only the one streaming request died.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prevents the OOM-during-prefill → poisoned-context → 5-minute-reload
cycle observed on beast under agent-zero workloads. Three changes,
all keyed off env-driven knobs so an operator can tune without a
rebuild:
1. Chunked prefill (NEURON_PREFILL_CHUNK_TOKENS, default 512). The
initial forward is split into N-token windows, each with a
monotonically growing offset. KV cache accumulates across chunks
exactly as it would under one big prefill; only the final chunk's
logits are kept for sampling. Activation memory now scales with
chunk size instead of prompt length, so a 13k-token prompt stops
holding tens of GB of intermediate activations live at once.
Wired into all six prefill call sites:
- run_inference / run_inference_streaming (CPU path)
- run_inference_via_worker / stream_inference_via_worker (CUDA
single-GPU through device worker)
- chat_completion_tp_inner / chat_completion_tp_stream (TP via
WorkerPool)
Three helpers — chunked_prefill_local, chunked_prefill_via_worker,
chunked_prefill_tp — own the loop shape so the chunking semantics
stay identical across paths. Per-chunk debug log shows progress.
2. Max prompt length (NEURON_MAX_PROMPT_TOKENS, default 16384).
Requests above the cap return a structured 400 with
`code: prompt_too_long` rather than going through the prefill and
discovering the limit by OOMing partway through. New
InferenceError::PromptTooLong variant.
3. Minimum free VRAM gate (NEURON_MIN_FREE_VRAM_MB, default 1500).
If `vram_free_mb` is below the threshold at request start (e.g.
another concurrent request is mid-prefill), reject with a clean
503 + `code: insufficient_vram` rather than starting work that
will OOM. New InferenceError::InsufficientVram variant. CPU loads
(vram=0 sentinel) skip this check.
All three gates fire BEFORE any device work, so a rejected request
costs ~one tokenisation pass and never touches the worker thread —
poison cascades from rejected work are now impossible.
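
The gates and the chunk windows can be sketched as pure functions. Everything here is illustrative: `Limits`, `AdmissionError`, `admit` and `chunk_windows` are hypothetical names, the defaults are the ones quoted above, and `Option<u64>` stands in for the `vram=0` CPU sentinel. Reading the `NEURON_*` variables is left out so the sketch stays deterministic.

```rust
/// Hypothetical mirror of the two rejection reasons described above.
#[derive(Debug, PartialEq)]
pub enum AdmissionError {
    PromptTooLong { tokens: usize, max: usize },
    InsufficientVram { free_mb: u64, min_mb: u64 },
}

/// The three knobs, with the defaults quoted in the text.
pub struct Limits {
    pub prefill_chunk_tokens: usize,
    pub max_prompt_tokens: usize,
    pub min_free_vram_mb: u64,
}

impl Default for Limits {
    fn default() -> Self {
        Self {
            prefill_chunk_tokens: 512,
            max_prompt_tokens: 16_384,
            min_free_vram_mb: 1_500,
        }
    }
}

/// Runs both gates before any device work. `vram_free_mb` is `None` for CPU
/// loads, which skip the VRAM gate (assumption: models the vram=0 sentinel).
pub fn admit(
    prompt_tokens: usize,
    vram_free_mb: Option<u64>,
    limits: &Limits,
) -> Result<(), AdmissionError> {
    if prompt_tokens > limits.max_prompt_tokens {
        return Err(AdmissionError::PromptTooLong {
            tokens: prompt_tokens,
            max: limits.max_prompt_tokens,
        });
    }
    if let Some(free_mb) = vram_free_mb {
        if free_mb < limits.min_free_vram_mb {
            return Err(AdmissionError::InsufficientVram {
                free_mb,
                min_mb: limits.min_free_vram_mb,
            });
        }
    }
    Ok(())
}

/// Splits a prompt into `(offset, len)` windows with a growing offset.
pub fn chunk_windows(prompt_tokens: usize, chunk_tokens: usize) -> Vec<(usize, usize)> {
    let chunk = chunk_tokens.max(1); // guard against a zero chunk size
    (0..prompt_tokens)
        .step_by(chunk)
        .map(|offset| (offset, chunk.min(prompt_tokens - offset)))
        .collect()
}
```

For a 1300-token prompt at the default chunk size this gives three windows, the last one 276 tokens long; only that final window's logits would be kept for sampling.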
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sccache occasionally fails mid-compile with race-condition errors that
clear on a re-run without any code changes. Rather than tracking that
down right now, wrap the two affected steps in a bash loop that retries
up to three times with a 5-second pause. Real failures still surface;
they just take ~10s longer to fail.
fmt is left as a single invocation — it's a one-shot syntactic check,
not a build, and isn't subject to the same sccache races.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Production deployments that want neuron-internal debug detail (e.g.
trim_device_pool's per-clear-kv line, slab inserts/drops) override
RUST_LOG explicitly via systemd. Defaulting to debug for the whole
neuron target produced a lot of journal volume that wasn't useful in
the common case.
beast already sets RUST_LOG=debug in
/etc/systemd/system/neuron.service.d/local.conf, so beast's verbosity
is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
q5k produced NaN logits on Qwen/Qwen3.6-27B under candle TP=2 (sampler
fell over with "logits unhealthy nan: 248320/248320"). q6k is the
quant that worked well in production under mistral.rs on the same
hardware, so it's the right baseline for verifying the mempool-trim
fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cudarc's stream-ordered memory pool retains freed blocks (cuMemFreeAsync
returns memory to the device's default mempool, not to the OS), so
mem_get_info under-reports free VRAM between requests. With
Qwen/Qwen3.6-27B TP=2, the second consecutive chat completion saw
~4.5 GB of "missing" free VRAM and either OOMed or tripped cuBLAS into
CUBLAS_STATUS_INTERNAL_ERROR depending on quant.
Add a cuda-gated trim_device_pool helper that, after each successful
clear_kv_cache, synchronizes the context and calls cuMemPoolTrimTo(pool,
0) against the device's default mempool. Failures (no async-alloc
support, transient driver errors) are non-fatal and log at debug. The
before/after free-VRAM delta is logged so an operator can correlate the
trim with the next request's prefill VRAM.
ConcatKvCache::reset() in candle-nn 0.10.2 already drops its tensors
correctly; the leak was strictly at the cudarc pool layer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md.
CLAUDE.md:
- New "Per-device worker thread (neuron)" section under Key design
decisions, covering the three load-bearing properties (context
locality, drop safety, poisoning blast radius), the CPU-fallback
exception, and pointers to the canonical narrative in
crates/neuron/src/harness/device_worker/mod.rs's module doc-comment.
- New 2026-05-27 addendum dating the migration and naming the four
PR commits (Phase 1: 081b532, Phase 2: b179204, Phase 3: 76ab24d,
Phase 4: b4f3576). Same convention as the 2026-04-15 and 2026-05-18
addenda.
README.md:
- One paragraph in "Node setup" noting the per-device thread pattern
with a pointer to CLAUDE.md and the device_worker module.
No code changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final structural slice of the per-device CUDA context-ownership
refactor. The four remaining spawn_blocking sites that did CUDA work
on the leader are gone:
- Single-GPU GGUF load (`load_arch_gguf` spawn_blocking) →
`Job::LoadGguf` dispatched on the worker.
- Single-GPU dense load (`load_arch_dense` spawn_blocking) →
`Job::LoadDense` on the worker.
- TP shard load (`WorkerPool::load_dense_shard` spawn_blocking) →
`Job::TpLoadShard`. The dispatch handler reads `state.nccl.comm()`
directly — no cross-thread `Arc<Comm>` transfer, no `SendComm`
wrapper for this path.
The Phase 2 / Phase 3 bridges that moved freshly-built models across
the channel boundary (`Job::TransferIn`, `Job::TransferInTp`,
`Job::CloneLeaderComm`) are removed. Models are now constructed on
the worker thread directly; the slab gets populated by `insert_arch` /
the inline `tp_models.insert` in dispatch handlers.
What this phase preserves:
- CPU loads still use `tokio::task::spawn_blocking` against
`Arc<Mutex<ModelArch>>`. There's no CUDA context to own on CPU and
channel overhead would only add latency. Four `spawn_blocking`
references remain in `candle.rs` (load_arch_gguf, load_arch_dense,
chat_completion, chat_completion_stream) and all are deliberate
CPU-only fallback.
- Public API unchanged. `Harness::load_model`, `chat_completion`,
HTTP routes all keep identical signatures.
What this phase removes:
- `SendComm` wrapper is no longer used in the load path (the Phase 3
bridge that justified it). It remains in `nccl_state.rs` for the
Phase 1–3 era and any future cross-thread Comm move; consider
deleting in a follow-up.
- `Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm` and
their handle convenience methods deleted.
- The `leader_device` parameter on `load_dense_shard` is now
underscore-prefixed; it is unused since the worker has its own bound
device. Removing the
arg outright is a public-API change; keeping the underscore prefix
preserves the signature and signals deadness without churn.
Helper relocation:
- `LlamaDense::from_parts` is a new pub(crate) constructor so the
worker-thread loader can build a `LlamaDense` without going through
the original `load_arch_dense` async function.
- `check_dense_config_supported` is bumped to `pub(crate)` for the
same reason.
Sweep verified: `grep -rn spawn_blocking crates/neuron/src/harness/`
returns only CPU-fallback hits in `candle.rs` + doc-comment references
to the old design. All four leader-side CUDA `spawn_blocking` sites
are gone.
fmt + clippy clean; 37 lib tests + all integration tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third slice of the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The
leader's `NcclState`, every `Comm::all_reduce` issued by the TP layers,
the leader-side KV cache reset, and the TP forward step itself now all
run on the per-device worker thread — the same OS thread that bound
the leader's `CudaContext` at startup.
What this phase changes:
- `Job` gains `NcclInit`, `NcclSanity`, `CloneLeaderComm` (Phase 3
bridge — Phase 4 removes), `TransferInTp`, `DropTp`, `TpClearKv`,
`TpForwardLogits`. Plus a new `TpHandle(u64)` opaque key.
- `DeviceWorkerState` gains `nccl: NcclState` and
`tp_models: HashMap<TpHandle, Box<TpLeaderModel>>` (+ counter).
- `WorkerPool` loses its `leader_nccl` field; gains a
`leader_worker: Arc<DeviceWorkerHandle>` passed at construction.
`init_nccl`, `nccl_sanity_check`, `load_dense_shard`,
`generate_step`, `clear_kv_cache` all route their leader-side ops
through `Job::Nccl*` / `Job::Tp*` instead of spawn_blocking against
a Mutex-wrapped state. `generate_step` returns `Vec<f32>` instead
of a device-resident `Tensor` — the worker copies logits to CPU
before reply so the async caller can sample on a CPU candle
tensor with zero device-context touch.
- `TpLoadedModel.leader_model: Arc<Mutex<TpLeaderModel>>` → opaque
`leader_handle: TpHandle`. The boxed `TpLeaderModel` lives in the
worker thread's slab; both the model's CUDA tensors and the
embedded `Arc<Comm>` clones release on the same thread that
allocated them (the Drop semantics constraint cudarc forces).
- `Job::CloneLeaderComm` is a Phase 3 bridge: the TP shard load still
runs in spawn_blocking and needs the leader's `Arc<Comm>` to build
the row-parallel layers' AllReduce ops. The Job clones the Comm
out of the worker's NcclState and ships it back as `SendComm`.
Phase 4 deletes this bridge when the load itself moves onto the
worker.
- `Job::NcclInit` and `Job::NcclSanity` are ungated by `cuda` so the
no-cuda `NcclState` stubs (which reply with `cuda_feature_not_enabled`)
still flow through the same channel uniformly; the cuda-only
TP variants (CloneLeaderComm, Transfer/Drop/Clear/Forward Tp)
remain gated.
What this phase doesn't touch (yet):
- TP shard load itself — still spawn_blocking, bridged via
`CloneLeaderComm`. Phase 4 moves it to `Job::TpLoadShard` and
reads `state.nccl.comm()` directly inside the worker.
- Single-GPU model loads — still spawn_blocking, transferred via
`Job::TransferIn`. Phase 4 moves them.
- `device_vram_mb` / `cuda_mem_mb` / `log_construction_complete`
helpers — still present, used inside spawn_blocking load closures.
Phase 4 cleanup folds them into `dispatch.rs`.
`tp/mod.rs::WorkerPool::spawn` gained a required
`leader_worker: Arc<DeviceWorkerHandle>` argument. Three external
callers were updated: `CandleHarness::load_tp` (passes the cached
device worker), `main.rs::tp_smoke` (spawns a fresh worker), and
the two `tp_worker_lifecycle*.rs` integration tests.
Public API unchanged. fmt + clippy clean; 37 lib tests + all
integration tests pass. CUDA-only TP integration smoke deferred to
the next deploy on beast.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second slice of the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The two
spawn_blocking sites in `chat_completion` and `chat_completion_stream`
now route through the device worker thread on CUDA loads. CPU loads
keep the existing spawn_blocking + `Arc<Mutex<ModelArch>>` path; there's
no context to own and the channel hop would only add latency.
What this phase changes:
- `Job` gains `TransferIn`, `DropArch`, `ClearKv`, `ForwardLogits`. The
worker's dispatch state grows a `HashMap<ArchHandle, Box<ModelArch>>`
slab and a `next_handle` counter for minting opaque handles.
- `LoadedModel.arch: Arc<Mutex<ModelArch>>` →
`Option<Arc<Mutex<ModelArch>>>`, plus a new
`arch_handle: Option<ArchHandle>` field. The two are
mutually exclusive: CUDA loads set `arch_handle = Some(_)` after
transferring the boxed arch into the worker's slab; CPU loads keep
`arch = Some(_)` for the legacy spawn_blocking path.
- New `run_inference_via_worker` and `stream_inference_via_worker`
drive the prefill + decode loop by sending `Job::ForwardLogits` per
step; the worker copies the resulting `[vocab]` logits to a
CPU-side `Vec<f32>` before reply, so the async caller never holds a
device-resident tensor. `apply_repeat_penalty` and
`LogitsProcessor::sample` run on a CPU candle tensor; no context
binding side-effects on tokio worker threads.
- `logits_health_slice(&[f32])` complements the existing
`logits_health(&Tensor)` so the new worker paths can compute
health stats directly from the CPU vec.
- `unload_model` for the single-GPU CUDA path now sends
`Job::DropArch { handle }` to the worker so the `Box<ModelArch>`
drops on the thread that allocated its CUDA tensors. The `Drop` runs
with the bound context, freeing memory on the right context.
What this phase doesn't touch (yet):
- TP forward, TP load, NCCL bring-up — still on spawn_blocking. Phase 3.
- Single-GPU model load — still spawn_blocking, followed by a
`Job::TransferIn` to move the freshly-built `ModelArch` into the
worker slab. Phase 4 moves the load itself onto the worker thread
and eliminates the bootstrap TransferIn.
- The `device_vram_mb` / `cuda_mem_mb` helpers — still present and
used by the construction-time logs running inside spawn_blocking
loads. Phase 4 cleanup folds them into `dispatch.rs`.
Public API unchanged. fmt + clippy clean; 37 lib tests + all
integration tests pass.
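
The slab-plus-opaque-handle shape from this phase can be sketched generically. `Slab`, `transfer_in` and `drop_arch` are hypothetical names chosen to echo the `Job` variants; the real dispatch state and `ModelArch` are not reproduced here.

```rust
use std::collections::HashMap;

/// Opaque key handed to callers; they never see the model itself.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ArchHandle(u64);

/// Hypothetical owner-side slab: lives on one thread, so every boxed model
/// is created, used and dropped there.
pub struct Slab<M> {
    models: HashMap<ArchHandle, Box<M>>,
    next_handle: u64,
}

impl<M> Slab<M> {
    pub fn new() -> Self {
        Self { models: HashMap::new(), next_handle: 0 }
    }

    /// Stores the model and mints a fresh handle for it.
    pub fn transfer_in(&mut self, model: Box<M>) -> ArchHandle {
        let handle = ArchHandle(self.next_handle);
        self.next_handle += 1;
        self.models.insert(handle, model);
        handle
    }

    /// Looks the model up for a forward or cache-clear step.
    pub fn get_mut(&mut self, handle: ArchHandle) -> Option<&mut M> {
        self.models.get_mut(&handle).map(|boxed| &mut **boxed)
    }

    /// Drops the model on the owning thread; returns false if already gone.
    pub fn drop_arch(&mut self, handle: ArchHandle) -> bool {
        self.models.remove(&handle).is_some()
    }
}
```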
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First slice of the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. Adds the
infrastructure for a dedicated OS thread per CUDA device that owns the
device's `CudaContext` for the daemon's lifetime, and routes the 8
async-context `device_vram_mb()` call sites in candle.rs through it.
What this phase changes:
- New module `harness/device_worker/` (mod.rs, jobs.rs, dispatch.rs).
`DeviceWorkerHandle::spawn(idx)` creates a named OS thread
(`cuda-dev-N`), binds `CudaContext::new(idx)` once at startup, and
enters a dispatch loop reading `Job`s off a `std::sync::mpsc` channel.
Replies cross back via `tokio::sync::oneshot::Sender` so async callers
await without parking a tokio worker.
- Two Job variants: `QueryVram` and `Shutdown`. Phases 2–4 add Forward,
ClearKv, NCCL init/sanity, and load variants.
- `LoadedModel` and `TpLoadedModel` gain a `worker` field populated at
load time by a new `CandleHarness::ensure_device_worker(idx)` method
that lazily spawns + caches one worker per device index.
- Per-model `query_vram()` convenience method on both struct types so
the 8 call sites in chat_completion / chat_completion_stream /
chat_completion_tp_inner / chat_completion_tp_stream become
`loaded.query_vram().await` (or `tp.query_vram().await`) — same field
values logged, just sourced from the owner thread instead of the
caller thread.
What this phase doesn't touch (yet):
- Forward, kv-cache clear, model load, NCCL — still on `spawn_blocking`.
Phase 2 moves the single-GPU forward + clear; Phase 3 moves the TP
forward + NCCL bring-up; Phase 4 moves the loads and deletes the now-
unused `device_vram_mb` / `cuda_mem_mb` helpers.
- Public API — unchanged. `Harness::load_model`, `chat_completion`,
HTTP routes all keep identical shapes.
Tests:
- 5 new unit tests in `device_worker/mod.rs::tests` cover spawn → query
→ shutdown round-trip, thread naming, post-shutdown submit returns
`Gone`, poisoned flag fast-rejects, and concurrent jobs drain across
a Shutdown. CPU build (the only one CI runs) is enough to exercise
channel mechanics.
- All 37 lib tests + all integration tests pass; fmt + clippy clean.
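
A dependency-free sketch of the channel mechanics, under stated assumptions: replies use `std::sync::mpsc` instead of `tokio::sync::oneshot`, no CUDA context is bound, and the VRAM numbers are placeholders passed in by the caller. `DeviceWorkerSketch` and its methods are hypothetical names; `Gone` and the `cuda-dev-N` thread name come from the text.

```rust
use std::sync::mpsc;
use std::thread::{self, JoinHandle};

/// The two Phase 1 job kinds; the reply channel travels inside the job.
pub enum Job {
    QueryVram { reply: mpsc::Sender<(u64, u64)> },
    Shutdown,
}

/// Returned when the worker thread has already exited.
#[derive(Debug, PartialEq)]
pub struct Gone;

pub struct DeviceWorkerSketch {
    tx: mpsc::Sender<Job>,
    join: Option<JoinHandle<()>>,
}

impl DeviceWorkerSketch {
    /// Spawns a named OS thread that owns the (placeholder) device state.
    pub fn spawn(idx: usize, placeholder_vram_mb: (u64, u64)) -> std::io::Result<Self> {
        let (tx, rx) = mpsc::channel::<Job>();
        let join = thread::Builder::new()
            .name(format!("cuda-dev-{idx}"))
            .spawn(move || {
                // A real worker would bind its device context once, here.
                for job in rx {
                    match job {
                        Job::QueryVram { reply } => {
                            let _ = reply.send(placeholder_vram_mb);
                        }
                        Job::Shutdown => break,
                    }
                }
            })?;
        Ok(Self { tx, join: Some(join) })
    }

    pub fn thread_name(&self) -> Option<String> {
        let handle = self.join.as_ref()?;
        handle.thread().name().map(str::to_string)
    }

    /// Sends a job and waits for the owner thread's answer.
    pub fn query_vram(&self) -> Result<(u64, u64), Gone> {
        let (reply, answer) = mpsc::channel();
        self.tx.send(Job::QueryVram { reply }).map_err(|_| Gone)?;
        answer.recv().map_err(|_| Gone)
    }

    /// Asks the worker to stop and waits for the thread to exit.
    pub fn shutdown(&mut self) {
        let _ = self.tx.send(Job::Shutdown);
        if let Some(join) = self.join.take() {
            let _ = join.join();
        }
    }
}
```

Once the thread has exited its receiver is dropped, so any later `query_vram` fails at `send` and surfaces as `Gone`.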
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additive diagnostics that turn the 2026-05-27 q5k Qwen3.6-27B
incident from "guess at KV cache / quant sizes" into "read the
journal":
1. Construction-complete summary in TpQwen3_5ForCausalLM::load and
TpQwen3ForCausalLM::load. After the last "after layer N" log fires,
each rank emits a single info line with: free_mb/total_mb (the
number that drops by ~9 GB between per-layer and first-request on
beast, with no inference traffic), every resolved config knob
(vocab_size, hidden_size, num_layers, head_dim, num_kv_heads,
max_position_embeddings), and a per-token KV-cache byte estimate.
For Qwen3-Next also includes the linear/full-attention layer split
so the hybrid architecture's cache cost is unambiguous.
2. Logits health snapshot on sample failure. Today the failure logs
"A weight is negative, too large or not a valid number" with no
context — was it a NaN cascade, an Inf, a negative weight?
`logits_health(&logits)` computes nan/pos_inf/neg_inf/neg counts
plus finite_min/max/mean on the failure path (zero cost on the
success path) and emits a warn line just before the wrapper's
terminal "failed, model marked poisoned" log. Wired into both the
prefill and decode sample sites of the non-streaming AND streaming
TP chat paths.
3. VRAM snapshot at prefill complete + every decode step. The
"prefill complete" info line now carries vram_free_mb so the
activations + KV growth from the prefill itself is visible. The
per-step trace line gets vram_free_mb too, so an operator running
with RUST_LOG=trace can watch headroom shrink token by token.
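
A sketch of what a health snapshot over a CPU-side logits slice might compute. The field names follow the counters listed in item 2, but the struct layout is hypothetical, and counting `neg` as finite negative values is an assumption, since the text does not define it.

```rust
/// Hypothetical summary of a logits vector, computed only on the failure path.
#[derive(Debug, Default, PartialEq)]
pub struct LogitsHealth {
    pub nan: usize,
    pub pos_inf: usize,
    pub neg_inf: usize,
    /// Assumption: finite values below zero.
    pub neg: usize,
    pub finite_min: Option<f32>,
    pub finite_max: Option<f32>,
    pub finite_mean: Option<f32>,
}

pub fn logits_health_slice(logits: &[f32]) -> LogitsHealth {
    let mut health = LogitsHealth::default();
    let mut finite_count = 0usize;
    let mut finite_sum = 0f64; // accumulate in f64 to limit rounding drift
    for &value in logits {
        if value.is_nan() {
            health.nan += 1;
        } else if value == f32::INFINITY {
            health.pos_inf += 1;
        } else if value == f32::NEG_INFINITY {
            health.neg_inf += 1;
        } else {
            if value < 0.0 {
                health.neg += 1;
            }
            finite_count += 1;
            finite_sum += f64::from(value);
            health.finite_min = Some(health.finite_min.map_or(value, |m| m.min(value)));
            health.finite_max = Some(health.finite_max.map_or(value, |m| m.max(value)));
        }
    }
    if finite_count > 0 {
        health.finite_mean = Some((finite_sum / finite_count as f64) as f32);
    }
    health
}
```

A fully NaN vector reports `nan == logits.len()` with no finite statistics, which distinguishes a NaN cascade from a lone Inf or a negative weight.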
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operators can now define tier aliases in models.toml:
    [aliases]
    "helexa/small" = "Qwen/Qwen3-1.7B"
    "helexa/balanced" = "Qwen/Qwen3-8B"
    "helexa/large" = "Qwen/Qwen3.6-27B"
A client request for `model: "helexa/small"` is resolved to the concrete
model id at routing time. The gateway also rewrites the proxied body's
`model` field to the concrete id so neuron sees a name that matches its
loaded handle (otherwise the harness rejects the request).
Motivated by the finger-in-the-wind benchmark: same "what's the capital
of Georgia" probe runs in 2.5s on the 1.7B vs 6.7s on the 27B with
identical correctness. Aliases let clients pick a latency tier without
hardcoding model ids, and let operators swap targets without changing
client code.
Changes:
* cortex-core: `ModelCatalogue` gains `aliases: HashMap<String, String>`
+ `resolve_alias(&str) -> &str`. Unit tests cover the basic
resolution + TOML round-trip.
* cortex-gateway:
* `RouteDecision` gains `resolved_model_id: String`. `router::resolve`
consumes aliases at entry and threads the concrete id through.
* Handlers (chat_completions, completions, anthropic_messages
streaming + non-streaming) rewrite the body's `model` field with
`rewrite_model_in_body` before proxying, using the resolved id
for metrics labels, LRU touch, and the body itself.
* `/v1/models` (Pass 4) emits each alias as its own entry mirroring
the target's `loaded` flag, feasible_on, and locations — clients
browsing the endpoint see both names and can pick either.
* `models.toml` declares the three tier aliases; `models.example.toml`
documents the section as opt-in.
* Integration tests verify: end-to-end alias→concrete request flow,
alias surfacing in /v1/models, and no-op fall-through for
non-alias model ids.
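
A minimal sketch of single-level alias resolution, assuming a struct shaped like the `aliases` field above. `ModelCatalogueSketch` is a hypothetical name and the explicit lifetime is one way to write the signature, not necessarily the one used in cortex-core.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the catalogue: alias name -> concrete model id.
#[derive(Default)]
pub struct ModelCatalogueSketch {
    pub aliases: HashMap<String, String>,
}

impl ModelCatalogueSketch {
    /// Returns the concrete id for an alias, or the input unchanged.
    /// The result may borrow from either `self` or `model_id`, so both
    /// share one named lifetime; plain elision would tie it to `self` only.
    pub fn resolve_alias<'a>(&'a self, model_id: &'a str) -> &'a str {
        self.aliases
            .get(model_id)
            .map(String::as_str)
            .unwrap_or(model_id)
    }
}
```

Resolution here is one level deep: an alias that points at another alias is returned as-is rather than followed.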
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds wait_for_ready() that polls /health until activation.state flips
to "ready" (or the NEURON_LOAD_TIMEOUT deadline). Inserted between
probe_health and the is_loaded/trigger_load step.
Before this, running validate-neuron.sh right after deploy.sh raced
the background pre-warm and failed in ~9 ms with "neuron not reachable"
(the pre-2026-05-26 build) or with a partial-load error (the new
build, where the listener binds before default_models finishes).
The poll prints the in_progress model on each tick so an operator
watching the log can see which model is delaying readiness. Backs off
from 2s to 10s after the first few iterations so a long TP load
doesn't spam.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The poller now fetches /health alongside /models on each neuron and
stashes the activation snapshot on NodeState. The /v1/models handler
gains a Pass 3 that synthesises Loading locations from each neuron's
activation.in_progress and activation.pending lists, so a catalogued
model that's mid-prewarm surfaces as `status: "loading"` rather than
appearing absent (loaded=false, locations=[]).
Without this, a client polling /v1/models during a beast restart sees
Qwen3.6-27B disappear for the ~5 minutes the q5k load takes, then
reappear. Now it stays visible the whole time with a clear status.
Adds ModelStatus::Loading to cortex-core. The router's per-node priority
loop gets an explicit (no-op) arm: Loading models aren't routable yet,
and falling through to the catalogue cold-load path is the existing
race — no worse than before, but tagged as a known follow-up needing
neuron-side in-flight tracking on /models/load.
New test_poller_captures_activation_from_health exercises the full
round-trip: mock neuron with empty /models but a pre_warming /health
→ poller writes node.activation. Common test helpers gain
spawn_mock_neuron_with_models_and_health and default_health_response.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled changes addressing the 2026-05-26 validate-neuron failure
where a fresh deploy of beast had /health unreachable for ~5 minutes
while Qwen3.6-27B q5k materialised, even though systemd reported the
unit as active.
1. main.rs no longer awaits load_default_models before binding axum.
The listener binds first; pre-warm runs in a spawned background
task that holds a read lock on the harness registry for the
duration of its sequential load loop. Concurrent on-demand
/models/load and /v1/chat/completions traffic still flow.
2. /health gains an `activation` field carrying:
    state        pre_warming | ready
    pending      model ids queued but not started
    in_progress  model id currently loading (Option)
    completed    model ids loaded successfully this activation
    failed       [{model_id, error}] for failed entries
The field is `#[serde(default)]` so a pre-change cortex polling a
new neuron — or vice versa — keeps working.
`ActivationTracker` (new module `neuron::activation`) owns the
RwLock-wrapped state; load_default_models takes a tracker reference
and updates it per-model. NeuronState holds an Arc clone for the
/health handler.
Tests updated to construct trackers and assert state transitions
(empty noop, two failures → ready with both in `failed`).
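
The state transitions can be sketched with a `std::sync::RwLock`. This is an illustration only: `ActivationTrackerSketch`, `begin` and `finish` are hypothetical names, the fields mirror the `activation` payload listed above, and serialisation is omitted.

```rust
use std::sync::RwLock;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ActivationState {
    PreWarming,
    Ready,
}

/// Mirrors the fields of the `activation` payload.
#[derive(Debug, Clone, PartialEq)]
pub struct ActivationSnapshot {
    pub state: ActivationState,
    pub pending: Vec<String>,
    pub in_progress: Option<String>,
    pub completed: Vec<String>,
    pub failed: Vec<(String, String)>, // (model_id, error)
}

pub struct ActivationTrackerSketch {
    inner: RwLock<ActivationSnapshot>,
}

impl ActivationTrackerSketch {
    /// An empty default list is ready immediately.
    pub fn new(default_models: &[&str]) -> Self {
        let pending: Vec<String> = default_models.iter().map(|m| m.to_string()).collect();
        let state = if pending.is_empty() {
            ActivationState::Ready
        } else {
            ActivationState::PreWarming
        };
        Self {
            inner: RwLock::new(ActivationSnapshot {
                state,
                pending,
                in_progress: None,
                completed: Vec::new(),
                failed: Vec::new(),
            }),
        }
    }

    /// Moves a model from `pending` to `in_progress`.
    pub fn begin(&self, model_id: &str) {
        let mut snap = self.inner.write().unwrap();
        snap.pending.retain(|m| m.as_str() != model_id);
        snap.in_progress = Some(model_id.to_string());
    }

    /// Records the load result; flips to ready once nothing is pending.
    pub fn finish(&self, model_id: &str, result: Result<(), String>) {
        let mut snap = self.inner.write().unwrap();
        snap.in_progress = None;
        match result {
            Ok(()) => snap.completed.push(model_id.to_string()),
            Err(error) => snap.failed.push((model_id.to_string(), error)),
        }
        if snap.pending.is_empty() {
            snap.state = ActivationState::Ready;
        }
    }

    /// What a /health handler would read.
    pub fn snapshot(&self) -> ActivationSnapshot {
        self.inner.read().unwrap().clone()
    }
}
```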
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds asset/neuron/{beast,benjy,quadbrat}.toml — per-host neuron.toml
files keyed by the first dot-component of the host. deploy.sh now
rsyncs the matching file to /etc/neuron/neuron.toml on each neuron and
stops+starts the service so default_models is re-read.
Headline model per host (drives /v1/models output immediately after a
clean deploy):
    beast     Qwen/Qwen3.6-27B  (q5k, tp=2, devices=[0,1])
    benjy     Qwen/Qwen3-8B     (bf16, devices=[0])
    quadbrat  Qwen/Qwen3-1.7B   (bf16, devices=[0])
Removes the need to follow deploy.sh with `validate-neuron.sh beast
Qwen/Qwen3.6-27B q5k 2` to surface the 27B in the catalogue — the
neuron loads it itself on activation.
The neuron loop now mirrors the cortex flow (stop → install/upgrade →
sync config → start) so config-only changes pick up on subsequent
deploys; previously a no-package-change deploy would silently leave
the host on the old default_models.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifetime elision fails when a function has two reference parameters
and returns a borrow: rustc can't infer whether the MutexGuard's
lifetime ties to `pool` or `model_id`. The non-CUDA build skipped
this code path (cfg-gated), so the error only surfaced on the GPU
build at https://git.lair.cafe/helexa/cortex/actions/runs/162.
The guard borrows the pool, so name the lifetime on `pool` and the
return type. `model_id` keeps its independent (elided) lifetime.
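
The shape of the fix, sketched with `std::sync::Mutex` (the guard type in the real code may differ). `PoolState` and `lock_pool` are hypothetical names. With both lifetimes elided, rustc rejects the signature because it cannot pick which input the returned guard borrows from; naming `'p` on `pool` and on the return type resolves it.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, MutexGuard};

/// Hypothetical pool contents, keyed by model id.
pub type PoolState = HashMap<String, u32>;

/// The guard borrows `pool` only, so `'p` appears on `pool` and the return
/// type. `model_id` keeps its own, independent (elided) lifetime.
pub fn lock_pool<'p>(pool: &'p Mutex<PoolState>, model_id: &str) -> MutexGuard<'p, PoolState> {
    pool.lock()
        .unwrap_or_else(|_| panic!("pool lock poisoned while serving {model_id}"))
}
```

Because the guard is tied to `pool` alone, the `model_id` string can be dropped while the guard is still held.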
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two failure modes from the 2026-05-26 beast incident:
1. `unload_all_models` looped through models calling `unload_model`,
logging individual failures at warn. The cumulative effect was a
single warn line for the failed unload then "shutdown complete" —
no signal that the model was actually still loaded. Now each unload
is bounded by a 20s timeout, failures escalate to error, and a
summary "leaving N model(s) loaded" line fires when anything is
stuck so the operator knows the OS will reclaim VRAM after exit.
2. Returning `Ok(())` from `main` after the unload sweep dropped the
tokio runtime, which then waited indefinitely on a CUDA-stuck
spawn_blocking thread (the journal's "Stack trace of thread
2951308" — spinning on `cuCtxGetCurrent`). systemd's TimeoutStopSec
fired 2 minutes later, SIGABRT, core dump. Replacing the return
with `std::process::exit(0)` skips the runtime drain and hands the
OS a clean exit code; stuck threads get reaped with the process.
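
A sketch of the bounded-unload sweep, assuming OS threads and `recv_timeout` in place of the async timeout used in neuron. `run_bounded` and `unload_all` are hypothetical names. The final `std::process::exit(0)` is deliberately left out of the code so the sketch can be run and tested.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `work` on its own thread and gives up waiting after `limit`.
/// A timed-out thread is left running; only the wait is abandoned.
pub fn run_bounded<T, F>(limit: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may be gone if we already timed out; ignore that.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).ok()
}

/// Attempts every unload with the same bound and returns how many are stuck,
/// i.e. the N in a "leaving N model(s) loaded" summary.
pub fn unload_all(unloads: Vec<Box<dyn FnOnce() + Send + 'static>>, limit: Duration) -> usize {
    let mut stuck = 0;
    for unload in unloads {
        if run_bounded(limit, unload).is_none() {
            stuck += 1;
        }
    }
    stuck
}
```

After such a sweep, exiting the process directly is what reaps any thread that never returned; returning from `main` would wait on it instead.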
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Once a CUDA driver error has hit a forward or kv-cache call, the
device's context is unrecoverable in-process — subsequent kernels can
hang (the failure mode seen on beast on 2026-05-26), return garbage,
or trip another illegal-address. The harness now marks the model
poisoned on any forward / spawn_blocking / TP-task failure, refuses
further inference against it with a clear "unload and reload" error,
and surfaces `status: "poisoned"` on `/models` so an operator running
`curl beast:13131/models` (or cortex polling) can see the bad state.
Without this, a single OOM on a too-large prefill quietly turned every
subsequent request into a stuck wait on the pool lock; with it, the
first request fails fast with the driver error in the journal and the
client gets a usable 5xx instead of a hung connection.
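
The fail-fast behaviour can be sketched with a single atomic flag. `GuardedModel` and `InferenceErrorSketch` are hypothetical names; the `"poisoned"` status string is from the text, while `"loaded"` for the healthy case is an assumption.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, PartialEq)]
pub enum InferenceErrorSketch {
    /// Refused up front: the model must be unloaded and reloaded.
    Poisoned,
    /// The forward call itself failed; carries the driver message.
    Device(String),
}

pub struct GuardedModel {
    poisoned: AtomicBool,
}

impl GuardedModel {
    pub fn new() -> Self {
        Self { poisoned: AtomicBool::new(false) }
    }

    pub fn status(&self) -> &'static str {
        if self.poisoned.load(Ordering::Acquire) {
            "poisoned"
        } else {
            "loaded"
        }
    }

    /// Rejects immediately when poisoned; otherwise runs `forward` and marks
    /// the model poisoned if it reports a fault.
    pub fn infer<F>(&self, forward: F) -> Result<String, InferenceErrorSketch>
    where
        F: FnOnce() -> Result<String, String>,
    {
        if self.poisoned.load(Ordering::Acquire) {
            return Err(InferenceErrorSketch::Poisoned);
        }
        forward().map_err(|fault| {
            self.poisoned.store(true, Ordering::Release);
            InferenceErrorSketch::Device(fault)
        })
    }
}
```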
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every "starting" log line now carries vram_free_mb / vram_total_mb for
the request's serving device (the leader device on TP). On the 2026-05-26
incident this would have made the 14k-token prefill OOM diagnosable from
the first log line: with ~412 MB free, that prompt was never going to
fit, and the operator could have caught the imbalance before the CUDA
context got poisoned.
`device_vram_mb` mirrors the existing helper in tp_qwen3_5.rs and is
kept separate to avoid coupling the inference path to the TP module.
TpLoadedModel gains a `leader_device: Device` clone so the request
path reads the device without locking the leader model (which would
contend with an in-flight forward).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every chat completion path (single-GPU + TP, streaming + non-streaming)
now opens an `info_span!("chat", req_id=…, model=…)`. The fmt subscriber
prefixes every event with that span so `grep req_id=…` over journalctl
reconstructs one request even when dozens overlap.
Every path also emits a terminal log line on both success ("done", with
prompt_tokens/completion_tokens/finish_reason/total_ms) and failure
("failed", with full anyhow chain + total_ms). Failures used to vanish
silently — a request that hit a CUDA OOM left "starting" in the journal
and no further trace.
New `acquire_pool_lock` helper replaces the bare `tp.pool.lock().await`
in both TP paths. It warns at 2s ("still waiting on pool lock") and
re-warns every 2s thereafter, so queued requests stuck behind a
deadlocked holder are visible immediately instead of looking like idle
silence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker rank failures were already surfaced at WARN, but the leader's
own forward Result::Err was silently coerced to a `leader_ok=false`
bool. When the leader and a worker both fail together — the typical
shape of a CUDA OOM cascading into an illegal-address — the journal
showed only the worker side and an operator had to guess what hit
rank 0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comprehensive sweep across cortex-gateway's request handling. Every
failure path now emits exactly one structured warn (or error) event
on the cortex side with the wire-level detail an operator needs;
the API response carries only a generic message plus, where useful,
the upstream status code.
proxy.rs::forward_request:
- warn on network failure (network error, target URL).
- warn on upstream non-2xx (status, target URL). Streaming body still
passes through to the client; we just can't snippet without
breaking the stream.
- warn on response-build failure.
- ProxyError::into_response no longer interpolates the inner error
into the API body — generic "upstream request failed" / "failed to
build response" instead.
handlers.rs::chat_completions, handlers.rs::completions:
- warn on missing model field, with handler= label.
- warn on route resolve failure with model + error chain. The
user-facing 404 keeps the RouteError Display string (which is
short, informative, and contains no internal detail beyond the
model id and config'd node names).
handlers.rs::anthropic_messages:
- warn on invalid Anthropic body, on translated-OpenAI serialise
failure (which is internal), on route resolve, on upstream network
error, on upstream non-2xx (with 512-char body snippet for parse
errors), on upstream body read, on response parse.
- All warns share consistent field shape: handler, model, node, url,
status / error / body as applicable.
- API response messages are now uniformly generic.
- Adds an info-level "proxying request" log on the non-streaming
path so successful proxies are also visible.
handlers.rs::proxy_with_metrics:
- still calls e.into_response() but proxy::forward_request already
warn'd at the wire layer, so no double-log here.
Tests:
- All 32 existing unit tests + 22 gateway integration tests + 4
new router tests pass.
- Tests that asserted on the "no healthy nodes" / "not found"
strings still match because RouteError messages are preserved
in the 404 user-facing path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled bugs surfaced after 9b0ed0b:
1. url::Url::parse("http://host:port").to_string() normalises the
empty path to "/", so rewrite_loopback_host was returning
"http://beast:13131/". Downstream callers then did
format!("{endpoint}/v1/chat/completions") and produced a
double-slash path that neuron's axum router 404'd with an empty
body. Strip the trailing slash in the rewriter so the endpoint is
a clean base string for concatenation.
2. The anthropic_messages handler returned the upstream's empty body
to the API caller as `"upstream error: "` with no journal log on
the cortex side. Operators had no way to see what happened. Add
warn-level tracing on both upstream failure paths (network error
and non-2xx) with model, node, target URL, status, and a 512-char
body snippet. The API response now carries just `"upstream
returned <status>"` — the implementation detail lives in the log.
Updates the two existing rewrite tests for the no-trailing-slash
output.
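A minimal, std-only sketch of the failure in item 1 and its fix.
`clean_base` and `join_endpoint` are hypothetical names used only for
illustration; the actual rewriter works on a parsed `url::Url`, not on
raw strings.

```rust
/// Strip trailing slashes so the endpoint is safe to use as a base
/// for `format!("{endpoint}/v1/...")`-style concatenation.
/// Hypothetical helper, not the real rewrite_loopback_host.
fn clean_base(endpoint: &str) -> &str {
    endpoint.trim_end_matches('/')
}

/// Join a base endpoint and an absolute path the way the callers do.
fn join_endpoint(endpoint: &str, path: &str) -> String {
    format!("{endpoint}{path}")
}
```

Joining the unstripped `http://beast:13131/` with
`/v1/chat/completions` produces the double-slash path; stripping first
produces the clean one.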
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Neuron hardcodes its bind_url as `http://localhost:13131` (it can't
reliably know its own externally-resolvable name). When cortex runs
on a different host than the neuron it's routing to, blindly
proxying to that URL hits localhost on the cortex box instead of the
neuron.
Cortex already knows each neuron's reachable host from cortex.toml.
After fetching the inference URL from `/models/{id}/endpoint`, if
the host is a loopback name (localhost / 127.0.0.1 / 0.0.0.0 / ::1),
swap it for the configured neuron host. Preserve the port and path
from neuron's URL so a future harness serving inference on a
different port than the management API still works.
Adds `url` (already a transitive dep via reqwest) as a direct
dep for the URL parsing.
Tests cover: localhost rewrite, distinct inference port preservation,
non-loopback passthrough, malformed input.
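The rewrite rule can be sketched without the `url` crate. This is an
illustrative std-only version, not the shipped code: `rewrite_loopback`
and `is_loopback` are hypothetical names, the hand-rolled parsing only
handles `scheme://host[:port][/path]`, and returning `None` for
malformed input is an assumption.

```rust
/// True for the loopback names listed above. The bracketed form is
/// how an IPv6 literal appears inside a URL authority.
fn is_loopback(host: &str) -> bool {
    matches!(host, "localhost" | "127.0.0.1" | "0.0.0.0" | "::1" | "[::1]")
}

/// Swap a loopback host for `neuron_host`, keeping scheme, port and path.
/// Returns None when the input has no `scheme://host` shape.
fn rewrite_loopback(url: &str, neuron_host: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    // The authority ends at the first '/'; the remainder is the path.
    let (authority, path) = match rest.find('/') {
        Some(i) => rest.split_at(i),
        None => (rest, ""),
    };
    // Split host and port. A bracketed IPv6 literal contains colons itself.
    let (host, port) = if let Some(end) = authority.find(']') {
        let (h, p) = authority.split_at(end + 1);
        (h, p.strip_prefix(':'))
    } else {
        match authority.rsplit_once(':') {
            Some((h, p)) => (h, Some(p)),
            None => (authority, None),
        }
    };
    if host.is_empty() {
        return None;
    }
    if !is_loopback(host) {
        // Non-loopback hosts pass through untouched.
        return Some(url.to_string());
    }
    let port = port.map(|p| format!(":{p}")).unwrap_or_default();
    Some(format!("{scheme}://{neuron_host}{port}{path}"))
}
```

With `neuron_host = "beast"`, `http://localhost:13131` becomes
`http://beast:13131`, and a distinct inference port such as
`http://127.0.0.1:8080/v1` keeps its port and path.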
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a %posttrans scriptlet to cortex.spec that:
- Removes the stale /etc/firewalld/services/helexa-cortex.xml left
behind by an older packaging stream that named the service
`helexa-cortex` and (in some build streams) carried wrong port
numbers (9301/9302/9304).
- Walks every active firewalld zone; for any zone where the legacy
helexa-cortex service was enabled, swaps it out for the new
`cortex` service (which the RPM ships at
/usr/lib/firewalld/services/cortex.xml with the right
31313/31314 ports).
- Reloads firewalld so the change takes effect without operator
intervention.
Affected hosts were silently dropping inbound connections to cortex
on 31313: the active zone advertised a helexa-cortex service that
listed unrelated ports, masking the correctly-defined vendor cortex
service.
helexa-neuron is unaffected: that spec already ships the vendor
service as helexa-neuron.xml (namespaced from day one) and no
stale /etc override files exist in the fleet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TpQwen3_5ForCausalLM::lm_head is now a MaybeQuantLinear. When the
load spec has quant set and tie_word_embeddings is false, lm_head's
(vocab_size, hidden_size) weight is quantized in-situ at load time
along with all the per-layer linears. The non-tied case on
Qwen3.6-27B saves ~1.7 GB per rank vs bf16 (248320 x 5120 x 2
bytes = 2.54 GB -> ~870 MB at Q5K's 5.5 bits/weight) and shaves a small amount of
decode latency from the per-token logits matmul.
Tied case (tie_word_embeddings=true) keeps the lm_head plain even
when quant is set — quantizing the shared tensor would corrupt the
embedding lookup, and the tied case already gets the memory win
from only holding one copy.
This is the last MaybeQuantLinear hookup in the Qwen3-Next TP path.
The dense Qwen3 path (tp_qwen3.rs) is unchanged — defer until it's
the bottleneck for a model that actually needs TP at consumer scale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The M > 8 threshold from 8e-2d activated forward_via_f16 on the test
case (M=30) and slightly regressed prefill (143 -> 133 T/s). The
dequant cost (~30 MB f16 per linear * ~480 calls per prefill = ~200 ms)
eats the cuBLAS GEMM speedup at small M.
Move the crossover to M > 64 so short prefills (typical for the
validate probe) stay on the GGUF GEMV kernel where per-call cost is
comparable but the dequant tax is zero. Long prefills still get the
dequant-then-cuBLAS-GEMM path where the GEMM scaling amortises the
fixed dequant cost.
Doesn't close the gap to mistralrs's 423 T/s on Q5K prefill — that
needs either a dequant cache (gives back the ISQ memory win) or a
fused dequant+gemm kernel. Both larger projects.
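A sketch of the crossover rule, assuming the decision depends only on
the row count M. `QuantPath`, `DEQUANT_GEMM_MIN_ROWS` and
`pick_quant_path` are hypothetical names; the real dispatch sits inside
MaybeQuantLinear::forward.

```rust
/// Which matmul path a quantized linear takes for one call.
#[derive(Debug, PartialEq, Eq)]
enum QuantPath {
    /// GGUF GEMV kernel on the quantized blocks (no dequant cost).
    Gemv,
    /// Dequantise the weight to f16 once, then run a dense GEMM.
    DequantGemm,
}

/// Crossover from this commit: strictly more than 64 rows.
const DEQUANT_GEMM_MIN_ROWS: usize = 64;

/// `m` is the number of activation rows (tokens) in this call.
fn pick_quant_path(m: usize) -> QuantPath {
    if m > DEQUANT_GEMM_MIN_ROWS {
        QuantPath::DequantGemm
    } else {
        QuantPath::Gemv
    }
}
```

The M=30 test case and single-token decode both stay on the GEMV path;
only prefills longer than 64 tokens pay the dequant cost.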
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MaybeQuantLinear::forward picks between two QMatMul paths:
- M > 8 (prefill): QMatMul::forward_via_f16 dequantises the weight
once into f16 and runs a real cuBLAS-backed GEMM. The dequant cost
is fixed per call, so it's amortised across the M tokens.
- M <= 8 (decode): QMatMul::forward uses candle's GGUF GEMV kernel
on the quantized blocks directly. Requires f32 inputs so we still
cast in/out at the boundary in that arm.
Earlier 8e-2c sent everything through the GGUF GEMV kernel, which
is excellent at GEMV (decode) but doesn't have a real batched GEMM
path — prefill regressed ~4x. This restores prefill to roughly the
bf16 cuBLAS GEMM throughput while keeping the decode gain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
candle's QTensor::cuda_fwd requires f32 inputs — its on-the-fly
GGUF dequantize accumulates in f32. The model dtype flowing into
MaybeQuantLinear::forward is bf16, so QMatMul::forward errored with
"unexpected dtype, expected: F32, got: BF16".
Wrap the Quant arm to cast the activation to f32 before the matmul
and cast the result back to the input dtype. The cast is a single
launch on the activation tensor (small relative to weight traffic);
it's the price of in-situ GGUF-style quantization, and what mistralrs
does inside its own Linear wrapper.
The Plain arm is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-existing guard in candle.rs rejected any spec.quant on the TP
path with "GGUF quantized models are not supported in the TP path" —
written when quant only ever meant GGUF. With 8e-1/8e-2 in,
quant != None on the TP path triggers in-situ quantization of the
loaded safetensors shards. resolve_dense_files only looks for
safetensors so a GGUF-source-file model with TP still errors out
cleanly downstream.
validate-neuron.sh: rebuild the load payload incrementally so
tp_size > 1 + non-empty quant produces both fields. Same script now
covers all four combos (single/TP × dense/ISQ).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- LoadDenseShard RPC gains an optional `quant` string field.
- WorkerPool::load_dense_shard takes a `quant: Option<String>`,
passes it via the RPC to workers and via parse_quant_string to
the leader's local load.
- The Qwen3-Next TP load chain (ForCausalLM → Model → DecoderLayer
→ Attention / GatedDeltaNet / MLP) takes `quant: Option<GgmlDType>`
end-to-end, calling Column/RowParallelLinear::load_with_quant.
- The fused in_proj_qkv inside TpQwen3_5GatedDeltaNet is now a
MaybeQuantLinear so it also picks up quantization.
- parse_quant_string accepts q4_0/q4_1/q5_0/q5_1/q8_0/q8_1, q2k..q8k
(with or without underscore), and f16/bf16/f32. Empty / None means
no quantization.
Callers from candle.rs forward spec.quant through pool.load_dense_shard.
This means a `quant = "q5k"` in models.toml now flows end-to-end to a
QTensor-backed QMatMul for every per-rank linear in the Qwen3-Next
TP path. Leaves lm_head and the small replicated bias/log tensors in
their loaded dtype (Stage 8e-3).
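The accepted spellings can be sketched as a plain string match. This is
an illustration only: `QuantKind` stands in for candle's `GgmlDType`,
`parse_quant` is a hypothetical name, the K-quant set is assumed to be
q2k/q3k/q4k/q5k/q6k/q8k, and returning an error for unknown strings is
an assumption about the real parse_quant_string.

```rust
/// Stand-in for the quantization dtype enum (assumption, not candle's type).
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QuantKind {
    Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1,
    Q2K, Q3K, Q4K, Q5K, Q6K, Q8K,
    F16, BF16, F32,
}

/// Empty or missing means "no quantization"; K-quants are accepted
/// with or without the underscore.
fn parse_quant(s: Option<&str>) -> Result<Option<QuantKind>, String> {
    use QuantKind::*;
    let raw = match s {
        None | Some("") => return Ok(None),
        Some(v) => v,
    };
    let kind = match raw {
        "q4_0" => Q4_0,
        "q4_1" => Q4_1,
        "q5_0" => Q5_0,
        "q5_1" => Q5_1,
        "q8_0" => Q8_0,
        "q8_1" => Q8_1,
        "q2k" | "q2_k" => Q2K,
        "q3k" | "q3_k" => Q3K,
        "q4k" | "q4_k" => Q4K,
        "q5k" | "q5_k" => Q5K,
        "q6k" | "q6_k" => Q6K,
        "q8k" | "q8_k" => Q8K,
        "f16" => F16,
        "bf16" => BF16,
        "f32" => F32,
        other => return Err(format!("unknown quant type: {other}")),
    };
    Ok(Some(kind))
}
```

Under these assumptions `quant = "q5k"` and `quant = "q5_k"` resolve to
the same variant, and an absent field leaves the weights unquantized.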
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces MaybeQuantLinear, which wraps either a plain candle Linear
or a candle QMatMul backed by a freshly-quantized QTensor. Forward
dispatches identically through the Module trait so downstream code
doesn't care which arm is active.
ColumnParallelLinear and RowParallelLinear gain `load_with_quant`
methods. The existing `load` methods stay as backward-compatible
no-quantization wrappers — no churn at the 27 existing call sites.
This is the foundation for in-situ quantization at load time. Wiring
the user-facing quant config and switching call sites to
load_with_quant follow in stages 8e-2 / 8e-3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces load_fused_qkv_slice_2d/_3d with reads from a separate
MmapedSafetensors handle. Each per-rank fused tensor is built by
reading the three region byte-slices directly from the mmap,
concatenating them host-side, and uploading as one device
allocation — no full-fused-tensor device materialisation.
The prior approach allocated a ~100 MB transient device tensor
per linear-attention layer; on Qwen3.6-27B with 48 linear-attn
layers that's ~4.8 GB of allocator churn during load — enough
to fragment the cuda caching allocator on a tight-VRAM 32 GB
consumer GPU, which is what triggered the layer-22 up_proj
OOM seen on beast.
Threading: MmapedSafetensors flows worker → ForCausalLM →
Model → DecoderLayer → GatedDeltaNet::load. Both leader (mod.rs)
and worker (worker.rs) construct their own mmap; Linux's page
cache shares the underlying pages.
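A sketch of the per-rank region arithmetic, assuming the fused tensor
is laid out along dim 0 as `[key_dim, key_dim, value_dim]` and each
region divides evenly by world_size. `fused_qkv_ranges` and
`gather_rows` are hypothetical helpers operating on a plain byte
slice; the real code reads from the MmapedSafetensors handle.

```rust
/// Row ranges (start, len) along dim 0 that `rank` needs from a fused
/// [Q | K | V] tensor. Returns None when the regions do not divide
/// evenly across ranks.
fn fused_qkv_ranges(
    key_dim: usize,
    value_dim: usize,
    world_size: usize,
    rank: usize,
) -> Option<[(usize, usize); 3]> {
    if world_size == 0
        || rank >= world_size
        || key_dim % world_size != 0
        || value_dim % world_size != 0
    {
        return None;
    }
    let (k_per, v_per) = (key_dim / world_size, value_dim / world_size);
    Some([
        (rank * k_per, k_per),               // first key_dim region
        (key_dim + rank * k_per, k_per),     // second key_dim region
        (2 * key_dim + rank * v_per, v_per), // value_dim region
    ])
}

/// Concatenate the three row ranges host-side from a row-major buffer.
/// `row_bytes` is the byte size of one row of the fused tensor.
fn gather_rows(bytes: &[u8], row_bytes: usize, ranges: &[(usize, usize); 3]) -> Vec<u8> {
    let mut out = Vec::new();
    for &(start, len) in ranges {
        out.extend_from_slice(&bytes[start * row_bytes..(start + len) * row_bytes]);
    }
    out
}
```

Assuming key_dim = 16 x 128 = 2048 and value_dim = 48 x 128 = 6144 for
Qwen3.6-27B, rank 0 of 2 reads rows (0, 1024), (2048, 1024) and
(4096, 3072), and only those bytes are ever copied before the upload.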
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps each TpQwen3_5DecoderLayer::load in a with_context that captures
free/total VRAM on failure, plus an info-level log after every layer
that succeeds. Uses cudarc::driver::result::mem_get_info — same API
mistralrs uses.
Diagnostic only: forward path is unchanged. Helps distinguish true
VRAM exhaustion from allocator fragmentation when loading large
models at BF16 on 2x consumer GPUs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run_fused_gating helper consolidates the per-layer gating math:
    beta = sigmoid(b)
    g    = -exp(a_log) * softplus(a + dt_bias)
CUDA path issues a single launch via fused_gdn_gating_cuda;
cpu path falls back to the original per-op Rust sequence. Replaces
~10 candle launches per linear-attention layer (sigmoid + 2× to_dtype
+ exp + neg + broadcast_add + softplus + 2× unsqueeze + broadcast_mul)
across both single-GPU and TP forward paths.
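For reference, the gating math on scalars. This is a sketch of the
formulas only, not of the kernel: `gdn_gating` is a hypothetical name,
and the softplus cutoff of 20 is an assumption borrowed from the common
PyTorch default.

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// softplus(x) = ln(1 + e^x), switching to the identity for large x
/// where e^x would overflow. The cutoff value is an assumption.
fn softplus(x: f32) -> f32 {
    if x > 20.0 {
        x
    } else {
        x.exp().ln_1p()
    }
}

/// Returns (beta, g) for one head at one token.
fn gdn_gating(a: f32, b: f32, a_log: f32, dt_bias: f32) -> (f32, f32) {
    let beta = sigmoid(b);
    let g = -a_log.exp() * softplus(a + dt_bias);
    (beta, g)
}
```

Because exp and softplus are both positive, g is never positive, so
exp(g) acts as a decay factor in (0, 1].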
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run_delta_rule_cuda now picks between the per-token kernel and the
BT=64 chunked variant based on seq_len. Threshold = 64 matches mistralrs.
Prefill on Qwen3.6-27B (typical seq_len in the hundreds) drops from
one block-launch per token to one per 64-token chunk.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the per-layer conv1d + silu sequence in both single-GPU and
TP linear-attention forward paths with a shared run_causal_conv1d
helper that dispatches to:
- causal_conv1d_update for decode (seq_len=1 with existing conv_state)
- causal_conv1d_full for prefill / fresh start (zero-pads internally)
Both kernels fuse the depthwise conv + SiLU into a single launch — 4×
fewer cuda launches per linear-attention layer vs the candle conv1d +
candle_nn::ops::silu combo. Falls back to the original Rust path on
cpu.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TP per-token Rust loop replaced with shared run_delta_rule dispatch
from arch/qwen3_5/linear_attn.rs. Both single-GPU and TP variants now
use the cuda kernel when available, per-token Rust fallback otherwise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the per-token Rust delta-rule loop in
`arch/qwen3_5/linear_attn.rs::GatedDeltaNet::forward` with a single
dispatch to the `gated_delta_rule_recurrence` kernel imported from
mistralrs in 1ebbe87.
The kernel is V-tiled with compile-time BK (one block per (V-tile,
batch*head), one thread per V-column, BK state floats in registers).
For Qwen3.6's per-rank `(B=1, H=24, D_k=128, D_v=128)` shape this
collapses ~6 candle tensor-op launches per token per layer (each
~50µs CUDA dispatch overhead, so ~300µs/token/layer × 48 linear-
attention layers = 14ms in launch overhead alone) to a single
kernel launch with full ILP / register residency.
New free function `run_delta_rule`:
- cuda branch (when q is on a CUDA device): flattens
`(B, H, ...)` → `(BH, ...)`, dispatches the kernel via
`crate::cuda::gdn::gated_delta_rule_recurrence_cuda`, reshapes
outputs back to `(B, H, L, D_v)` and state to `(B, H, D_k, D_v)`.
- cpu fallback: the original per-token Rust loop, unchanged. Keeps
cargo test --workspace passing on hosts without cuda.
Dispatch decision lives in the wrapper (`q.device().is_cuda()`).
Build: `cargo build -p neuron --features cuda` compiles + links;
clippy clean on both CPU and cuda paths. 32 lib tests still pass
(none of them exercise this code path on cuda; smoke test for the
TP variant is the deployed Tbilisi probe).
Stage 8d-3 wires the conv1d kernels; 8d-4 the chunked prefill;
8d-5 the same wiring for `tp/tp_qwen3_5.rs`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 8d (new): port the Gated DeltaNet CUDA kernels from
EricLBuehler/mistral.rs to close the ~500x decode performance gap
we measured on Qwen3.6-27B TP-2 (~12s/token in our pure-candle path
vs ~37 T/s in mistralrs on the same hardware).
This commit lays the build infrastructure with zero behavioural
change. Subsequent commits (8d-2 .. 8d-5) wire each kernel into the
qwen3_5 architecture and TP variant.
Added:
- `crates/neuron/build.rs` — uses `cudaforge::KernelBuilder` to compile
every `src/cuda/*.cu` file into `libneuroncuda.a` under the `cuda`
feature, then links it + `cudart`. Mirrors mistralrs's
`mistralrs-core/build.rs` setup verbatim (same NVCC flag set, same
sm_<80 bf16 gate).
- `crates/neuron/src/cuda/gdn.cu` — five kernels ported verbatim from
upstream:
* `gated_delta_rule_recurrence` (V-tiled per-token decode)
* `chunked_gated_delta_rule_recurrence` (BT=64 chunked prefill)
* `causal_conv1d_update` (single-token conv decode)
* `causal_conv1d_full` (multi-token conv prefill)
* `fused_gdn_gating` (beta = sigmoid(b); g = -exp(A_log) *
softplus(a + dt_bias))
- `crates/neuron/src/cuda/gdn.rs` — Rust wrappers around the kernels,
cudarc::CudaSlice::device_ptr boilerplate identical to upstream.
- `crates/neuron/src/cuda/ffi.rs` — `extern "C"` decls (subset of
upstream's ffi.rs covering only the five GDN kernels; MoE / SSM /
top-k decls land here when we absorb those too).
- `crates/neuron/src/cuda/mod.rs` — re-exports + module docs.
Cargo wiring: `cudaforge` added as an optional build-dep, activated
by the `cuda` feature. CPU build is unchanged (the `cuda/` module is
fully `#[cfg(feature = "cuda")]`). The cuda feature build inside the
patched container compiles `gdn.cu` (1 of 1 kernels) and links
clean.
Licensing: upstream files preserve their MIT origin via per-file
comment banners pointing to the mistralrs path. No behaviour-relevant
edits to the .cu kernels — local diff against upstream is just the
banner. The `.rs` wrappers and `ffi.rs` subset are also from upstream;
their structure (module path `crate::cuda::ffi::*`) matches identically
so future kernel imports drop in unchanged.
CPU clippy + 32 lib tests pass; `cargo clippy --features cuda` clean
inside the runner container.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes addressing operator visibility into TP inference + the
HTTP-cancellation poisoning chain:
1. `chat_completion_tp` now runs its body inside `tokio::spawn`. When
the HTTP client disconnects (curl --max-time, browser nav, etc.)
the future returned from `chat_completion_tp` gets dropped, but
the spawned task keeps running to completion — finishing every
`pool.generate_step` / `pool.clear_kv_cache` to drain the worker
pipes. The next inference request then finds a clean pool.
Previously, a dropped future left workers still processing the
in-flight request, so the next call's `ClearKvCache` recv would
read the stale `GenerateStepOk` from the abandoned step ("rank N
expected KvCacheCleared, got GenerateStepOk"). The drain-on-
leader-error fix from d1a4aad covered Rust-side leader failures
but not HTTP-layer cancellation, which is what we actually hit
on the user's Qwen3.6 test.
2. Tracing throughout the TP path so journalctl shows where an
inference spends its time without needing to surface harness
internals via the HTTP error body:
- `chat_completion_tp_inner` (now a free fn so it can run inside
spawn): `info` at request start (prompt_len, max_new, temp,
top_p, eos_id), `info` per major phase (prefill complete with
elapsed_ms, decode complete with elapsed_ms + token count),
`info` at completion (total_ms, finish_reason). `debug` for
pool-lock acquisition + kv-cache clear timing. `trace` per
decode step (next_token, step_ms).
- `WorkerPool::generate_step` (leader side): `debug` at fan-out,
`debug` after leader forward returns with elapsed_ms + ok flag,
`debug` after drain with errors count + total_ms.
- `WorkerPool::clear_kv_cache`: matching `debug` at fan-out + drain.
- `worker::handle_generate_step`: `debug` at forward start + done
with elapsed_ms, `warn` on forward failure with the full error.
The default log filter is already `info,neuron=debug` so the
operator gets every `info` and `debug` line by default; `trace`
needs RUST_LOG=trace for per-step decode timing.
Stage 7c-ii crash-detection is still future work; this is the
minimum that makes the "where did the 120s go" question answerable
from the logs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TP-2 inference probe against Qwen3.6-27B surfaced:
    worker rank 1 ClearKvCache: expected KvCacheCleared, got
    GenerateStepOk
Caused by pipe poisoning. The previous shape of `generate_step`:
    for w in workers { w.send_only(GenerateStep) }   // 1. fan-out
    let logits = spawn_blocking(leader.forward)??;   // 2. early return on err
    for w in workers { w.recv_only() }               // 3. drain (skipped on 2's err)
When step 2 returned `Err` (e.g. a dtype mismatch we hadn't seen
before, an OOM, a downstream squeeze that didn't match the shape),
the function bailed before step 3 — but workers had already written
`GenerateStepOk` to their stdout pipes, since their forwards (and
the NCCL collectives inside) completed independently of the leader's
post-collective Rust-side work.
The next call (typically `ClearKvCache` at the start of the *next*
inference request) would then send a fresh request and read those
stale replies as if they were the new operation's. Once a pipe is
poisoned, every subsequent call surfaces the same shape of error
even though nothing's actually broken.
Fix: introduce two helpers in `tp/mod.rs`:
- `drain_workers(workers, check)` — reads exactly one response from
every worker regardless of individual outcomes. Returns
`Vec<String>` of `rank N: detail` strings for any non-OK reply.
- `combine_leader_workers(leader, worker_errs, op)` — folds the
leader's `Result<Result<T>>` (the spawn_blocking shape) with the
worker drain into a single `Result<T>`. Leader failure takes
precedence but worker errors get appended so both halves surface.
`generate_step` and `clear_kv_cache` now use this pattern. Worst case:
both halves fail and the operator sees a combined error message;
either way the pipes are always drained so the next call's recv
matches the request it sent.
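The drain-then-combine pattern can be sketched as follows. The types
are toy stand-ins: `Reply` replaces the RPC response enum, the `recv`
closure replaces `w.recv_only()`, errors are plain strings, and worker
ranks are assumed to start at 1 with the leader as rank 0. The
signatures differ from the real helpers in `tp/mod.rs`.

```rust
/// One worker's reply to a fanned-out request (toy stand-in).
#[derive(Debug, Clone, PartialEq)]
enum Reply {
    Ok,
    Err(String),
}

/// Read exactly one reply from every worker, never stopping early,
/// and collect `rank N: detail` strings for the failures.
fn drain_workers<F>(num_workers: usize, mut recv: F) -> Vec<String>
where
    F: FnMut(usize) -> Reply,
{
    let mut errs = Vec::new();
    for rank in 1..=num_workers {
        if let Reply::Err(detail) = recv(rank) {
            errs.push(format!("rank {rank}: {detail}"));
        }
    }
    errs
}

/// Fold the leader's nested result (outer = task join, inner = forward)
/// with the worker errors. Leader failure comes first in the message;
/// worker errors are appended so both halves surface.
fn combine_leader_workers<T>(
    leader: Result<Result<T, String>, String>,
    worker_errs: Vec<String>,
    op: &str,
) -> Result<T, String> {
    let leader = match leader {
        Ok(inner) => inner,
        Err(join) => Err(format!("leader task failed: {join}")),
    };
    let workers = worker_errs.join("; ");
    match (leader, worker_errs.is_empty()) {
        (Ok(v), true) => Ok(v),
        (Ok(_), false) => Err(format!("{op}: workers failed: {workers}")),
        (Err(e), true) => Err(format!("{op}: leader failed: {e}")),
        (Err(e), false) => Err(format!("{op}: leader failed: {e}; workers: {workers}")),
    }
}
```

The key property is that `drain_workers` runs to completion before the
leader's result is inspected, so an early leader error can no longer
leave unread replies in the pipes.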
Note: a model whose pipes are already poisoned stays poisoned. The
operator needs to either `POST /models/unload` + reload, or
`systemctl restart neuron`, to recover. The fix prevents *future*
desync; it doesn't repair existing stale pipe state.
Stage 7c-ii crash detection was tracked as the canonical solution to
this class of issue; this is the minimum-viable subset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `harness/tp/tp_qwen3_5.rs` — the tensor-parallel variant of the
Qwen3-Next architecture — plus the dispatch wiring needed to route a
load through it on both the leader and the workers.
Architecture pieces (all per-rank, follow `tp_qwen3.rs` patterns for
the full-attention layers + a new pattern for linear-attention):
- TpQwen3_5GatedDeltaNet: V-head-dim sharded. `num_v_heads / world_size`
V-heads per rank, `num_k_heads / world_size` K-heads. `in_proj_z`,
`in_proj_b`, `in_proj_a`, `A_log`, `dt_bias` shard uniformly along
the V-head dim. `out_proj` is row-parallel + AllReduce (the only
collective inside the block). The recurrent state shards 1:1 with
V-heads — no cross-rank sync inside the delta-rule loop.
`in_proj_qkv` and `conv1d.weight` are FUSED tensors with three
regions along dim 0 (`[first key_dim, second key_dim, value_dim]`).
Standard uniform slicing doesn't align with the region boundaries:
on Qwen3.6-27B (regions of 2048, 2048 and 6144 rows) rank 0 of 2
would get both key regions whole plus the first 1024 value rows.
New `load_fused_qkv_slice_{2d,3d}` helpers load the full
tensor, narrow per-region per-rank, and `Tensor::cat` the three
slices into a per-rank fused weight. Transient peak of one full
tensor per layer during construction; net memory is properly per-
rank after the full drops.
- TpQwen3_5Attention: column-parallel `q_proj` (the widened
`2 * num_heads * head_dim` output, including the gate half — shards
along the head axis so both query AND gate halves stay consistent
per rank), `k_proj`, `v_proj`; row-parallel `o_proj` with AllReduce.
Otherwise mirrors `tp_qwen3.rs`'s attention.
- TpQwen3_5MLP, TpQwen3_5DecoderLayer (dispatches on layer_types),
TpQwen3_5Model (with `model.language_model.*` prefix), and
TpQwen3_5ForCausalLM (with tied or separate `lm_head` at top level).
Dispatch wiring:
- New `tp::TpLeaderModel` enum holds either Qwen3 or Qwen3_5 variant.
`WorkerPool::load_dense_shard` now dispatches on `model_type` from
the config JSON and returns `Arc<Mutex<TpLeaderModel>>`. The two
downstream methods (`generate_step`, `clear_kv_cache`) thread this
enum through — the inner forward+clear_kv_cache dispatch happens
via the enum's pub methods. Adding another TP architecture later is
one more enum variant + match arms.
- Worker side gets a parallel `WorkerModel` enum + dispatch in
`handle_load_dense_shard`, branching on the same `model_type`.
- Harness gate `TP_SUPPORTED_MODEL_TYPES` now `["qwen3", "qwen3_5"]`.
`TpLoadedModel.leader_model` retyped to the enum.
Helpers in `arch/qwen3_5/linear_attn.rs`:
- `softplus` and `repeat_interleave` made `pub(crate)` so the TP
module reuses them rather than duplicating.
Reuses unchanged: `Qwen3_5RmsNorm` (replicated weight), the gated
`Qwen3_5RmsNormGated` tail, `l2norm`, the `RotaryEmbedding` (partial
RoPE with `partial_rotary_factor` already correct).
CPU build + clippy + 32 lib tests pass; `cargo clippy --features cuda`
also clean inside the patched runner container.
One outstanding risk to call out: tensor names. For full-attention
layers the per-layer prefix is `model.language_model.layers.<i>.self_attn.*`
and for linear-attention layers `model.language_model.layers.<i>.linear_attn.*`
— the same as the single-GPU path. lm_head sits at the top level (not
under `language_model`) — consistent with the single-GPU path that
validated against Qwen3.5-0.8B.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The single-GPU dense load of Qwen/Qwen3.5-0.8B succeeded but the first
inference forward bombed with `dtype mismatch in mul, lhs: F32, rhs:
BF16`. Trace through the recurrent delta-rule loop:
    let q = (q.to_dtype(F32)? * scale)?;   // F32
    let k = k.to_dtype(F32)?;              // F32
    let v = v.to_dtype(F32)?;              // F32
    // g built from A_log/dt_bias          // F32
    // beta = sigmoid(b)                   // BF16 (sigmoid preserves dtype)
    ...
    let delta = (v_t - kv_mem)?.broadcast_mul(&beta_col)?;
                ^^^^^^^^^^^^^^                ^^^^^^^^^
                F32                           BF16 ← mismatch
`g` was already F32 because it was constructed from `a_log.to_dtype(F32)`
+ `dt_bias.to_dtype(F32)` earlier in the function. `beta` came from
`sigmoid(b)` where `b` was the model dtype (BF16), so beta stayed BF16
and the multiplication tripped candle's dtype-mismatch check.
Promote beta to F32 at the same point we promote q/k/v.
Caught by the validate-neuron.sh probe against Qwen/Qwen3.5-0.8B on
beast — load returned 200, then `POST /v1/chat/completions` returned
the dtype error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3-Next is a multimodal architecture whose text core sits under
`model.language_model.*` — sibling to `model.visual.*` (vision tower)
and to top-level `lm_head` / `mtp.*`. Every text-side tensor in the
safetensors files carries that prefix:
    model.language_model.embed_tokens.weight
    model.language_model.layers.{i}.{input,post_attention}_layernorm.weight
    model.language_model.layers.{i}.linear_attn.{in_proj_*, conv1d.weight, A_log, dt_bias, norm.weight, out_proj.weight}
    model.language_model.layers.{i}.self_attn.{q,k,v,o}_proj.weight + {q,k}_norm.weight
    model.language_model.layers.{i}.mlp.{gate,up,down}_proj.weight
    model.language_model.norm.weight
    lm_head.weight (top-level; not under language_model)
The single pre-emptive fix is in Qwen3_5Model::load: derive a
`text_vb = vb.pp("model.language_model")` once and walk
embed_tokens / layers / norm from there. `lm_head` stays at the
top-level VB; that path was already correct.
The non-text tensors (`model.visual.*`, `mtp.*`) are ignored: we
don't reference them, so the safetensors mmap is fine even though
the bytes are loaded into the address space.
After this, the load that was failing at
"cannot find tensor model.embed_tokens.weight" should proceed to
materialising the actual layer weights — where any further bugs
will be substantive architecture issues rather than naming ones.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two interlocked bugs surfaced trying to load Qwen/Qwen3.5-0.8B (and
the same applies to Qwen/Qwen3.6-27B):
1. Qwen3-Next config.json does NOT have a top-level `rope_theta`.
It lives inside `rope_parameters: { rope_theta, partial_rotary_factor,
rope_type, mrope_section, mrope_interleaved }`. Our TextConfig
declared `rope_theta` as a non-optional top-level field, so the
deserializer bailed with the misleading "missing field
`rope_theta` at line 74 col 5".
Replaced with a nested `RopeParameters` struct that mirrors the
upstream shape. Defaults are conservative (rope_theta=10000,
partial_rotary_factor=1.0) so a missing or partial block degrades
to standard full-rotation RoPE rather than failing.
2. `partial_rotary_factor: 0.25` means only `head_dim * 0.25 = 64` of
the 256 head_dim values get RoPE applied — the rest pass through
unchanged. Our RotaryEmbedding was building the inv_freq table
for the full head_dim and rotating everything. Silently wrong
for every full-attention layer.
`RotaryEmbedding` now derives `rotary_dim` from
`head_dim * partial_rotary_factor`, builds its cos/sin tables at
that smaller size, and in `apply()` splits q/k into (rotate, pass)
on the last dim, only `rope_slow`-rotates the rotate half, and
re-concatenates. Mirrors the reference Python's
`apply_rotary_pos_emb` exactly for the non-trivial
`partial_rotary_factor` case.
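A scalar sketch of partial rotation on one head vector, assuming the
rotate-half convention of the reference `apply_rotary_pos_emb` (pairs
are `(i, i + rotary_dim/2)`). `apply_partial_rope` and `rotary_dim` are
hypothetical names; the real code works on batched tensors with
precomputed cos/sin tables.

```rust
/// Number of leading head_dim entries that get rotated.
fn rotary_dim(head_dim: usize, partial_rotary_factor: f32) -> usize {
    (head_dim as f32 * partial_rotary_factor) as usize
}

/// Rotate the first `rotary_dim` entries of `x` for position `pos`;
/// everything after them passes through unchanged.
fn apply_partial_rope(x: &[f32], pos: usize, rotary_dim: usize, theta: f32) -> Vec<f32> {
    assert!(rotary_dim % 2 == 0 && rotary_dim <= x.len());
    let half = rotary_dim / 2;
    let mut out = x.to_vec();
    for i in 0..half {
        // inv_freq is built over rotary_dim, not over the full head_dim.
        let inv_freq = theta.powf(-(2.0 * i as f32) / rotary_dim as f32);
        let (sin, cos) = (pos as f32 * inv_freq).sin_cos();
        let (a, b) = (x[i], x[i + half]);
        // x * cos + rotate_half(x) * sin, with rotate_half = [-x2, x1]
        out[i] = a * cos - b * sin;
        out[i + half] = b * cos + a * sin;
    }
    out
}
```

For head_dim = 256 and partial_rotary_factor = 0.25 this rotates the
first 64 entries and leaves the other 192 untouched.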
Tests updated: config-deserialise fixture uses the real `rope_parameters`
shape (matching the Qwen3.6-27B and Qwen3.5-0.8B configs). The
linear-attention forward-smoke test was already using full rotation
which still works; just shifted to the nested struct.
After this, the load that previously failed at "parse Qwen3-Next
(qwen3_5) config.json: missing field rope_theta" should reach the
actual safetensors materialisation step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Completes the single-GPU dense path for Qwen3-Next (Qwen3.6's
architecture). The four new modules wrap the substantive
`linear_attn.rs` (landed previously) with the rest of the
transformer:
- `arch/qwen3_5/rope.rs` — text-side rotary embedding. MRoPE is
simplified to plain RoPE (the three position grids collapse to one
for text-only inference); uses candle's `rope_slow` for the
GLM-style rotate-half rotation.
- `arch/qwen3_5/mlp.rs` — Qwen3_5MLP (SwiGLU: gate/up/down, bias=False).
- `arch/qwen3_5/full_attn.rs` — Qwen3_5Attention with the two
Qwen3-Next quirks:
- `q_proj` widened to `2 * num_heads * head_dim`; second half
sigmoid'd and multiplied into the attention output before `o_proj`.
- q_norm/k_norm use the `(1+w)*x` RmsNorm variant.
- `arch/qwen3_5/decoder.rs` — Qwen3_5DecoderLayer dispatching on
`layer_types[i]` to either Full attention or GatedDeltaNet.
`arch/qwen3_5/mod.rs` gets the real `Qwen3_5Model` (embedding + layer
stack + final norm) and `Qwen3_5ForCausalLM` (model + lm_head). The
forward returns `[B, 1, vocab]` to match `qwen3_dense`; the harness's
`squeeze_to_vocab` handles either shape.
Switch: `candle.rs::load_arch_dense` for `model_type=qwen3_5` now
builds a `ShardedVarBuilder` instead of a plain VarBuilder. The
sharded backend falls through to the unsharded path when
`world_size=1`, so single-GPU load is zero-cost; this lets the
forthcoming `tp_qwen3_5.rs` reuse the same load functions without a
second copy.
Verified: cargo build CPU + --features cuda inside the patched
container; clippy clean on both; 32 lib tests still pass. The
ForCausalLM forward no longer bails — but numerical correctness vs
the Python reference hasn't been validated yet (that's the next
step, with the Tbilisi probe).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the recurrent-path Gated DeltaNet block that occupies 48 of
Qwen3.6's 64 decoder layers (`layer_types[i] == "linear_attention"`).
Ported from `huggingface/transformers/models/qwen3_5/modeling_qwen3_5.py`
(`Qwen3_5GatedDeltaNet`, `torch_recurrent_gated_delta_rule`,
`Qwen3_5RMSNormGated`, `l2norm`).
Layout: `arch/qwen3_5.rs` becomes `arch/qwen3_5/` with submodules
- `mod.rs` — Config + (still-stub) ForCausalLM
- `linear_attn.rs` — GatedDeltaNet + GatedDeltaNetState
- `rmsnorm.rs` — Qwen3_5RmsNorm `(1+w)*x`, Qwen3_5RmsNormGated, l2norm
Architecture pieces in this commit:
- Block: in_proj_qkv + in_proj_z + in_proj_b + in_proj_a + out_proj
(all bias=False); depthwise causal Conv1d (k=4) with state-aware
prepend; SiLU; per-head reshape; L2norm on q,k.
- Discretisation: g = -exp(A_log) * softplus(a + dt_bias); beta = σ(b).
All computed in f32 to avoid the -inf underflow in fp16 that the
reference notes.
- Delta rule (recurrent, per-token):
    state *= exp(g_t)
    kv_mem = state^T · k_t
    delta = (v_t - kv_mem) * beta_t
    state += outer(k_t, delta)
    out_t = state^T · q_t
- Output: RMSNormGated(core_attn_out, z) → reshape → out_proj.
State (`GatedDeltaNetState`) lives inline on the layer:
- conv_state: (B, conv_dim, conv_kernel_size) — left-padded tail.
- recurrent_state: (B, num_v_heads, head_k_dim, head_v_dim) — the
delta-rule outer-product memory.
Cleared via `clear_kv_cache` at the start of every new request.
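The five delta-rule lines map directly onto a scalar reference for a
single head. This is a sketch for reading the math, not the shipped
tensor code: `DeltaState` and `delta_rule_step` are hypothetical names,
the state is a row-major `[d_k][d_v]` matrix, and the q scaling and
L2norm that happen before the loop are omitted.

```rust
/// Recurrent state for one head: a row-major [d_k][d_v] matrix.
struct DeltaState {
    d_k: usize,
    d_v: usize,
    s: Vec<f32>,
}

impl DeltaState {
    fn zeros(d_k: usize, d_v: usize) -> Self {
        Self { d_k, d_v, s: vec![0.0; d_k * d_v] }
    }
}

/// One token of the gated delta rule. `g` is the log-decay (<= 0),
/// `beta` the write strength in (0, 1). Returns out_t of length d_v.
fn delta_rule_step(
    st: &mut DeltaState,
    q: &[f32],
    k: &[f32],
    v: &[f32],
    g: f32,
    beta: f32,
) -> Vec<f32> {
    assert_eq!(q.len(), st.d_k);
    assert_eq!(k.len(), st.d_k);
    assert_eq!(v.len(), st.d_v);
    let (d_k, d_v) = (st.d_k, st.d_v);

    // state *= exp(g_t)
    let decay = g.exp();
    for x in st.s.iter_mut() {
        *x *= decay;
    }
    // kv_mem = state^T · k_t
    let mut kv_mem = vec![0.0f32; d_v];
    for i in 0..d_k {
        for j in 0..d_v {
            kv_mem[j] += st.s[i * d_v + j] * k[i];
        }
    }
    // delta = (v_t - kv_mem) * beta_t
    let delta: Vec<f32> = v.iter().zip(&kv_mem).map(|(v, m)| (v - m) * beta).collect();
    // state += outer(k_t, delta)
    for i in 0..d_k {
        for j in 0..d_v {
            st.s[i * d_v + j] += k[i] * delta[j];
        }
    }
    // out_t = state^T · q_t
    let mut out = vec![0.0f32; d_v];
    for i in 0..d_k {
        for j in 0..d_v {
            out[j] += st.s[i * d_v + j] * q[i];
        }
    }
    out
}
```

Writing the same (k, v) pair twice with beta = 1 and g = 0 leaves the
state unchanged on the second step, because the value already stored
under that key makes delta zero.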
Config extended with the qwen3_5-specific fields:
- linear_num_value_heads (48 in Qwen3.6-27B)
- linear_num_key_heads (16)
- linear_key_head_dim (128)
- linear_value_head_dim (128)
- linear_conv_kernel_dim (4)
- hidden_act ("silu")
Performance note: this is the **recurrent** delta-rule (PyTorch's
`torch_recurrent_gated_delta_rule`), correct for any seq_len but with
O(L) sequential steps at prefill. The chunked algorithm
(`torch_chunk_gated_delta_rule`, chunk_size=64) is a follow-up perf
optimisation; the surface stays the same.
8 unit tests:
- softplus small/large branches
- l2norm hand-calc + zero-vector stability
- repeat_interleave round-trip
- forward_smoke on tiny dims (4-head fixture) — verifies shape +
no NaN/Inf propagation through the f32-promotion pipeline. Doesn't
validate numerical correctness against the Python reference; that
requires a fixed-weight fixture and is the next step.
cargo clippy CPU + --features cuda both clean; 32 lib tests pass.
The ForCausalLM stub still bails on forward — wrapping
attention/MLP/decoder layer + lm_head is the next sub-stage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lays the wiring for the top-priority TP-2 target without doing the
substantive architecture work yet. After this commit, attempting to
load a Qwen3.6 (`model_type = "qwen3_5"`) model:
- Passes config.json parse — the real upstream shape (text_config
wrapper, layer_types, attn_output_gate, head_dim=256, etc.) round-
trips through a typed Config (unit test included).
- Constructs a placeholder Qwen3_5ForCausalLM, attaches it to a
ModelArch::Qwen3_5Dense variant, registers it in the loaded set.
- Fails on the first inference forward with a clear "Qwen3-Next
forward not implemented yet (Stage 8c, TP-2 motivator)" — the
point where the real architecture work begins.
New layout:
- `harness/arch/` for custom architectures candle-transformers doesn't
ship. Each architecture is one module: Config + ForCausalLM + impl.
- `harness/arch/qwen3_5.rs` — the scaffold. Heavy doc comments on the
open work: layer_types dispatch (full_attention vs linear_attention,
the latter being the hard part with no candle precedent),
attn_output_gate, text_config nesting, recurrent state lifecycle.
- DENSE_SUPPORTED_MODEL_TYPES adds "qwen3_5"; load_arch_dense gains a
branch that constructs the stub.
TP-side gate:
- New `check_tp_arch_supported`: even though Llama / Qwen3 MoE pass
the single-GPU dense check (DENSE_SUPPORTED_MODEL_TYPES), the
worker pool's `load_dense_shard` reconstructs the config as Qwen3
on every rank — silently misrouting a non-Qwen3 dense load through
it would surface as a cryptic per-rank deserialise error.
- TP_SUPPORTED_MODEL_TYPES = ["qwen3"] (cuda-gated). Anything else
bails *before* the worker pool spawns and NCCL handshake costs are
paid, with a marker pointing at the `tp_<family>.rs` module a
contributor would need to add. qwen3_5 specifically lands here
until its architecture is real.
The naming choice: keep "qwen3_5" from the model's own config.json
rather than mistralrs's "qwen3_next" — the latter ages poorly the
moment Qwen ship another architecture revision.
Unit tests: 2 new for qwen3_5 (config deserialise + dispatch gate);
the previously-rejecting test for qwen3_5 swapped to a fictional
arch so it stays meaningful as the supported set grows. 26 lib tests
pass; cargo clippy CPU + --features cuda both clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Broadens the single-GPU dense and quantized paths to cover three
non-Qwen3 architectures already shipped by candle-transformers. TP for
these is a separate stage (each family would need its own tp_*.rs
mirroring tp_qwen3.rs).
`ModelArch` gains four variants:
- LlamaDense (boxed — wraps Llama + an inline Cache + the config it
takes to rebuild the cache, since candle::llama::Cache has no reset)
- LlamaQuantized (candle_transformers::models::quantized_llama)
- Qwen3MoeDense (candle::models::qwen3_moe::ModelForCausalLM)
- Qwen3MoeQuantized (candle::models::quantized_qwen3_moe::GGUFQWenMoE
— takes an explicit compute dtype; F16 by default for best
consumer-GPU throughput)
The dispatch is method-based now:
- `ModelArch::forward(&mut self, input, offset) -> Result<Tensor>`
with a shared `squeeze_to_vocab` normalising shape differences
(qwen3 returns [B,1,V]; quantized_qwen3 returns [B,V]; new families
may differ again — the helper handles all of them).
- `ModelArch::clear_kv_cache(&mut self) -> Result<()>`. Llama needs
a Cache rebuild because its Cache has no in-place reset; the new
`LlamaDense` wrapper holds the bits needed to do it.
`run_inference` / `run_inference_streaming` collapse to a single
dispatch path: no more per-variant match arms in the hot loop, and
new architectures pick up streaming + non-streaming for free with
zero changes outside `ModelArch`.
DENSE_SUPPORTED_MODEL_TYPES is now ["llama", "qwen3", "qwen3_moe"].
GGUF arch switch grows "qwen3moe" + "llama" branches (qwen3moe with
no underscore matches llama.cpp's general.architecture convention).
Stage 8a's diagnostic auto-reports the new supported set.
The `LlamaDense` variant is boxed because the wrapper's inline Cache
+ Config makes it 544 bytes vs ~300 for everything else
(clippy::large_enum_variant).
Verified: cargo test --workspace passes 66 tests; cargo clippy CPU
and `--features cuda` both clean (the cuda check ran inside the
locally-built `neuron-build-local` container with the math_functions.h
patch applied).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A request to load Qwen/Qwen3.6-27B (model_type "qwen3_5") on the
dense path was failing deep inside serde with:
missing field `vocab_size` at line 140 column 1
…because Qwen3.6 wraps its actual hyperparameters under `text_config`,
so none of `qwen3::Config`'s expected top-level fields are present.
The error gave no hint that the *architecture* was the problem.
`check_dense_config_supported` parses `config.json` as an untyped
JSON Value, inspects `model_type` (with `architectures` as bonus
context), and bails cleanly when it's not in the supported set
(currently `["qwen3"]`). The error names the rejected type, the
supported set, and points at the files a contributor needs to touch
to extend coverage — both the single-process `ModelArch` variants in
`candle.rs` and the TP analogue in `tp_qwen3.rs`.
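The gate described above could look roughly like this. It is a sketch only: the real `check_dense_config_supported` parses `config.json` itself, whereas this hypothetical `check_model_type_supported` takes the already-extracted `model_type` so it stays std-only, and the error wording is invented.

```rust
/// Stand-in for the supported set named in the commit message.
const DENSE_SUPPORTED_MODEL_TYPES: &[&str] = &["qwen3"];

/// Hypothetical helper: `model_type` is None when the field is absent
/// from config.json.
fn check_model_type_supported(model_type: Option<&str>) -> Result<(), String> {
    let Some(found) = model_type else {
        return Err("config.json has no `model_type` field".to_string());
    };
    if DENSE_SUPPORTED_MODEL_TYPES.contains(&found) {
        return Ok(());
    }
    // Name the rejected type and the supported set so the failure points
    // at the architecture rather than at a missing serde field.
    Err(format!(
        "unsupported model_type `{found}`; supported: {:?}",
        DENSE_SUPPORTED_MODEL_TYPES
    ))
}
```

Running the check before the typed deserialize is what turns the `missing field` serde error into an architecture-level message.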
Wired into both load paths:
- `load_arch_dense` (single-GPU), before the typed deserialize.
- `load_tp`, before spawning the worker pool — TP loads of an
unsupported arch now fail before NCCL/init costs are paid.
4 unit tests cover the accept/reject/missing-field/malformed cases.
Bonus: makes Stage 8b/8c work easier — adding a new architecture is
now a `DENSE_SUPPORTED_MODEL_TYPES` edit + ModelArch variant + load
branch, with the diagnostic automatically listing the supported set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the candle harness's HuggingFace cache directory with the
following precedence (first hit wins):
1. Explicit `hf_cache` in `[harness.candle]` from neuron.toml.
2. `HF_HUB_CACHE` env var — the Python `huggingface_hub` convention.
The Rust hf-hub crate doesn't read this natively, so we bridge here.
3. `HF_HOME` env var (`$HF_HOME/hub` per the canonical layout).
4. None — falls through to hf-hub's own default.
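The precedence above can be sketched as a pure function. The name `resolve_hf_cache` and its signature are assumptions; the values are passed in rather than read from the environment so the ordering is easy to test, and whether empty strings should count as unset is left open.

```rust
use std::path::PathBuf;

/// Hypothetical resolver: first hit wins, `None` defers to hf-hub's default.
fn resolve_hf_cache(
    explicit: Option<&str>,     // `hf_cache` from `[harness.candle]`
    hf_hub_cache: Option<&str>, // value of HF_HUB_CACHE, if set
    hf_home: Option<&str>,      // value of HF_HOME, if set
) -> Option<PathBuf> {
    explicit
        .map(PathBuf::from)
        .or_else(|| hf_hub_cache.map(PathBuf::from))
        // HF_HOME points at the parent directory; the cache lives in `hub`.
        .or_else(|| hf_home.map(|home| PathBuf::from(home).join("hub")))
}
```

A caller would typically feed it `std::env::var("HF_HUB_CACHE").ok().as_deref()` and the equivalent for `HF_HOME`.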
Honouring HF_HUB_CACHE lets a neuron host reuse an existing cache
directory shared with Python tooling or other harnesses on the same
host without per-tool config. The canonical per-host setup is a
systemd drop-in:
/etc/systemd/system/neuron.service.d/local.conf
[Service]
Environment=HF_HUB_CACHE=/archive/hf-cache
neuron.example.toml documents the resolution chain inline.
script/validate-neuron.sh: bump LOAD_TIMEOUT from 600s to 3600s and
expose both load/infer timeouts via env (NEURON_LOAD_TIMEOUT,
NEURON_INFER_TIMEOUT). A Qwen3.6-class dense model is ~54 GB and was
hitting the 10-min ceiling cold-downloading on a residential link.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the previous commit's naming of specific helexa neuron hosts
in the shipped example catalogue (`models.example.toml`) — the example
is supposed to be a generic starting point that any operator copies
and adapts, not a record of one particular fleet's layout.
- `pinned_on` in the TP example uses the placeholder
`"your-multi-gpu-neuron"`. Other entries keep the model ids
(since those are HuggingFace-canonical, not fleet-specific).
- New `models.toml` at repo root holds the helexa-fleet catalogue
(beast / benjy / quadbrat). Added to `.gitignore` alongside
`cortex.toml` — both are operator-owned, gitignored, RPM-marked
`%config(noreplace)`, and synced by `deploy.sh`.
- `deploy.sh` now rsyncs `models.toml` to `/etc/cortex/models.toml`
on the gateway host on the same lifecycle as `cortex.toml`. Skips
cleanly when no local file exists, so users without a catalogue
aren't surprised by silent overwrites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Realises [project-unified-models-endpoint]: cortex now surfaces every
model the operator has provisioned in the catalogue, transparently
cold-loads on the first request, and routes the request once the load
is done — without per-node configuration or client awareness of which
neuron hosts what.
cortex-core changes:
- NodeState gains `discovery: Option<DiscoveryResponse>` — populated
once per neuron on first successful poll, cached forever after
(topology is invariant for a neuron process).
- ModelProfile gains `is_feasible_on(neuron, devices)` with the
pinned_on / min_devices / min_device_vram_mb logic + 5 unit tests.
- CortexModelEntry expanded with OpenAI-compatible fields (`id`,
`object`, `created`, `owned_by`) plus helexa-specific extension fields
(`loaded`, `feasible_on`, `locations`).
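One plausible reading of the pinned_on / min_devices / min_device_vram_mb check, as a sketch. The struct here is a trimmed, hypothetical stand-in for `ModelProfile`, and the exact semantics (an empty `pinned_on` meaning "any neuron", devices passed as a list of VRAM sizes) are assumptions rather than the crate's definition.

```rust
/// Hypothetical, trimmed-down profile; the real ModelProfile has more fields.
struct ModelProfile {
    pinned_on: Vec<String>,
    min_devices: usize,
    min_device_vram_mb: u64,
}

impl ModelProfile {
    /// `device_vram_mb` holds one entry per device the neuron reported.
    fn is_feasible_on(&self, neuron: &str, device_vram_mb: &[u64]) -> bool {
        // Assumption: an empty pin list means the model may run anywhere.
        if !self.pinned_on.is_empty() && !self.pinned_on.iter().any(|n| n == neuron) {
            return false;
        }
        // Count the devices that clear the per-device VRAM floor.
        let big_enough = device_vram_mb
            .iter()
            .filter(|&&mb| mb >= self.min_device_vram_mb)
            .count();
        // Treat min_devices = 0 as "at least one device".
        big_enough >= self.min_devices.max(1)
    }
}
```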
cortex-gateway changes:
- poller.rs: `maybe_poll_discovery` fetches `GET /discovery` once per
neuron and caches on NodeState.
- handlers.rs::list_models rewritten as union of (catalogue × topology
feasibility) + (currently loaded somewhere). Catalogue-defined models
surface even when not yet loaded.
- router.rs::resolve gains priority 3 (catalogue cold-load):
1. loaded somewhere → route there
2. unloaded somewhere → route + lazy load via neuron
3. in catalogue → pick feasible neuron, POST /models/load, wait,
route. Cache the new entry locally so subsequent requests skip
the poll wait.
4. else 404
- pick_feasible_neuron prefers pinned_on neurons, falls back to any
feasible one (stable by name).
- profile_to_spec translates ModelProfile → ModelSpec, picking devices
by VRAM floor and setting tensor_parallel = min_devices for multi-
device profiles.
- "already loaded" responses from neuron are tolerated (two concurrent
requests racing the same cold-load is a benign outcome).
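The four-step resolution order and the pinned-first pick can be sketched as below. `Route`, the argument shapes, and the `(name, pinned)` tuples are hypothetical simplifications; the real `router.rs::resolve` also performs the load request and the cache update, which are omitted here.

```rust
#[derive(Debug, PartialEq)]
enum Route {
    Loaded(String),   // 1. already loaded somewhere
    LazyLoad(String), // 2. known but unloaded on a neuron
    ColdLoad(String), // 3. catalogue entry, load on a feasible neuron
    NotFound,         // 4. 404
}

/// Prefers pinned neurons, then falls back to any feasible one,
/// breaking ties by name so the choice is stable.
fn pick_feasible_neuron(feasible: &[(&str, bool)]) -> Option<String> {
    feasible
        .iter()
        // `false < true`, so negate `pinned` to sort pinned entries first.
        .min_by_key(|entry| (!entry.1, entry.0))
        .map(|entry| entry.0.to_string())
}

/// `feasible` is empty when the model is not in the catalogue or no
/// neuron can host it.
fn resolve(
    loaded_on: Option<&str>,
    unloaded_on: Option<&str>,
    feasible: &[(&str, bool)],
) -> Route {
    if let Some(neuron) = loaded_on {
        return Route::Loaded(neuron.to_string());
    }
    if let Some(neuron) = unloaded_on {
        return Route::LazyLoad(neuron.to_string());
    }
    match pick_feasible_neuron(feasible) {
        Some(neuron) => Route::ColdLoad(neuron),
        None => Route::NotFound,
    }
}
```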
models.example.toml rewritten to reflect the canonical helexa fleet
(beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060) with a
working TP example (Qwen3.6-27B pinned on beast) plus single-GPU
profiles for the smaller models.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`chat_completion_stream` no longer returns an error for TP loads. The
new `chat_completion_tp_stream` mirrors the non-streaming TP path
(clear_kv_cache, prefill, sample, decode loop) but emits one
`ChatCompletionChunk` per generated token over an mpsc channel so the
handler can write a streaming SSE response.
Unlike the single-GPU streaming path (which runs candle's forward
inside `spawn_blocking` and uses `blocking_send`), the TP loop is
itself async — every `pool.generate_step` already awaits the leader's
own spawn_blocking forward plus every worker's recv_only. So the
orchestration runs as a plain `tokio::spawn` task using `Sender::send`.
The shared `emit_chunk` helper tracks the cumulative decoded prefix and
emits the delta — same UTF-8-safe BPE boundary handling as the
single-GPU streaming path.
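A sketch of the cumulative-prefix idea behind `emit_chunk`. The `DeltaEmitter` type is hypothetical, and the boundary test is an assumption: it treats a trailing U+FFFD in the decoded text as "the last token ended mid-character", which is how many tokenizers surface incomplete byte sequences but may not be how the real helper detects it.

```rust
/// Hypothetical tracker for how much decoded text has already been sent.
struct DeltaEmitter {
    emitted_bytes: usize,
}

impl DeltaEmitter {
    fn new() -> Self {
        Self { emitted_bytes: 0 }
    }

    /// `decoded` is the full decode of every token generated so far.
    /// Returns the not-yet-sent suffix, or None if nothing should be sent.
    fn next_delta(&mut self, decoded: &str) -> Option<String> {
        // Hold back while the tail is an incomplete character.
        if decoded.ends_with('\u{FFFD}') {
            return None;
        }
        // `get` returns None instead of panicking if the offset is past
        // the end or not on a char boundary.
        let delta = decoded.get(self.emitted_bytes..)?;
        if delta.is_empty() {
            return None;
        }
        self.emitted_bytes = decoded.len();
        Some(delta.to_string())
    }
}
```

Decoding the whole prefix each time and diffing avoids emitting half of a multi-byte character when a BPE token boundary falls inside it.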
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the in-flight TP machinery (Stage 7a workers, 7b-iii sharded
Qwen3) end to end so a non-streaming chat completion can run across
multiple GPUs via NCCL.
RPC additions (tp/rpc.rs):
- LoadDenseShard{model_id, config_json, safetensors_paths}
- GenerateStep{model_id, tokens, offset}
- ClearKvCache{model_id}
- UnloadModel{model_id}
- LoadDenseShardOk / GenerateStepOk / KvCacheCleared / Unloaded
Worker side (tp/worker.rs):
- WorkerState gains a `models: HashMap<String, TpQwen3ForCausalLM>`
keyed by model_id. LoadDenseShard mmaps safetensors via
ShardedVarBuilder (only this rank's slice materialises), builds the
TP model with the rank's NCCL Comm cloned from NcclState.
- GenerateStep runs the rank-local forward; the resulting logits are
dropped (only the leader's are used for sampling). The forward's
value here is the NCCL collectives inside the row-parallel layers
letting the leader's rank-0 forward make progress.
Pool side (tp/mod.rs):
- WorkerPool::load_dense_shard fans LoadDenseShard out to every worker,
builds rank 0's shard on the leader via spawn_blocking with a fresh
SendComm wrapper at the move boundary (Comm is !Send at the type
level), collects per-rank LoadDenseShardOk. Returns the leader's
Arc<Mutex<TpQwen3ForCausalLM>>.
- WorkerPool::generate_step fans GenerateStep out, runs the leader's
rank-0 forward in spawn_blocking (the AllReduce CustomOps inside
row-parallel layers block until every worker issues the matching
collective), returns the leader's last-position logits Tensor.
- WorkerPool::clear_kv_cache + unload_model follow the same pattern.
NcclState refactor (tp/nccl_state.rs):
- comm field becomes Option<Arc<Comm>> (was Option<Comm>) so callers
can share a clone with TpQwen3ForCausalLM::load.
- new `comm()` accessor + `SendComm` wrapper for spawn_blocking moves.
- single allow(clippy::arc_with_non_send_sync) at the canonical
construction site (Comm is !Send by type but the runtime invariant
is enforced by SendComm + the pool's Mutex).
Harness side (candle.rs):
- LoadedHandle enum (Single | Tp) replaces the bare Arc<LoadedModel>
in the harness's registry. list_models / unload_model /
inference_endpoint walk the enum uniformly.
- TpLoadedModel holds the pool + leader_model + tokenizer + devices.
- load_model dispatches on `spec.tensor_parallel > 1` to a new
cuda-gated load_tp path: resolve dense files via hf-hub, spawn the
pool, init_nccl, load_dense_shard.
- chat_completion branches on the handle variant. The TP path mirrors
run_inference: clear_kv_cache, prefill, sample, decode loop,
detokenize. Acquires the pool Mutex for the whole request.
- Streaming through TP is deferred to Stage 7c (returns Other(err)).
Script (script/validate-neuron.sh):
- 4th positional arg `tp_size` (default 1). When >1, switches to the
dense path (TP and GGUF are mutually exclusive, so the script bails) and adds
`tensor_parallel` + `devices` to the load payload. NEURON_DEVICES
env overrides the default 0..N-1 device list.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a one-shot diagnostic that exercises the lower half of the TP
stack — WorkerPool::spawn, init_nccl, nccl_sanity_check — in isolation
from model load and inference. Runs N-1 worker subprocesses (rank 0
stays in this process), joins them in an NCCL communicator on the
specified CUDA devices, all_reduces a sentinel 1u32 per rank, verifies
the observed_sum equals world_size on every rank, then shuts down.
Output is `status=ok` on stdout (plus key=value lines for tp_size and
cuda_devices) when every check passes, non-zero exit + tracing on
stderr otherwise. The smoke command is diagnostic-only and not exposed
through the daemon HTTP API.
script/tp-smoke.sh wraps it with an ssh invocation against a fleet
host (default beast — the only host with 2 GPUs) and asserts the
status line, mirroring the validate-neuron.sh ergonomics.
This is step 1 of the TP test plan. A failure here means TP cannot
work on the host at all; step 2 (Stage 7b-iv) wires real model load
and inference through the same WorkerPool primitives.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-up cuda-only fixes surfaced by `cargo build --features cuda`
inside the cuda-13.0 runner container:
1. `half::{bf16, f16}` was an undeclared dep. Added `half = "2.5"`
(matching candle-core's pinned major) under the cuda feature flag.
2. `dev.alloc::<T>(n)` already returns `candle_core::Result` (it calls
`.w()` internally on the cudarc error). Calling `.w()?` on top of
that needs `From<candle_core::Error> for CudaError`, which doesn't
exist — collapse to `?`. Removed the now-unused
`cuda_backend::WrapErr` import.
Verified by `cargo build -p neuron --features cuda` and
`cargo clippy -p neuron --all-targets --features cuda -- -D warnings`
inside `git.lair.cafe/gongfoo/runner-cuda-13.0` with the local
glibc/CUDA-13.0 math_functions.h noexcept patch. CPU clippy/tests stay
green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 7b-iii (1/2) introduced AllReduce with `s.device()` and
`s.dtype()` calls on `&CudaStorage`. Both come from the
`candle_core::backend::BackendStorage` trait, which wasn't imported —
fine on CPU builds (the cuda_fwd block was cfg-gated out) but the
prerelease cuda build hit E0599.
Also drop the unused `cudarc::driver::DeviceSlice` import inside
cuda_fwd — `CudaSlice::len()` is an inherent method on cudarc 0.19,
not a trait method.
Caught by run 2894 (build-neuron-{blackwell,ampere}); CPU clippy +
tests stay green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors candle_transformers::models::qwen3 structurally with column-
parallel q/k/v + gate/up projections, row-parallel o + down projections,
and replicated embedding/norms/lm_head. Per-rank head counts come from
dividing num_attention_heads / num_key_value_heads by world_size at load
time; intermediate_size split likewise. Load bails on any non-divisible
shape — the safetensors slice would lose data otherwise.
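The per-rank arithmetic and the divisibility bail can be illustrated with a small std-only sketch. `RankDims`, `split_evenly` and `per_rank_dims` are hypothetical names, and the error text is invented; only the rule (divide by world_size, refuse on a remainder) comes from the description above.

```rust
#[derive(Debug, PartialEq)]
struct RankDims {
    heads: usize,
    kv_heads: usize,
    intermediate: usize,
}

/// Divides `value` by `world_size`, refusing any remainder.
fn split_evenly(name: &str, value: usize, world_size: usize) -> Result<usize, String> {
    if world_size == 0 || value % world_size != 0 {
        return Err(format!(
            "{name}={value} is not divisible by world_size={world_size}"
        ));
    }
    Ok(value / world_size)
}

/// Hypothetical helper computing the rank-local shapes at load time.
fn per_rank_dims(
    num_attention_heads: usize,
    num_key_value_heads: usize,
    intermediate_size: usize,
    world_size: usize,
) -> Result<RankDims, String> {
    Ok(RankDims {
        heads: split_evenly("num_attention_heads", num_attention_heads, world_size)?,
        kv_heads: split_evenly("num_key_value_heads", num_key_value_heads, world_size)?,
        intermediate: split_evenly("intermediate_size", intermediate_size, world_size)?,
    })
}
```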
KV cache holds the rank-local slice since K/V come out of column-parallel
projections; no cache resharding across ranks. Causal mask is computed
on rank 0 shape and broadcasts over the head dim so per-rank H differs
without rework.
Replicated tensors (embedding, all RmsNorms, untied lm_head) load via
vb.get(shape, name), which uses the default Shard { world_size: 1 } and
falls through to the unsharded backend path on ShardedSafeTensors.
The cuda / non-cuda load splits track the existing tp_linear pattern:
RowParallelLinear takes an Arc<Comm> only under cuda, and the higher-
level composers (TpQwen3MLP, TpQwen3Attention, TpDecoderLayer,
TpQwen3Model, TpQwen3ForCausalLM) thread it through accordingly.
7b-iv wires RPC + dispatch in CandleHarness::load_model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports the canonical
candle-examples/examples/llama_multiprocess/model.rs pattern into
the harness. Two new files, one deletion:
- harness/tp/all_reduce.rs — AllReduce wraps Arc<cudarc::nccl::Comm>
and implements candle's CustomOp1 trait. cuda_fwd extracts the
rank's CudaSlice<dtype> from a CudaStorage, asserts the input is
contiguous (a strided activation hitting all_reduce is almost
always a model construction bug), allocates an output CudaSlice
on the same device, calls Comm::all_reduce(Sum), and wraps the
result back as a CudaStorage. Handles BF16, F16, F32. NcclError
surfaces via {e:?} (no Display impl in cudarc 0.19.x). Send/Sync
hand-impl'd with the same NCCL-thread-safety caveat candle's
example documents.
- harness/tp/tp_linear.rs — ColumnParallelLinear and
RowParallelLinear, both built on candle's ShardedVarBuilder +
Shard hints. `vb.get_with_hints((), "weight", shard(dim, rank, ws))`
reads JUST the rank's slice from the safetensors view; no full-
tensor host materialisation. ColumnParallel.forward is a plain
local matmul (output is naturally sharded). RowParallel.forward =
local matmul + apply_op1_no_bwd(&self.all_reduce). On CPU /
world_size == 1, the AllReduce is skipped and the partial output
is returned as-is. Both layers are no-bias — every Qwen3-family
target sets attention_bias=false; bias-aware sharding is a
future-model concern.
- Deletes harness/tp/sharded_linear.rs from 7b-ii. That commit's
hand-rolled "load full + narrow" approach was useful exploration
but candle's ShardedVarBuilder does the same work without
materialising the full tensor on host. The 5 unit tests there
verified the slicing math against an unsharded reference; that
math now lives inside candle and is covered by candle's own tests.
Next (7b-iii 2/2): TpQwen3Attention + TpQwen3MLP composing the
column/row pair, then a TpQwen3Model that runs the full forward.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caught by live validation against Qwen/Qwen3-1.7B on beast:
HTTP 500 "unexpected rank, expected: 1, got: 2 ([1, 151936])"
Candle's qwen3::ModelForCausalLM::forward returns shape [B, 1, V]
(no final squeeze) while quantized_qwen3::ModelWeights::forward
returns [B, V] (with squeeze(1) at the end). My match arms applied
a single squeeze(0) uniformly, which is correct for the quantized
[1, V] → [V] but only takes the dense [1, 1, V] to [1, V], which then
trips apply_repeat_penalty::to_vec1() expecting rank 1.
Dense match arms now strip both batch and seq dims:
model.forward(&input, offset)?.squeeze(0)?.squeeze(0)?
Also fixes validate-neuron.sh's `${3:-Q4_K_M}` → `${3-Q4_K_M}`
(no colon) so passing an explicit empty third arg now drives the
dense path instead of falling back to Q4_K_M.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cuda-feature-only build errors only the CI runner catches:
1. cudarc::nccl::NcclError doesn't impl Display in 0.19.x, so the
`format!("...: {e}")` map_err calls fail to compile when the cuda
feature actually wires them up. Switch every NcclError-typed `{e}`
in nccl_state.rs to `{e:?}` — surfaces variant + ncclResult code
in the same diagnostic shape just via Debug instead of Display.
2. cudarc::CudaStream::memcpy_stod / memcpy_dtov are deprecated in
0.19.7 in favour of clone_htod / clone_dtoh. The replacements
take/return the same types, so the swap is mechanical.
Dev box can't compile with --features cuda (no nvcc), so these only
surface in the build-prerelease CUDA matrix jobs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds harness/tp/sharded_linear.rs with ShardedLinear — a Megatron-LM
style sharded wrapper over candle_nn::Linear. Two constructors:
- load_column: splits the output dimension. Each rank holds rows
[r*out/N .. (r+1)*out/N] of the weight, plus its slice of the bias.
Forward = local matmul; output is naturally sharded; downstream
consumer either accepts the shard (next layer is column-parallel)
or merges via all-gather later.
- load_row: splits the input dimension. Each rank holds cols
[r*in/N .. (r+1)*in/N] of the weight; bias lives only on rank 0
so the post-all_reduce sum carries it exactly once. Forward
produces a partial output that the caller reduces via NCCL.
Both constructors bail with a clear error when divisibility doesn't
hold — the precondition mistral.rs's first qwen3-next-tp commit
made explicit. The path included in the error is the VarBuilder
prefix, so the operator sees exactly which projection failed
("column-parallel 'model.layers.0.self_attn.q_proj': out_features=...").
5 unit tests on CPU verify the math against an unsharded reference:
- column shard produces the expected slice of the full matmul
- row partials sum to the unsharded result
- row bias appears only on rank 0
- divisibility violations bail (column + row)
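The slicing math those tests check can be reproduced without candle. This is a sketch on plain `Vec<f32>` matrices, with the weight stored as out_features rows of in_features columns; the function names are hypothetical and the in-process sum stands in for the NCCL all_reduce.

```rust
/// y = W · x for a weight stored as `out_features` rows.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Column-parallel: rank r keeps rows [r*out/N .. (r+1)*out/N].
fn column_shard(w: &[Vec<f32>], rank: usize, world: usize) -> Result<Vec<Vec<f32>>, String> {
    let out = w.len();
    if world == 0 || out % world != 0 {
        return Err(format!("column-parallel: out_features={out} not divisible by {world}"));
    }
    let per = out / world;
    Ok(w[rank * per..(rank + 1) * per].to_vec())
}

/// Row-parallel: rank r keeps cols [r*in/N .. (r+1)*in/N] and the matching
/// slice of x. The bias is added on rank 0 only, so the reduced sum
/// carries it exactly once.
fn row_partial(
    w: &[Vec<f32>],
    x: &[f32],
    bias: &[f32],
    rank: usize,
    world: usize,
) -> Result<Vec<f32>, String> {
    let in_features = x.len();
    if world == 0 || in_features % world != 0 {
        return Err(format!("row-parallel: in_features={in_features} not divisible by {world}"));
    }
    let per = in_features / world;
    let cols = rank * per..(rank + 1) * per;
    let mut y: Vec<f32> = w
        .iter()
        .map(|row| {
            row[cols.clone()]
                .iter()
                .zip(&x[cols.clone()])
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect();
    if rank == 0 {
        for (y_i, b) in y.iter_mut().zip(bias) {
            *y_i += b;
        }
    }
    Ok(y)
}
```

Concatenating the column shards' outputs, or summing the row partials element-wise, should reproduce the unsharded result.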
forward_with_comm() is stubbed for row-parallel (CUDA-only) — wiring
the actual cudarc::nccl all_reduce against candle's Tensor lands in
7b-iii alongside the model assembly, where the model holds the Comm
in scope. ColumnParallel's forward_with_comm just delegates to the
local matmul (no collective needed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the bf16/fp16 safetensors path alongside the existing GGUF
quantized one. The harness now dispatches by ModelSpec.quant:
- Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged).
- None → safetensors dense (new).
The dense path uses candle-transformers::models::qwen3::ModelForCausalLM
verbatim, fed via VarBuilder::from_mmaped_safetensors over the files
listed in `model.safetensors.index.json` (sharded layout) or the
single `model.safetensors` fallback. dtype is bf16 to match the
canonical Qwen3 HF distribution dtype. tokenizer.json is fetched from
the same repo (no -GGUF suffix to strip).
ModelArch gains a Qwen3Dense variant; the forward signature mirrors
QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position
logits), so run_inference / run_inference_streaming just add a parallel
match arm — no shape changes downstream.
This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on:
because the source is dense safetensors that can be byte-sliced per
rank, the TP work avoids the GGUF super-block alignment problem
entirely. Vanilla GGUF inference keeps working unchanged.
validate-neuron.sh learns the dense path: pass an empty third arg
(quant) and the script omits the `quant` field from the load
payload, triggering the dense dispatch. Example:
script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B ''
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires cudarc::nccl into the TP worker lifecycle introduced in 7a-i.
With --features cuda the leader and its workers now establish a live
NCCL communicator end-to-end; without the feature the same code paths
return Error{kind="cuda_feature_not_enabled"} so a misconfigured
build is obvious instead of being a silent no-op.
NCCL state machine (harness/tp/nccl_state.rs) is shared between the
worker process and the leader's pool:
- generate_comm_id_hex() mints an Id::new() on the leader.
- NcclState::init parses 256 hex chars → [c_char; 128] → Id::uninit,
opens a CudaContext on the configured device, calls Comm::from_rank
with the supplied (rank, world_size, id). NCCL blocks until every
rank has joined.
- NcclState::sanity_check runs one all_reduce(1u32, Sum); the leader
asserts every rank reports observed_sum == world_size.
- NCCL handles serialised under Mutex; unsafe impl Send/Sync gates
the Comm across spawn_blocking boundaries (NCCL is move-safe; only
concurrent op issuance is unsafe).
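The 256-hex-char wire encoding of the 128-byte communicator id can be sketched as follows. The function names are hypothetical, and `u8` is used here where the commit mentions `c_char`; constructing the actual NCCL id from the bytes is out of scope.

```rust
const ID_BYTES: usize = 128;

/// Encodes the id as 256 lowercase hex characters.
fn encode_id_hex(id: &[u8; ID_BYTES]) -> String {
    id.iter().map(|b| format!("{b:02x}")).collect()
}

/// Parses exactly 256 hex characters back into 128 bytes.
fn decode_id_hex(hex: &str) -> Result<[u8; ID_BYTES], String> {
    if hex.len() != ID_BYTES * 2 {
        return Err(format!(
            "expected {} hex chars, got {}",
            ID_BYTES * 2,
            hex.len()
        ));
    }
    // Reject anything that is not a hex digit up front; this also
    // guarantees the two-byte slices below fall on char boundaries.
    if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err("comm id contains non-hex characters".to_string());
    }
    let mut out = [0u8; ID_BYTES];
    for (i, slot) in out.iter_mut().enumerate() {
        let pair = &hex[2 * i..2 * i + 2];
        *slot = u8::from_str_radix(pair, 16)
            .map_err(|e| format!("bad hex at byte {i}: {e}"))?;
    }
    Ok(out)
}
```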
WorkerPool::init_nccl orchestrates the rendezvous:
1. Write Init { comm_id } to every worker's stdin (no await yet).
2. Leader rank 0 calls its own Comm::from_rank in spawn_blocking,
concurrently with workers.
3. NCCL handshake completes for all ranks simultaneously.
4. Leader collects InitOk responses.
WorkerPool::nccl_sanity_check follows the same pattern over
all_reduce, validating world_size == observed_sum on every rank.
Worker.send_only / Worker.recv_only split out from the previous
monolithic Worker.request so the leader can interleave its own NCCL
work with the worker calls — required because NCCL blocks during
init.
Tests:
- 4 hex roundtrip unit tests for the wire encoding.
- The 7a-i "not implemented" expectation now reads
"cuda_feature_not_enabled" on the local dev box (no CUDA), or
accepts InitOk on a cuda-built test binary.
- New cuda-integration test in tp_worker_lifecycle_cuda.rs covers
the real init + sanity round-trip; gated on the cuda-integration
feature so default CI doesn't try to NCCL.
Verifiable on beast (2× RTX 5090):
cargo test -p neuron --features cuda-integration \
--test tp_worker_lifecycle_cuda
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Leader → worker process plumbing for tensor parallelism. The neuron
binary picks up two modes: default (the existing daemon, axum + HTTP)
and `--worker` (a bare RPC loop driven over stdin/stdout). The leader
spawns one worker per non-zero NCCL rank via tokio::process::Command
on the same binary path (production: /proc/self/exe; tests:
env!("CARGO_BIN_EXE_neuron")) and talks to each over newline-
delimited JSON.
Protocol (harness/tp/rpc.rs) is serde-tagged from the start —
WorkerRequest::{Ping, Init, NcclSanityCheck, Shutdown} and
WorkerResponse::{Pong, InitOk, NcclSanityResult, Bye, Error}, both
`#[serde(tag = "op", rename_all = "snake_case")]`. Adding ops in 7b/7c
is purely additive; unknown ops on the wire fail to parse (verified
in unit tests).
7a-i scope:
- WorkerPool::spawn(binary, world_size, devices) forks ranks 1..N as
subprocesses, captures stdin/stdout, kills on drop.
- ping_all() round-trips a Ping to every worker and validates the
returned rank.
- shutdown() sends Shutdown to each worker, awaits Bye, reaps.
- Worker mode: parse Ping/Shutdown, return Pong/Bye; Init and
NcclSanityCheck return Error{kind="not_implemented_7a_i"} so a 7a-ii
binary speaking the same wire is a drop-in replacement (the kind
field signals "real NCCL lands in the next commit").
- CandleHarness::load_model refuses tensor_parallel > 1 with a clear
message until 7b is in.
Three integration tests in tests/tp_worker_lifecycle.rs cover spawn/
ping/shutdown for 2- and 3-worker pools, plus the
not_implemented_7a_i contract test for Init. Seven rpc serde unit
tests assert the wire shape (op tags, field names, unknown-op
rejection). All pass on the dev host; no CUDA required.
Stage 7a-ii (next): the real NCCL Comm::from_rank wiring behind the
existing Init/NcclSanityCheck op surface, CUDA-gated. Verifiable on
beast's 2×5090.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two followups from the live single-GPU validation pass.
1. deploy.sh now ensures libcudnn.so.9 is available on each neuron
host before installing/upgrading the package. Probes ldconfig first
so hosts with a manual (tar/runfile) cuDNN install are untouched,
then adds NVIDIA's RHEL9 CUDA repo (the Fedora 43 CUDA repo doesn't
ship cuDNN; only the RHEL9 one does) and installs libcudnn9-cuda-13.
benjy hit "cannot open shared object file: libcudnn.so.9" during
validation; this prevents that recurring.
2. candle.rs applies a 1.1 repetition penalty over the last 64
generated tokens before sampling, in both the non-streaming
chat_completion path and the streaming chat_completion_stream
path. Without it small Q4_K_M models degenerate into "Wait, no,
no..." loops once they hit a confident-but-wrong path; with it
sampling stays coherent. Defaults match mistral.rs and llama.cpp;
exposing the value via the OpenAI request (frequency/presence
penalty mapping) is Stage 8 territory.
Both paths route through a new sample_with_penalty() helper so future
sampling tweaks land in one place.
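For illustration, a windowed repetition penalty in the common "divide positive logits, multiply negative ones" form. This is a std-only sketch with a hypothetical signature; it is not the body of `sample_with_penalty()`, and the assumption that each token in the window is penalised once is stated in the code.

```rust
use std::collections::HashSet;

/// Penalises every token id seen in the last `last_n` generated tokens.
/// With the values from the commit message this would be called with
/// `penalty = 1.1` and `last_n = 64`.
fn apply_repeat_penalty(logits: &mut [f32], penalty: f32, generated: &[u32], last_n: usize) {
    let start = generated.len().saturating_sub(last_n);
    let mut seen = HashSet::new();
    for &token in &generated[start..] {
        // Assumption: a token repeated inside the window is penalised once.
        if !seen.insert(token) {
            continue;
        }
        if let Some(logit) = logits.get_mut(token as usize) {
            // Push the logit towards "less likely" whatever its sign.
            if *logit >= 0.0 {
                *logit /= penalty;
            } else {
                *logit *= penalty;
            }
        }
    }
}
```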
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dnf5's `dnf install <pkg>` is a no-op when the package is already
installed at ANY version — it does NOT auto-upgrade to the latest
available. The deploy script's install branch was therefore silently
leaving hosts on older builds even though needs_update correctly
reported an upgrade was available.
Add an is_installed() probe and an install_or_upgrade() helper that
picks the right verb: `dnf install` when fresh, `dnf upgrade` when
stale. Captured combined-stream output is exposed via __DNF_OUTPUT__
for the existing failure-diagnostic path.
Verified end-to-end against the live fleet: hanzalova/beast/benjy/
quadbrat all upgraded cleanly from prior prerelease NVRs to
0.1.16-0.1.20260519134302.git1866b99.fc43, validation script returned
"Paris" from all three neurons.
Followup (not in this commit): all hosts running helexa-neuron-*
need libcudnn.so.9 available at runtime. Currently:
- quadbrat: libcudnn9-cuda-13 RPM (rhel9 CUDA repo)
- beast: /usr/lib64/libcudnn.so.9 (manual install)
- benjy: needed rhel9 CUDA repo added + libcudnn9-cuda-13 installed
as part of this validation pass.
The spec currently excludes cuDNN from auto-detected deps. Should
add a Recommends:libcudnn9-cuda-13 (soft) and ensure the rhel9 CUDA
repo is configured on each neuron host, similar to how ensure_lair_repo
handles the unstable channel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four real bugs caught while exercising the script end-to-end against
the live quadbrat node:
1. say() printed status to stdout. Inside run_probe(), the
"POST /v1/chat/completions (probe: ...)" line was being captured
by `raw=$(run_probe)` along with the JSON body, so jq saw
"[host] POST..." as the first line and choked at column 29 with
"Invalid numeric literal" (it tried to parse the `[` as the start
of a JSON array). Redirect say() to stderr so command
substitutions capture only the intended return value.
2. The pretty-print step `echo "${raw}" | yq -r '.'` re-emitted the
JSON as YAML, which fails on response content that looks like YAML
markers (chatcmpl ids that parse as aliases, escaped quotes inside
<think>...</think> blocks). Drop the pretty-print; just echo the
raw JSON.
3. JSON response parsing now uses jq (always JSON) instead of yq
(parses input as YAML by default). yq remains in use only for the
genuinely-YAML asset/manifest.yml elsewhere.
4. max_tokens bumped 32 → 256. Qwen3 prepends a <think>...</think>
reasoning block before its final answer when the chat template
enables thinking mode, and that eats most of a small budget — the
"Paris" answer was being truncated mid-thought. 256 leaves enough
room for both.
Verified pipeline end-to-end on quadbrat (RTX 3060, helexa-neuron-ampere
git602e8e1): /health OK → /models/load (unsloth/Qwen3-0.6B-GGUF Q4_K_M)
→ /v1/chat/completions → response content contains "Paris".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CI hygiene fixes uncovered while validating against the live fleet.
1. Same-day prerelease packages were being ordered by RPM-vercmp's
alpha-vs-digit precedence on the git SHA fragment, not by commit
chronology. With release stamps like "0.1.${YYYYMMDD}git${SHA}",
two commits on the same day produce the same numeric prefix and
rpmvercmp falls back to comparing the alphanumeric SHA suffixes,
where digit-leading SHAs are ranked above alpha-leading ones —
completely unrelated to which commit landed first. Verified with
rpmdev-vercmp:
gitabc1234 < gitdef5678 (old scheme — purely lexicographic)
Bumping the timestamp prefix to second-precision (%Y%m%d%H%M%S)
makes the numeric prefix strictly monotonic for any chronologically-
ordered commits, so the SHA fragment becomes a debug identifier
only — never participates in version ordering.
2. ci.yml and build-prerelease.yml both target the `rust` runner label
and both auto-trigger on push to main. The act-based runner reuses
/root/.cache/act/<hash>/hostexecutor/ across concurrent jobs, so
ci.yml's clippy and build-prerelease.yml's build-cortex were racing
each other's checkout/cleanup steps and corrupting in-flight
compile artifacts. Real fix is in gongfoo; workflow-level workaround
is a shared concurrency group with cancel-in-progress=false so the
two workflows queue sequentially on the same ref.
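The ordering problem in item 1 can be reproduced with a simplified
sketch of segment-wise version comparison. This is not rpm's
implementation: it ignores `~` and `^` handling, and the names
`segments` and `vercmp_sketch` plus the sample stamps are made up for
illustration.

```rust
use std::cmp::Ordering;

/// Split a version string into maximal runs of ASCII digits or ASCII
/// letters, skipping every other character (a simplification).
fn segments(s: &str) -> Vec<&str> {
    let bytes = s.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        if !bytes[i].is_ascii_alphanumeric() {
            i += 1;
            continue;
        }
        let digit = bytes[i].is_ascii_digit();
        let start = i;
        while i < bytes.len()
            && bytes[i].is_ascii_alphanumeric()
            && bytes[i].is_ascii_digit() == digit
        {
            i += 1;
        }
        out.push(&s[start..i]);
    }
    out
}

/// Hypothetical helper: compare two stamps segment by segment.
fn vercmp_sketch(a: &str, b: &str) -> Ordering {
    let (sa, sb) = (segments(a), segments(b));
    for (x, y) in sa.iter().zip(sb.iter()) {
        let xd = x.as_bytes()[0].is_ascii_digit();
        let yd = y.as_bytes()[0].is_ascii_digit();
        let ord = match (xd, yd) {
            // Numeric runs: drop leading zeros, longer wins, then digit order.
            (true, true) => {
                let (x, y) = (x.trim_start_matches('0'), y.trim_start_matches('0'));
                x.len().cmp(&y.len()).then_with(|| x.cmp(y))
            }
            // Alphabetic runs: plain byte-wise string comparison.
            (false, false) => x.cmp(y),
            // A numeric run outranks an alphabetic one.
            (true, false) => Ordering::Greater,
            (false, true) => Ordering::Less,
        };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    // All shared segments equal: the stamp with more segments is newer.
    sa.len().cmp(&sb.len())
}
```

With a day-precision stamp the SHA letters decide the result; with a
second-precision stamp the numeric prefix decides before the SHA is
ever reached.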
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GGUF-only HF repos (unsloth/Qwen3-*-GGUF, Qwen/Qwen3-*-GGUF) ship the
.gguf file but not tokenizer.json — the tokenizer data is embedded in
the GGUF metadata itself, and the standalone tokenizer.json lives in
the base non-GGUF repo (unsloth/Qwen3-0.6B, Qwen/Qwen3-0.6B, etc.).
Live validation against quadbrat hit:
HTTP 400 fetch tokenizer.json from unsloth/Qwen3-0.6B-GGUF:
HTTP status client error (404 Not Found)
resolve_files now derives the tokenizer repo by stripping a `-GGUF`
or `-gguf` suffix from the model_id; non-GGUF ids fall through to
fetching from the same repo. The error message includes the
attempted tokenizer repo id so the next failure (e.g. base repo
doesn't exist) is unambiguous.
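A minimal sketch of the suffix-stripping rule, assuming the derivation
is a pure string operation on the model id. The function name
`tokenizer_repo` is hypothetical and is not taken from resolve_files.

```rust
/// Hypothetical helper: pick the repo expected to hold tokenizer.json.
/// Ids ending in `-GGUF` or `-gguf` map to the base repo; any other id
/// is returned unchanged.
fn tokenizer_repo(model_id: &str) -> &str {
    model_id
        .strip_suffix("-GGUF")
        .or_else(|| model_id.strip_suffix("-gguf"))
        .unwrap_or(model_id)
}
```

For example, `tokenizer_repo("unsloth/Qwen3-0.6B-GGUF")` yields
`unsloth/Qwen3-0.6B`, while `Qwen/Qwen3-0.6B` passes through as-is.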
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The build-prerelease workflow was workflow_dispatch-only, which meant
every commit needed a manual run dispatch before any host could
upgrade. That left rolling fixes (e.g. f9f5fa4's StateDirectory fix)
sitting on main with no published RPM behind them, so deploy.sh
silently fell back to an older prerelease.
Add 'push: branches: [main]' alongside the existing workflow_dispatch
trigger; the unstable channel now tracks head automatically. The
concurrency group is keyed on ${{ github.ref }} with
cancel-in-progress so successive rapid-fire pushes coalesce to one
build (latest wins) rather than queueing every intermediate commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The HTTP handler now emits a tracing::warn on load_model failures with
the expanded anyhow chain (format!("{e:#}")) before returning the 400.
journalctl -u neuron will surface the underlying hf-hub /
materialisation error without needing to capture the curl response
body separately.
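To illustrate what expanding the chain buys, here is a std-only
approximation of the alternate Display output: the outer message
followed by each underlying cause, joined with ": ". The `Context` type
and `render_chain` function are hypothetical stand-ins, not anyhow's
own types.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error wrapper: a message plus an optional cause.
#[derive(Debug)]
struct Context {
    msg: String,
    source: Option<Box<dyn Error + 'static>>,
}

impl fmt::Display for Context {
    // Plain Display shows only the outermost message.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.msg)
    }
}

impl Error for Context {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.source.as_deref()
    }
}

/// Walk `source()` links and join every message with ": ".
fn render_chain(err: &(dyn Error + 'static)) -> String {
    let mut out = err.to_string();
    let mut cur = err.source();
    while let Some(e) = cur {
        out.push_str(": ");
        out.push_str(&e.to_string());
        cur = e.source();
    }
    out
}
```

Logging the rendered chain rather than the outermost message is what
lets the journal show the root cause.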
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes uncovered by the live validation against beast/benjy/quadbrat:
1. api.rs swallowed everything beyond the outermost anyhow context.
The validation script reported '{"error":"fetch GGUF ...gguf"}' but
the actual underlying hf-hub failure (cache dir creation, network,
auth, etc.) was hidden. Switching every error response to
format!("{e:#}") expands the full cause chain via anyhow's
alternate Display format.
2. The neuron systemd unit declared the service user but never ensured
/var/lib/neuron (its $HOME) existed. hf-hub defaults its cache to
~/.cache/huggingface/hub — when $HOME is absent the cache dir
creation fails and the download aborts. Adding `StateDirectory=neuron`
makes systemd create + chown that directory at activation; no spec
change needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two reasons the previous run silently bailed after POST /models/load:
1. Default model was Qwen/Qwen3-0.6B-GGUF (official). That repo ships
ONLY Q8_0 — no Q4_K_M, no Q4_0, nothing else. The GGUF filename
matcher in CandleHarness::resolve_files returned "no GGUF file
matching quant Q4_K_M" and the load endpoint returned an error,
but the script used `curl --silent --fail` and swallowed it.
2. /models/load is synchronous (it awaits the full HF download + GGUF
parse). curl --max-time 30 was way too short for a 400 MB fresh
download.
Fixes:
- Default model is now unsloth/Qwen3-0.6B-GGUF, which mirrors the
full Q-spectrum (Q2_K through Q8_0 plus BF16) so Q4_K_M actually
exists.
- trigger_load / run_probe now use --write-out to capture HTTP code
and emit the response body on non-2xx, so failures surface a real
diagnostic instead of an opaque set -e abort.
- LOAD_TIMEOUT bumped to 600s; INFER_TIMEOUT to 120s.
- Probe payload built via `yq -n` so JSON quoting is reliable
regardless of the prompt text.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Loads a small public Qwen3 GGUF on a target neuron host, fires a
deterministic reasoning probe ("What is the capital of France?"),
and asserts the response contains 'Paris'. Used to validate the
candle harness on a real GPU host before the Stage 7 TP work begins,
and as a regression check after future neuron builds.
Defaults to beast.hanzalova.internal + Qwen/Qwen3-1.7B-GGUF + Q4_K_M;
all three are positional args so the same script tests any node /
model combination. Polls /models after triggering the load since
/models/load returns once the materialisation is *queued*, not
finished.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The act runner container has no sudo binary; the runner user already
runs as root inside the container. Existing steps (rpmbuild, gpg, etc)
already invoke privileged commands directly without sudo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The currently-published runner-cuda-13.0 image (gongfoo) is missing
rust/cargo despite inheriting from runner-rust. Build-neuron fails
immediately with 'cargo: command not found' even though build-cortex
on the bare 'rust' runner builds fine.
Add a defensive `dnf install rust cargo clippy` step at the top of
build-neuron. Idempotent — on a properly-built runner image this is
a fast no-op; on the current broken image it installs the toolchain
in a few seconds. The runner image itself should be rebuilt in
gongfoo so this step becomes redundant.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the string compare of 'git describe --tags' vs the binary's
self-reported --version (which lies about prereleases — every
0.1.16-* RPM reports just "0.1.16") with the dnf-native question of
"is the installed package current against what the repo offers".
Mechanism:
- installed_nvr(): rpm -q --qf '%{version}-%{release}' for the
resident package, falling back to "(not installed)". Capturing rpm's
output through a variable keeps its "package X is not installed"
stdout message out of the result on failure.
- needs_update(): probes rpm -q first (treats absent as "needs work"),
then asks dnf check-update --refresh -q. Other dnf failures collapse
into "needs update" so the subsequent install surfaces a real error
rather than this check swallowing one silently.
- ensure_lair_repo(): probes for /etc/yum.repos.d/lair-cafe-unstable.repo
and adds it with `dnf config-manager addrepo` when missing. The
upstream .repo file ships enabled=0 (unstable channel doesn't
auto-engage on fetch), so we then run `dnf config-manager setopt
lair-cafe-unstable.enabled=1` every run — cheap, idempotent.
- Cortex and neuron install branches now guard `systemctl stop` with
`[ ! -f /usr/lib/systemd/system/...service ] || sudo systemctl stop`
so fresh installs (no unit file yet) don't short-circuit the install
step under set -e.
- dnf output is captured into a variable and only printed (with a
[host] prefix per line) on failure, so success stays quiet and
failures show the actual diagnostic instead of being eaten by
&> /dev/null.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 6 of the candle-native pivot. Adds first-class deactivation:
neuron now drains in-flight requests on SIGTERM (systemd stop) or
SIGINT (Ctrl-C), then unloads every loaded model before the process
exits — releasing CUDA contexts and VRAM cleanly rather than leaving
the OS to reclaim them.
Mechanism:
- startup::shutdown_signal() resolves on either ctrl_c() or a
SIGTERM listener.
- axum::serve(...).with_graceful_shutdown(shutdown_signal()) stops
accepting new connections, lets active requests finish, then
returns control to main.
- startup::unload_all_models(&registry) iterates list_all_models()
and calls unload per entry. Per-model failures are logged warnings;
cleanup continues. Empty registry is a fast no-op.
- main holds an Arc<NeuronState> reference past axum's lifetime so
the registry is still reachable for the unload sweep.
data/neuron.service:
- TimeoutStopSec=120s — generous bound for big-model unloads before
systemd escalates to SIGKILL.
- KillSignal=SIGTERM — explicit, matches the handler.
Two non-gated tests cover the empty-registry no-op and the no-models-
loaded path. Real load-then-unload-on-shutdown is exercised by the
cuda-integration test from Stage 2 (which calls unload_model directly)
and observable on a real GPU host by stopping the service and
watching nvidia-smi.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 5 of the candle-native pivot. Adds first-class support for
auto-loading a configured set of models when the neuron service
activates.
Config:
- NeuronConfig.default_models: Vec<ModelSpec> (defaults to []).
- neuron.example.toml ships a commented [[default_models]] example.
Activation flow (crates/neuron/src/startup.rs::load_default_models):
- Sequential — VRAM contention makes parallel loads risky.
- Per-entry timing logged at info level on success.
- Failures logged as warnings; the next entry is still attempted.
- An empty list short-circuits without log noise.
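The activation flow above can be sketched as a sequential loop that
records failures and keeps going. This is an illustration only: the
function name `load_all_sketch`, the closure-based loader, and the
returned list of failed ids are assumptions, not the signature used in
startup.rs.

```rust
use std::time::Instant;

/// Hypothetical sketch: load each model in order and return the ids
/// that failed. An empty slice does nothing and logs nothing.
fn load_all_sketch<F>(models: &[&str], mut load: F) -> Vec<String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut failed = Vec::new();
    for &id in models {
        let started = Instant::now();
        match load(id) {
            // Success: report per-entry timing.
            Ok(()) => eprintln!("loaded {id} in {:?}", started.elapsed()),
            // Failure: warn, remember the id, move on to the next entry.
            Err(e) => {
                eprintln!("warning: {id} failed to load: {e}");
                failed.push(id.to_string());
            }
        }
    }
    failed
}
```

Passing the loader as a closure keeps the loop testable without a GPU
or network access, in the same spirit as the unknown-harness trick used
by the tests.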
Called from main.rs after the registry is built and before the axum
listener binds, so /models reflects the loaded state from the very
first request.
data/neuron.service gains TimeoutStartSec=1800s. With activation
blocked on potentially slow first-time HF downloads + GGUF
materialisation, systemd's default 90s would kill larger model loads
mid-flight.
Two non-gated tests in tests/activation.rs cover the
continues-past-failure and empty-list paths using a synthetically
unknown harness name to fail loads fast without touching the network.
The cuda-integration test from earlier stages still exercises the
real load/unload lifecycle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 4 of the candle-native pivot. /v1/chat/completions now switches
to text/event-stream when the request sets stream: true, emitting one
chat.completion.chunk per generated token followed by the OpenAI
[DONE] terminator.
Pipeline:
- chat_completion_stream creates a bounded mpsc::channel::<ChatCompletionChunk>(32),
sends the leading role chunk, then spawns a blocking task that
acquires the per-model arch lock and runs the streaming generation
loop.
- run_inference_streaming tracks a cumulative decoded prefix so each
chunk's delta.content is the substring added since the last chunk —
safe across BPE byte-fallback boundaries that would otherwise split
multi-byte UTF-8 chars.
- The blocking task aborts cleanly if blocking_send fails (client
disconnected), so generation stops when the SSE consumer hangs up.
- Final chunk carries finish_reason ("stop" on EOS, "length" on
max_tokens). The handler appends data: [DONE] after the channel
closes.
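The cumulative-prefix idea can be sketched with raw bytes standing in
for tokenizer output. This is a simplified model, not the
run_inference_streaming code: `DeltaTracker` is a hypothetical name,
and it assumes each token contributes a byte slice and that incomplete
sequences are eventually completed by later tokens.

```rust
/// Hypothetical tracker for how much decoded text was already sent.
struct DeltaTracker {
    /// Every byte generated so far.
    bytes: Vec<u8>,
    /// Length of the valid UTF-8 prefix that has already been emitted.
    emitted: usize,
}

impl DeltaTracker {
    fn new() -> Self {
        Self { bytes: Vec::new(), emitted: 0 }
    }

    /// Append one token's bytes and return only the newly completed text.
    /// A trailing partial character is held back until it is complete.
    fn push(&mut self, token_bytes: &[u8]) -> String {
        self.bytes.extend_from_slice(token_bytes);
        // Longest prefix of the whole buffer that is valid UTF-8.
        let valid = match std::str::from_utf8(&self.bytes) {
            Ok(s) => s.len(),
            Err(e) => e.valid_up_to(),
        };
        let delta = std::str::from_utf8(&self.bytes[self.emitted..valid])
            .expect("both bounds lie on character boundaries")
            .to_owned();
        self.emitted = valid;
        delta
    }
}
```

When a two-byte character arrives split across two tokens, the first
push returns an empty delta and the second returns the whole character,
so no chunk ever carries half a code point.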
The Stage 3 streaming 501 placeholder test is repurposed: with the
streaming path live, an unloaded model now hits the same 404 surface
as the non-streaming path (the model lookup happens first).
cortex-gateway's existing proxy is unchanged — it already forwards
SSE bytes verbatim from Phase 2 work, so the candle SSE format passes
through unmodified.
Neuron Cargo.toml gains futures + tokio-stream (both already in
workspace deps) for ReceiverStream and stream combinators.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The build-cortex and build-neuron jobs were running a copied-from-
mistralrs rustup install step. Both jobs use runner images that
already provide rust via dnf:
- runner-rust installs rust/cargo/clippy/rustfmt directly.
- runner-cuda-13.0 extends runner-rust.
Running 'rustup update stable' on top would install a parallel
rustup-managed toolchain and shadow the dnf one — confusing and
unnecessary. The existing ci.yml already trusts the dnf toolchain
without any install step, so match that behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a single source of truth for which hosts run cortex vs neuron
and which CUDA compute-capability flavour each neuron host needs:
cortex : hanzalova.internal
neurons :
beast → helexa-neuron-blackwell (2x RTX 5090, sm_120)
benjy → helexa-neuron-ada (RTX 4090, sm_89)
quadbrat → helexa-neuron-ampere (RTX 3060, sm_86)
script/deploy.sh (gitignored, local-only) is updated locally to read
hosts and flavours from this manifest and dnf install the correct
helexa-neuron-<flavour> package per host. Using
'dnf install --refresh --allowerasing' lets it swap out the previous
bare helexa-neuron RPM or a different flavour without manual
intervention; the spec Conflicts: clauses keep at most one flavour
resident.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds ampere (CUDA compute capability sm_86) to both the build-neuron
and package-neuron matrices, so helexa-neuron-ampere RPMs are built
and published alongside helexa-neuron-ada and helexa-neuron-blackwell.
The prerelease spec already lists ampere in its Conflicts: clause, so
no spec change is needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the candle deps were added, cargo builds run long enough that
the parallel fmt/clippy/test jobs (all on the `rust` runner label,
which appears to use act in host-executor mode) start racing each
other's intermediate temp files under
/root/.cache/act/<hash>/hostexecutor/target/debug/deps/
Concretely the test job hit:
error: No such file or directory at path
"target/debug/deps/.tmprlicL7"
Compiling unicode-ident
because another job's cargo invocation cleaned up the temp file
mid-compile. fmt and clippy happened to finish without their own
target races landing fatally, so only test failed visibly.
Set CARGO_TARGET_DIR=target-${{ github.job }} at the workflow level
so each job writes to its own target directory. sccache still backs
the actual rustc cache, so the rebuild penalty is just metadata, not
full recompiles.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous "Import signing key" step inlined ${{ secrets.RPM_SIGNING_KEY }}
and ${{ secrets.RPM_SIGNING_KEY_ID }} directly into the run: block.
Template expansion writes the literal secret value into the rendered
shell script, and Gitea logs the rendered script — Gitea's masker may
not reliably scrub multi-line keys, so values can leak.
Move both secrets into the step's env: block (the same pattern the
"Set up SSH" step already uses) and reference $VARs in the script.
The script body now contains only variable names; the secret values
live in the process environment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a manually-triggered workflow that builds CUDA-flavoured neuron
binaries and a CPU cortex binary, packages them as Fedora RPMs, signs
them, and rsyncs to the unstable channel at
https://rpm.lair.cafe/fedora/43/x86_64/unstable/. Mirrors the build
pipeline used by grenade/mistralrs-package.
Pipeline:
- prepare: derive {version,short_sha,commit_date} from the checkout;
the prerelease Release stamp "0.1.YYYYMMDDgitSHORTSHA" sorts below
the eventual "1" stable release.
- build-cortex: cargo build --release -p cortex-cli on a rust runner.
- build-neuron: matrix over ada (sm_89) and blackwell (sm_120) on
cuda-13.0 runners; cargo build with features "cuda cudnn flash-attn"
and CUDA_COMPUTE_CAP set per flavour.
- package-{cortex,neuron}: rpmbuild on the rpm runner against the new
prebuilt-binary specs in rpm/.
- publish: import signing key, sign RPMs, rsync to oolon, createrepo_c
--update, then regenerate packages.json for the UI.
New specs are prebuilt-binary variants — they consume the artifact
from the build job rather than running cargo at rpmbuild time. Each
helexa-neuron-{flavour} package Conflicts with the other flavours and
with helexa-neuron (the future source-build stable package) so one
flavour is installed at a time on a given host.
neuron crate gains cudnn and flash-attn feature flags forwarding to
the corresponding candle features, so the CI build command compiles
those kernels into the binary.
sccache is intentionally NOT used in the prerelease jobs — CUDA
compute cap isn't in its cache key, so flavours would mis-hit each
other. Each prerelease build is a clean cargo build.
Required Gitea secrets (already in place for cortex.spec / COPR
workflow):
- RPM_SIGNING_KEY, RPM_SIGNING_KEY_ID
- RSYNC_SSH_KEY
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 3 of the candle-native pivot. neuron now serves
POST /v1/chat/completions backed by candle's quantized_qwen3 forward
pass on a per-model serialised generation loop, returning the standard
OpenAI ChatCompletionResponse envelope.
Pipeline per request:
- Look up the LoadedModel by request.model (404 if absent).
- Apply the Qwen3 chat template across all messages.
- Tokenize, then spawn_blocking onto tokio's blocking pool to acquire
the per-model arch lock and run prefill + greedy/temperature/top-p
sampling via LogitsProcessor.
- Stop on <|im_end|>/<|endoftext|> EOS or max_tokens (finish_reason
"stop" vs "length").
- Decode with skip_special_tokens=true, build OpenAI response with
prompt/completion/total usage counts.
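The stop conditions in the pipeline can be sketched as a small loop.
This is illustrative only: `generate_sketch`, the `FinishReason` enum,
and the closure standing in for the model's sampling step are
assumptions, not the handler's real types.

```rust
#[derive(Debug, PartialEq)]
enum FinishReason {
    /// An end-of-sequence token was sampled.
    Stop,
    /// The max_tokens budget ran out first.
    Length,
}

/// Hypothetical generation loop: `next_token` sees the tokens produced
/// so far and returns the next one.
fn generate_sketch<F>(mut next_token: F, eos: &[u32], max_tokens: usize) -> (Vec<u32>, FinishReason)
where
    F: FnMut(&[u32]) -> u32,
{
    let mut out = Vec::new();
    while out.len() < max_tokens {
        let tok = next_token(&out);
        if eos.contains(&tok) {
            // The EOS token itself is not included in the output.
            return (out, FinishReason::Stop);
        }
        out.push(tok);
    }
    (out, FinishReason::Length)
}
```

A small max_tokens budget therefore ends with "length" even when the
model would have produced EOS a few tokens later.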
Supporting changes:
- HarnessRegistry now stores Arc<dyn Harness> and caches a typed
Arc<CandleHarness> so inference routes bypass dyn-Trait dispatch.
- LoadedModel.arch becomes Arc<Mutex<ModelArch>> so the lock guard
can be moved into spawn_blocking.
- NeuronState gains an Option<Arc<CandleHarness>> field for the new
inference route.
- Typed InferenceError lets the handler map ModelNotLoaded → 404 and
other failures → 500 without string-matching anyhow messages.
- stream=true returns 501 until Stage 4 wires up SSE.
- Two leftover mistral.rs string references in proxy.rs and cortex-cli
(missed during the Stage 1 sweep) are corrected here.
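The typed-error mapping can be sketched as a plain match. Only the
InferenceError name and the ModelNotLoaded variant come from this
change; the `Other` variant, its payloads, and the `status_code` helper
are hypothetical.

```rust
/// Sketch of a typed inference error. The variant payloads are assumed.
#[derive(Debug)]
enum InferenceError {
    ModelNotLoaded(String),
    Other(String),
}

/// Map an error to an HTTP status without inspecting message text.
fn status_code(err: &InferenceError) -> u16 {
    match err {
        InferenceError::ModelNotLoaded(_) => 404,
        InferenceError::Other(_) => 500,
    }
}
```

Because the match is exhaustive, adding a variant later forces a
decision about its status code at compile time.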
Three new default-feature tests cover the no-candle 503, model-not-
loaded 404, and stream=true 501 paths. The cuda-integration test from
Stage 2 still covers real load/unload; a streaming-feature gated test
exercising actual generation will arrive with Stage 4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 2 of the candle-native pivot. Fleshes out CandleHarness with a
LoadedModel registry keyed by model_id, hf-hub-backed GGUF download,
and Qwen3 quantized weight construction via candle-transformers'
quantized_qwen3 module. unload_model drops the entry; Drop on the
candle ModelWeights frees device memory.
Device selection prefers CUDA (gated behind the new `cuda` feature),
falling back to CPU when CUDA is unavailable so default builds work
on non-GPU hosts. The candle CUDA toolchain isn't pulled in unless
`--features cuda` is passed, keeping CI green on CPU runners.
Config gains a [harness.candle] block with an optional hf_cache path.
HarnessRegistry::from_configs now takes HarnessSettings so per-harness
config flows through.
A gated tests/candle_lifecycle.rs exercises real load → list → unload
→ list-empty when run with `--features cuda-integration` against a
host with HF network access. The default-feature test in tests/api.rs
covers the wrong-harness rejection path without needing the network.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 1 of the candle-native pivot. Replaces the external-process
harness model (mistralrs over HTTP, llamacpp placeholder) with an
in-process Harness trait whose sole implementation is candle. The
trait keeps its shape so future engines slot in additively, but
start/stop default to no-ops and HarnessConfig drops endpoint and
systemd_unit since no harness needs external supervision.
Behaviour is unchanged on the wire: load_model returns a "not
implemented yet (Stage 2)" error and list_models is empty. The
gateway-side proxy, poller, and router are untouched.
CLAUDE.md Phase 11 (llama.cpp) and Phase 12 (mistral.rs COPR) are
marked superseded; the staged plan lives in
~/.claude/plans/create-a-more-aggressive-calm-naur.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cortex: opens 31313/tcp (API) and 31314/tcp (metrics)
neuron: opens 13131/tcp
Installs to /usr/lib/firewalld/services/ so firewall-cmd
--add-service=cortex / --add-service=helexa-neuron works
out of the box.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous defaults collided with well-trodden infra services:
- cortex API 8000 — common dev-server default (Django, minio UI)
- cortex metrics 9100 — Prometheus node_exporter default
- neuron API 9090 — Cockpit default on Fedora, Prometheus self
Move to helexa-themed palindromic ports, all below Linux's
32768-60999 ephemeral range and not registered to any well-known
service:
- cortex API 31313
- cortex metrics 31314
- neuron API 13131
Updated places:
- cortex.example.toml, neuron.example.toml defaults
- default impls in cortex-core and neuron config
- cortex-cli --endpoint default for the status subcommand
- doc comments citing example URLs
- README.md and CLAUDE.md snippets
Consumers already on the old ports need a one-line edit in their
/etc/cortex/cortex.toml or /etc/neuron/neuron.toml to match;
firewall rules and prometheus scrape configs will also need
updating.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache round-trip (download + unpack) was consistently taking
around 6 minutes, noticeably longer than the ~3 minute cold build
it was meant to accelerate. Net-negative on CI time — remove it.
sccache with the S3 backend still provides dep-level caching at a
much lower overhead, so we keep the majority of the cache benefit
without paying the actions/cache tarball cost.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consolidates the previous helexa/cortex and helexa/helexa-neuron COPR
projects into one shared project. Hosts enable a single repo and get
access to both packages — cortex for gateway hosts and helexa-neuron
for GPU nodes. Reduces the "which copr do I enable on this host"
friction, and makes it clear the two packages are parts of the same
helexa project suite.
CI keeps two independent publish jobs (copr-cortex and copr-neuron)
running in parallel; they now both target helexa/helexa with their
respective SRPMs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fedora's official repos ship a package named `neuron` — the NEURON
neural-simulation environment from Yale (see
https://src.fedoraproject.org/rpms/neuron). Having our own `neuron`
in the helexa COPR caused dnf5 to silently no-op `dnf install neuron`
because of the name collision, even with the COPR repo enabled and
keys imported. The only workarounds were full NEVRA (`dnf install
neuron-0.1.12-1.fc43.x86_64`) or a local file install — neither
acceptable for end-users.
Rename the RPM package to `helexa-neuron`. Keep binary (/usr/bin/neuron),
systemd unit (neuron.service), system user (neuron), and config dir
(/etc/neuron) unchanged — those are project-local contexts where the
short name is unambiguous. Follows Fedora subpackage-style naming
except with a vendor prefix rather than a parent-package prefix,
because neuron is an independent package from cortex (installed on
different hosts) and neither depends on the other.
Changes:
- neuron.spec -> helexa-neuron.spec (git rename)
- Name: neuron -> helexa-neuron (with comment explaining why)
- CI: srpm-neuron job now builds helexa-neuron-VERSION.tar.gz with the
matching top-level dir prefix, publishes to helexa/helexa-neuron COPR
- CI: bump-version job references helexa-neuron.spec
- CLAUDE.md: install instructions updated
Old helexa/neuron COPR project can be deleted after the first
helexa/helexa-neuron build lands.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously the srpm-* jobs generated a fresh %changelog entry and
shipped it to COPR, but the version-stamped spec pushed back to main
by the bump-version job only updated the Version: line — not the
%changelog section. The result: SRPM and in-tree spec diverged and
a fresh clone of the repo showed a perpetually empty changelog.
Run the rpm-changelog action in bump-version too. Now the committed
specs track the SRPMs: each release leaves a dated %changelog entry
in main covering commits since the previous tag, visible in git log
and in the repo's spec browser.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diagnosing the persistent "Nothing to do" on v0.1.10 surfaced that
removing %attr(,,name) from %files wasn't enough. systemd-rpm-macros
ships its own rpm dep generator (/usr/lib/rpm/systemd.req) that parses
User=/Group= directives from every .service file the package ships
and emits Requires: user(NAME)/group(NAME) accordingly.
Rpmbuild log from v0.1.10 shows these Requires are still emitted even
after the %attr removal. Meanwhile the sysusers provides-generator
emits group(NAME) in both unversioned and versioned forms, but only
a versioned user(NAME) = <base64> when the u-line has GECOS/home/shell
fields. The asymmetry leaves Requires: user(NAME) unresolvable.
Add explicit Provides: user(NAME) back to both specs, with a comment
documenting the actual cause (systemd unit parsing, not file attrs)
so the next person touching these specs doesn't repeat the mistake.
Why monsoon didn't hit this: it creates its user in %pre via
groupadd/useradd (not sysusers.d), so no Provides are generated at
all — matching the Requires: user(monsoon) by luck of the rpm solver
treating unknown symbols as soft-fails for that path. Ours went through
the sysusers Provides code path and hit the asymmetry instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the local .gitea/scripts/generate-rpm-changelog.sh with the
shared composite action at https://git.lair.cafe/actions/rpm-changelog@v1.
Behaviour is identical — collect commits since the previous v* tag,
filter bump-version and merge noise, prepend a dated entry to the
spec — but the logic now lives in one place that other projects can
consume.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On every tag push, build a %changelog entry from the git log since
the previous v* tag and prepend it to each spec. Stops the initial
entry from drifting further and catches bogus-date / stale-version
warnings automatically since the generated date always matches the
day the CI runs.
The generator drops "chore: bump version" commits (bot-authored,
noisy in user-facing changelogs) and merge commits. Author defaults
to the gitea-actions identity but can be overridden via
CHANGELOG_AUTHOR env var if a human release is desired.
Requires fetch-depth: 0 on checkout so git describe can see prior
tags and git log can reach them.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
April 15 2026 was a Wednesday, not Tuesday. rpmbuild validates the
day-of-week against the date and warns on mismatch.
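The weekday can be double-checked with Sakamoto's day-of-week method.
The helper below is a standalone sketch and is not part of rpmbuild or
of this repo.

```rust
/// Day of week for a Gregorian date: 0 = Sunday through 6 = Saturday.
/// `m` is 1-based (1 = January).
fn day_of_week(mut y: i32, m: u32, d: u32) -> u32 {
    const T: [i32; 12] = [0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4];
    // January and February count as part of the previous year.
    if m < 3 {
        y -= 1;
    }
    let sum = y + y / 4 - y / 100 + y / 400 + T[(m - 1) as usize] + d as i32;
    sum.rem_euclid(7) as u32
}
```

For 2026-04-15 this gives 3, which is Wednesday.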
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Using %attr(,,cortex) / %attr(,,neuron) on config files caused rpm's
auto-dep-generator to emit Requires: user(name) and group(name) on
each package. When those Requires couldn't be resolved — whether due
to sysusers Provides mismatches, missing GPG keys, or dnf5 cache
state — dnf5 silently filtered the package out of the candidate set
and reported "Nothing to do" rather than an unsatisfied-dep error.
Adopt the pattern that already works reliably across our infra
(grenade/monsoon): ship config files as default root:root with 0644
perms, don't declare user/group ownership in the rpm file list.
systemd-sysusers still creates the service user via the shipped
sysusers.d file; the service drops to that user at runtime via the
User= directive in the unit.
This removes the user(cortex)/user(neuron) Requires entirely, which
is the root cause of the dnf5 filtering. File permission tightening
can be reintroduced later — either via a separate secrets file with
different mode bits, or by moving secret material to /var/lib/<svc>/
where the service drop-privileges account already has write access.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
neuron and cortex are independent packages installable on different
hosts. Having neuron run under a 'cortex' system user implied a
shared identity that doesn't exist. Give neuron its own user/group.
- New data/neuron-sysusers.conf declares the neuron user/group with
home /var/lib/neuron.
- systemd unit User/Group changed to neuron.
- Spec file attrs, explicit Provides, and %sysusers_create_compat
updated to reference the neuron user.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The neuron package was shipping its config at /etc/cortex/neuron.toml,
which implied a shared config directory between two independent
packages. Move to /etc/neuron/neuron.toml — neuron owns its own etc
dir, consistent with its own /usr/lib/sysusers.d/neuron.conf and
/usr/lib/systemd/system/neuron.service. Updated the systemd unit's
ExecStart path and the example toml header to match.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the in-repo .gitea/scripts/copr-build.sh and per-job
copr-cli configuration with the shared composite action at
https://git.lair.cafe/actions/copr-publish@v1. Behaviour is
identical — submit, watch, dump per-chroot logs — but the logic
now lives in a single place that other projects can consume.
Removes the actions/checkout step from both COPR jobs since the
build script is no longer local to this repo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dnf5 was silently rejecting neuron-0.1.3 with "Nothing to do" because
it had an unresolvable Requires. Inspection showed:
Requires: user(cortex) ← unversioned
Provides: user(cortex) = <base64> ← versioned only, no unversioned
rpm's sysusers provides-generator only emits the unversioned user()
provide when the u-line is minimal. Our sysusers.conf specifies GECOS,
home dir, and shell, which pushes the generator to versioned-only.
The matching Requires (auto-generated from %attr(,,cortex) on config
files) is unversioned, so resolution failed silently.
Explicitly declare Provides: user(cortex) and Provides: group(cortex)
to guarantee the unversioned forms exist. group(cortex) was already
emitted unversioned, but it is declared too for symmetry and to protect
against future generator changes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously the COPR publish steps only surfaced copr-cli's status
updates (pending/importing/running). When a build failed, diagnosing
required clicking through to the COPR web UI. Now we submit with
--nowait, watch the build, then use copr-cli download-build to fetch
each chroot's builder-live.log and cat them as collapsible ::group::
blocks in the CI output.
Logic is factored into .gitea/scripts/copr-build.sh so cortex and
neuron jobs share it. Both COPR jobs now check out the repo to access
the script.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
echo "no default models configured — skipping LLM probe"
exit 0
fi
echo "LLM probe against ${model}"
probe_body=$(printf '{"model":"%s","messages":[{"role":"user","content":"Reply with exactly one word: pineapple"}],"max_tokens":512,"temperature":0}' "${model}")
helexa is a self-hosted LLM serving stack for multi-node GPU inference clusters. It has two components:
- **cortex** — the per-operator control plane and LLM proxy. A Rust reverse-proxy that sits in front of the fleet and presents a unified OpenAI + Anthropic compatible API surface. It handles model routing, lifecycle management (load/unload/evict), request translation, and metrics collection.
- **neuron** — the per-host LLM harness. One instance runs on every GPU host, serving candle-based in-process inference and managing local hardware discovery and model lifecycle.
- **cortex** is the control plane. It exposes the unified API, routes requests, manages model lifecycle across the fleet, and collects metrics.
- **neuron** is the node plane. One instance runs on every GPU host. It discovers local hardware, manages in-process candle inference, handles NCCL tensor parallelism, and reports runtime state.
- cortex never shells out to `nvidia-smi`, never touches systemd units, and never talks directly to a harness. It talks only to neurons via HTTP API on port 13131.
### Per-device worker thread (neuron)
Every CUDA device gets one dedicated OS thread that owns its `CudaContext` for the daemon's lifetime. All CUDA operations route through this thread via a `std::sync::mpsc` job channel. Tensors never escape the worker thread alive. Inference replies carry `Vec<f32>` CPU-side logits; sampled tokens come back as `u32`. The opaque `ArchHandle(u64)` and `TpHandle(u64)` are indices into the worker's state slab, not pointers.
CPU loads (`Device::Cpu` fallback) keep the legacy `tokio::task::spawn_blocking + Arc<Mutex<ModelArch>>` path — there's no context to own and the channel hop would only add latency. Four `spawn_blocking` references in `harness/candle.rs` are deliberate CPU fallback.
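The ownership pattern described above can be sketched with the standard library alone. This is a minimal illustration, not neuron's actual code: `FakeModel`, `Job`, `spawn_worker`, `load`, and `forward` are hypothetical names, and a plain `f32` bias stands in for device state. Only `ArchHandle(u64)` and the `std::sync::mpsc` channel come from the description.

```rust
use std::sync::mpsc;
use std::thread;

/// Opaque handle: an index into the worker's slab, never a pointer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ArchHandle(pub u64);

/// Hypothetical stand-in for a loaded model; real state would hold device tensors.
struct FakeModel {
    bias: f32,
}

/// Jobs sent to the worker. Each carries a reply channel for CPU-side results.
pub enum Job {
    Load { bias: f32, reply: mpsc::Sender<ArchHandle> },
    Forward { handle: ArchHandle, input: Vec<f32>, reply: mpsc::Sender<Option<Vec<f32>>> },
}

/// Spawn the thread that owns all per-device state for its lifetime.
pub fn spawn_worker() -> mpsc::Sender<Job> {
    let (tx, rx) = mpsc::channel::<Job>();
    thread::spawn(move || {
        // The slab lives on this thread only; nothing in it is ever sent out.
        let mut slab: Vec<FakeModel> = Vec::new();
        for job in rx {
            match job {
                Job::Load { bias, reply } => {
                    slab.push(FakeModel { bias });
                    let _ = reply.send(ArchHandle((slab.len() - 1) as u64));
                }
                Job::Forward { handle, input, reply } => {
                    // Only plain CPU-side data (Vec<f32>) crosses the channel.
                    let out: Option<Vec<f32>> = slab
                        .get(handle.0 as usize)
                        .map(|m| input.iter().map(|x| x + m.bias).collect());
                    let _ = reply.send(out);
                }
            }
        }
        // The loop ends once every Sender is dropped; the slab is freed here.
    });
    tx
}

/// Ask the worker to load a model and wait for its handle.
pub fn load(worker: &mpsc::Sender<Job>, bias: f32) -> Option<ArchHandle> {
    let (reply, rx) = mpsc::channel();
    worker.send(Job::Load { bias, reply }).ok()?;
    rx.recv().ok()
}

/// Run a forward pass; returns None for an unknown handle or a dead worker.
pub fn forward(worker: &mpsc::Sender<Job>, handle: ArchHandle, input: Vec<f32>) -> Option<Vec<f32>> {
    let (reply, rx) = mpsc::channel();
    worker.send(Job::Forward { handle, input, reply }).ok()?;
    rx.recv().ok().flatten()
}
```

Because callers only ever hold an `ArchHandle`, a stale or bogus handle becomes a `None` reply rather than a dangling reference.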
### candle-native (not mistral.rs)
neuron builds directly on [candle](https://github.com/huggingface/candle). Every model architecture it serves is implemented in this repository, ported against the HuggingFace reference. No external inference server to babysit. The Harness trait remains as an internal seam for adding future engines (vision/audio/diffusion) but its only implementation is in-process candle.
### Streaming proxy
Chat completions are proxied as SSE streams. The gateway must:
1. Parse the inbound request to extract the model name
2. Route to the correct backend neuron
3. Stream the response back, capturing token timing for metrics
4. NOT buffer the full response — true streaming passthrough
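Steps 3 and 4 can be illustrated with a synchronous iterator adapter that forwards each chunk the moment it arrives and records timing as a side effect. This is a sketch under assumptions: the real gateway is async (the Conventions section mentions `tokio_stream`), a chunk is not necessarily one token, and `StreamTiming` and `Timed` are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// Timing captured while chunks flow through; nothing is buffered.
#[derive(Debug, Default, Clone)]
pub struct StreamTiming {
    pub chunks: u64,
    pub time_to_first_chunk: Option<Duration>,
    pub total: Option<Duration>,
}

/// Iterator adapter that passes chunks through untouched and stamps timing.
pub struct Timed<'a, I> {
    inner: I,
    started: Instant,
    timing: &'a mut StreamTiming,
}

impl<'a, I> Timed<'a, I> {
    pub fn new(inner: I, timing: &'a mut StreamTiming) -> Self {
        Timed { inner, started: Instant::now(), timing }
    }
}

impl<'a, I: Iterator> Iterator for Timed<'a, I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        match self.inner.next() {
            Some(chunk) => {
                // First chunk: record the time-to-first-chunk once.
                if self.timing.time_to_first_chunk.is_none() {
                    self.timing.time_to_first_chunk = Some(self.started.elapsed());
                }
                self.timing.chunks += 1;
                Some(chunk) // handed on immediately, never accumulated
            }
            None => {
                // Upstream finished: stamp total latency once.
                if self.timing.total.is_none() {
                    self.timing.total = Some(self.started.elapsed());
                }
                None
            }
        }
    }
}
```

The adapter holds one chunk at a time, so memory use stays flat regardless of response length.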
### Anthropic translation
When a request arrives at `/v1/messages` (Anthropic format), the gateway translates it to OpenAI format before proxying to neuron, then translates the response back. This is a stateless envelope transformation. The non-streaming round-trip is implemented; streaming SSE translation is deferred.
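The request half of that transformation might look like the sketch below. It is illustrative only: the struct and function names are hypothetical, content is reduced to plain strings, and only three fields are carried. The one structural difference it shows is that the Anthropic Messages format puts the system prompt in a top-level `system` field, while the OpenAI chat format expects it as a leading `system`-role message.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Simplified Anthropic-style request: system prompt is a top-level field.
#[derive(Debug, Clone, PartialEq)]
pub struct AnthropicRequest {
    pub model: String,
    pub system: Option<String>,
    pub messages: Vec<Message>,
    pub max_tokens: u32,
}

/// Simplified OpenAI-style request: system prompt is the first message.
#[derive(Debug, Clone, PartialEq)]
pub struct OpenAiRequest {
    pub model: String,
    pub messages: Vec<Message>,
    pub max_tokens: u32,
}

/// Pure function of its input: no session state is read or written.
pub fn to_openai(req: &AnthropicRequest) -> OpenAiRequest {
    let mut messages = Vec::with_capacity(req.messages.len() + 1);
    if let Some(system) = &req.system {
        messages.push(Message { role: "system".into(), content: system.clone() });
    }
    messages.extend(req.messages.iter().cloned());
    OpenAiRequest {
        model: req.model.clone(),
        messages,
        max_tokens: req.max_tokens,
    }
}
```

Since the function depends only on its argument, any gateway instance can translate any request without coordination.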
### Eviction
The evictor runs as a background task. Before loading a model on a node where VRAM is tight:
1. Check if the model is already loaded elsewhere → route there instead
2. Find the LRU model on the target node (excluding pinned models)
3. Call `POST {neuron}/models/unload` on that model
4. The incoming request's lazy-load triggers the new model load
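Steps 1 and 2 above reduce to a small pure decision function. The sketch below assumes an in-memory view of the fleet; `Node`, `LoadedModel`, `Decision`, and `plan` are hypothetical names, and `last_used` is a monotonic tick rather than whatever timestamp the evictor actually tracks.

```rust
#[derive(Debug, Clone)]
pub struct LoadedModel {
    pub id: String,
    pub last_used: u64, // monotonic tick; larger means more recent
    pub pinned: bool,
}

#[derive(Debug, Clone)]
pub struct Node {
    pub name: String,
    pub models: Vec<LoadedModel>,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    /// Step 1: the model is already loaded on another node.
    RouteTo(String),
    /// Step 2: unload this model id on the target node.
    Evict(String),
    /// Every model on the target is pinned, or the target is unknown.
    NothingEvictable,
}

pub fn plan(wanted: &str, target: &str, fleet: &[Node]) -> Decision {
    // Step 1: prefer routing to a node that already has the model.
    for node in fleet {
        if node.name != target && node.models.iter().any(|m| m.id == wanted) {
            return Decision::RouteTo(node.name.clone());
        }
    }
    // Step 2: least-recently-used model on the target, skipping pinned ones.
    let victim = fleet
        .iter()
        .find(|n| n.name == target)
        .and_then(|n| n.models.iter().filter(|m| !m.pinned).min_by_key(|m| m.last_used));
    match victim {
        Some(m) => Decision::Evict(m.id.clone()),
        None => Decision::NothingEvictable,
    }
}
```

An `Evict` result corresponds to step 3's `POST {neuron}/models/unload` call; step 4 needs no extra code here because the lazy load happens on the request path.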
### Metrics
Per-request: model, node, prompt_tokens, completion_tokens, total_tokens, tok_per_sec, time_to_first_token_ms, total_latency_ms. Exposed as Prometheus histograms/counters on a separate port (31314).
## Tech Stack
- **Rust 2024 edition** — workspace with 6 crates
- **Axum 0.8** — HTTP framework
- **reqwest** — HTTP client for proxying to backends
Run these locally before pushing. `cargo fmt --all` fixes formatting automatically. Clippy warnings must be resolved, not suppressed with `#[allow(...)]` unless there is a clear rationale.
Tagged releases (`v*`) build SRPMs for `cortex`, `helexa-neuron`, and `helexa-bench` and publish to COPR (`helexa/helexa`). Build metadata SHA injection: CI sets `HELEXA_BUILD_SHA=$(git rev-parse HEAD)`.
## Environment
- Targets Fedora 43 (systemd, SELinux enforcing)
- Nodes communicate over a private network (e.g. WireGuard mesh)
- cortex listens on port 31313 (API) and 31314 (metrics)
- neuron listens on port 13131 on each GPU host
- TLS terminated at gateway or via nginx; internal traffic is plaintext over WireGuard
## Conventions
- Error handling: `anyhow` for binaries, `thiserror` for library crates
- No `unwrap()` in library code; `expect()` only with clear rationale
- All public types derive `Debug, Clone, Serialize, Deserialize` where sensible
- Config structs use `figment` with TOML as primary source, env vars as override
- Prefer `Arc<RwLock<...>>` for shared fleet state; minimize lock duration
- SSE streaming uses `tokio_stream` + `eventsource-stream` for parsing
- Log at `info` for request routing, `debug` for proxy details, `warn` for eviction and node health, `error` for proxy failures
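As an illustration of the `Arc<RwLock<...>>` convention, the sketch below copies the needed value out under the lock and releases it before the caller does any I/O. It uses `std::sync::RwLock` for self-containment (the codebase may well use an async lock), and `NodeState`, `Fleet`, `endpoint_for`, and `mark_unhealthy` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone, PartialEq)]
pub struct NodeState {
    pub endpoint: String,
    pub healthy: bool,
}

pub type Fleet = Arc<RwLock<HashMap<String, NodeState>>>;

/// Clone the endpoint out under the read lock; the guard is dropped on
/// return, so the caller proxies the request without holding the lock.
pub fn endpoint_for(fleet: &Fleet, node: &str) -> Option<String> {
    let guard = fleet.read().ok()?; // a poisoned lock is treated as "no endpoint"
    let endpoint = guard
        .get(node)
        .filter(|n| n.healthy)
        .map(|n| n.endpoint.clone());
    endpoint
}

/// Take the write lock only for the single field update.
pub fn mark_unhealthy(fleet: &Fleet, node: &str) -> bool {
    match fleet.write() {
        Ok(mut guard) => match guard.get_mut(node) {
            Some(state) => {
                state.healthy = false;
                true
            }
            None => false,
        },
        Err(_) => false,
    }
}
```

Neither function calls `unwrap()`, in line with the library-code rule above.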
## Testing
### Gateway tests
Use mock neurons spawned via axum in `crates/cortex-gateway/tests/common/mod.rs`. Helpers: `spawn_mock_backend()`, `spawn_gateway()`.
### neuron integration tests
- Numerical reference tests (`numerical_reference.rs`) require `NEURON_REF_MODEL_PATH` env var pointing to a HF snapshot directory. Fixtures are f32-based for precision validation against HuggingFace transformers.
- CUDA integration tests (`tp_worker_lifecycle_cuda.rs`) are gated behind the `cuda-integration` feature and require 2+ CUDA devices (e.g., 2x RTX 5090).
### Metrics testing
Use `install_test_recorder()` in test code to capture metrics without the HTTP listener.
## helexa-bench
A continuous, version-aware benchmark harness. Hits each neuron directly on `:13131`, exercises each warm model with a Scenario suite (chat-latency family), and records results into SQLite stamped with the neuron's full `BuildInfo`. The loop is version-aware: skips any (target, build SHA, model, scenario) cell already at `samples_per_version`.
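The version-aware skip can be sketched as a counter keyed by the (target, build SHA, model, scenario) cell. The real harness records into SQLite; the in-memory `HashMap` here is a stand-in, and `Cell` and `SampleLedger` are hypothetical names. Only `samples_per_version` comes from the description above.

```rust
use std::collections::HashMap;

/// One benchmark cell: a new build SHA yields a new cell, so fresh builds
/// are sampled again while already-covered builds are skipped.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Cell {
    pub target: String,
    pub build_sha: String,
    pub model: String,
    pub scenario: String,
}

pub struct SampleLedger {
    counts: HashMap<Cell, u32>,
    samples_per_version: u32,
}

impl SampleLedger {
    pub fn new(samples_per_version: u32) -> Self {
        SampleLedger { counts: HashMap::new(), samples_per_version }
    }

    /// True while the cell has fewer samples than the configured quota.
    pub fn needs_sample(&self, cell: &Cell) -> bool {
        self.counts.get(cell).copied().unwrap_or(0) < self.samples_per_version
    }

    /// Record one completed run for the cell.
    pub fn record(&mut self, cell: Cell) {
        *self.counts.entry(cell).or_insert(0) += 1;
    }
}
```

Keying on the build SHA is what makes the loop idle once a version is fully sampled and resume automatically after an upgrade.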
Packaged as `helexa-bench` RPM (prebuilt-binary spec). One systemd unit, typically on the metrics host.
## helexa-acp
Agent Client Protocol bridge — connects ACP editors (Zed, etc.) to any OpenAI-compatible endpoint, cortex by default. Intentionally self-contained: no workspace crate dependencies. Uses `agent-client-protocol` with `unstable_session_model` feature for Zed model picker support. Licensed Apache-2.0 (workspace is GPL-3.0).
## RPM Packaging
- `cortex.spec` — installs the `cortex` binary
- `helexa-neuron.spec` — installs the `neuron` binary under package name `helexa-neuron` (renamed to avoid a collision with Fedora's NEURON neural-simulation package)
- Systemd units in `data/cortex.service`, `data/neuron.service`
- Example configs: `cortex.example.toml`, `neuron.example.toml`, `models.example.toml`
Install:
```sh
dnf copr enable helexa/helexa
dnf install cortex # gateway host
dnf install helexa-neuron # GPU nodes
```
## Configuration Files
### cortex.toml (gateway)
```toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru"  # lru | priority
defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
```
### models.toml (catalogue)
```toml
[[models]]
id = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
harness = "candle"
quant = "Q4_K_M"
vram_mb = 19000
min_devices = 2
min_device_vram_mb = 10000
pinned_on = ["beast"]  # optional: never evict from these neurons
```
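One plausible reading of the three sizing fields is sketched below. The semantics are an assumption, not taken from cortex's code: a device counts as eligible when its free VRAM is at least `min_device_vram_mb`, the node needs at least `min_devices` eligible devices, and their combined free VRAM must cover `vram_mb`. `ModelSpec` and `node_can_host` are hypothetical names.

```rust
/// The sizing fields from a catalogue entry.
pub struct ModelSpec {
    pub vram_mb: u64,
    pub min_devices: usize,
    pub min_device_vram_mb: u64,
}

/// Assumed placement rule: enough eligible devices, and enough VRAM across them.
pub fn node_can_host(spec: &ModelSpec, free_vram_mb: &[u64]) -> bool {
    let eligible: Vec<u64> = free_vram_mb
        .iter()
        .copied()
        .filter(|&free| free >= spec.min_device_vram_mb)
        .collect();
    eligible.len() >= spec.min_devices && eligible.iter().sum::<u64>() >= spec.vram_mb
}
```

Under this reading, the entry above fits a node with two devices that each have 12000 MB free, but not a single device with 24000 MB free, because `min_devices = 2` is not met.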
### neuron.toml (per-host)
Configured via figment + env override. See `neuron.example.toml` for reference.
## neuron API Endpoints
```
GET  /discovery             → hardware discovery (hostname, OS, CUDA, devices, harnesses)
GET  /health                → runtime GPU stats (VRAM, utilization, temperature)
GET  /models                → loaded/unloaded models with VRAM usage
POST /models/load           → load a model with spec (quant, TP, devices)
POST /models/unload         → unload a model, freeing device memory
GET  /models/{id}/endpoint  → inference URL for a model
GET  /version               → build metadata (SHA, features, candle version, etc.)
```
## Sources of Truth
When prose documentation conflicts with code, trust: