helexa

helexa/helexa

Commit Graph

Author

SHA1

Message

Date

rob thijssen

bc74e0e95f

feat(#47 phase 1a): EntitlementProvider trait + local/static provider

Some checks failed
CI / Format (push) Successful in 38s
CI / CUDA type-check (push) Successful in 1m39s
CI / Clippy (push) Successful in 2m26s
CI / Test (push) Successful in 4m49s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Build neuron-blackwell (push) Successful in 1m40s
build-prerelease / Build neuron-ada (push) Successful in 2m19s
build-prerelease / Build neuron-ampere (push) Successful in 2m22s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m49s
build-prerelease / Build cortex binary (push) Successful in 3m0s
build-prerelease / Test (push) Successful in 4m25s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m50s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m49s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m54s
build-prerelease / Build helexa-bench binary (push) Successful in 2m12s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled

Stage 1's build seam (#50): the interface that auth, metering, and budget
enforcement all hang off, with a local/static provider so the A0
amplification fix can land before any upstream clearing house exists.
The future helexa-upstream client (#57) is just another impl.

- cortex-core::entitlements: Principal {account_id, key_id}, CapWindow
  (Balance | Rolling{seconds}), Reservation handle, BudgetSnapshot,
  AuthError/BudgetError, and the async EntitlementProvider trait
  (resolve / reserve / settle / release / snapshot). BudgetError carries
  the window semantics so callers pick the #63 code (rate_limit_exceeded
  + Retry-After vs insufficient_quota) without the provider touching HTTP.
- cortex-core::config: [entitlements] section on GatewayConfig
  (require_auth + [[entitlements.keys]] with account_id, optional key_id,
  hard_cap, window). Additive + serde(default) — anonymous/uncapped when
  omitted, so existing setups are unaffected.
- cortex-gateway::entitlements_local: LocalEntitlementProvider. Budget
  math serialized under one Mutex so spent+reserved can never exceed a
  hard cap under concurrency (the #52 guarantee); rolling windows reset
  lazily; uncapped keys (no hard_cap) always reserve but still meter.
- CortexState gains Arc<dyn EntitlementProvider> + require_auth, built in
  from_config. Not yet consumed by the request path — auth middleware is
  1b (#49), enforcement is 1d (#52).
- cortex.example.toml documents the section; test GatewayConfig literals
  updated for the new field.
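
The budget math described in the list above can be illustrated with a minimal sketch. It is not the code from this commit: it is synchronous where the real trait is async, it models a single key rather than a key table, and the names `LocalBudget`, `BalanceExhausted`, `WindowExhausted`, and `code()` are hypothetical. It only shows how one Mutex keeps spent + reserved under a hard cap and how the error can carry window semantics without touching HTTP.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// How a hard cap is accounted: a one-off balance or a rolling window.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CapWindow {
    Balance,
    Rolling { seconds: u64 },
}

/// Over-cap error carrying the window semantics (variant names are assumed).
#[derive(Debug, PartialEq)]
pub enum BudgetError {
    BalanceExhausted,
    WindowExhausted { retry_after: Duration },
}

impl BudgetError {
    /// The caller, not the provider, maps the error to a wire code.
    pub fn code(&self) -> &'static str {
        match self {
            BudgetError::BalanceExhausted => "insufficient_quota",
            BudgetError::WindowExhausted { .. } => "rate_limit_exceeded",
        }
    }
}

struct State {
    spent: u64,
    reserved: u64,
    window_start: Instant,
}

/// Budget for one key; `hard_cap == None` means uncapped but still metered.
pub struct LocalBudget {
    hard_cap: Option<u64>,
    window: CapWindow,
    state: Mutex<State>,
}

impl LocalBudget {
    pub fn new(hard_cap: Option<u64>, window: CapWindow) -> Self {
        let state = State { spent: 0, reserved: 0, window_start: Instant::now() };
        Self { hard_cap, window, state: Mutex::new(state) }
    }

    /// Reserve `amount`, failing if spent + reserved + amount would exceed the cap.
    pub fn reserve(&self, amount: u64) -> Result<u64, BudgetError> {
        let mut s = self.state.lock().unwrap();
        // Rolling windows reset lazily, on the first call after expiry.
        if let CapWindow::Rolling { seconds } = self.window {
            if s.window_start.elapsed() >= Duration::from_secs(seconds) {
                s.spent = 0;
                s.window_start = Instant::now();
            }
        }
        if let Some(cap) = self.hard_cap {
            let wanted = s.spent.saturating_add(s.reserved).saturating_add(amount);
            if wanted > cap {
                return Err(match self.window {
                    CapWindow::Balance => BudgetError::BalanceExhausted,
                    CapWindow::Rolling { seconds } => BudgetError::WindowExhausted {
                        retry_after: Duration::from_secs(seconds)
                            .saturating_sub(s.window_start.elapsed()),
                    },
                });
            }
        }
        s.reserved = s.reserved.saturating_add(amount);
        Ok(amount)
    }

    /// Convert a reservation into actual spend.
    pub fn settle(&self, reserved: u64, actual: u64) {
        let mut s = self.state.lock().unwrap();
        s.reserved = s.reserved.saturating_sub(reserved);
        s.spent = s.spent.saturating_add(actual);
    }

    /// Drop a reservation without spending it.
    pub fn release(&self, reserved: u64) {
        let mut s = self.state.lock().unwrap();
        s.reserved = s.reserved.saturating_sub(reserved);
    }

    /// Returns (spent, reserved).
    pub fn snapshot(&self) -> (u64, u64) {
        let s = self.state.lock().unwrap();
        (s.spent, s.reserved)
    }
}
```

Because the check and the increment of `reserved` happen under the same lock, two concurrent reservations cannot both pass a check that only one of them fits.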

6 provider unit tests (resolve, unknown-key, round-trip, balance/rolling
over-cap codes, uncapped infra key). Local fmt/clippy/test all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-17 19:00:05 +03:00

rob thijssen

8b2e01a072

feat(#67 phase 4): advertise neuron-computed limit on /models; drop catalogue override

Some checks failed
CI / Test (push) Waiting to run
CI / Format (push) Successful in 35s
CI / CUDA type-check (push) Successful in 2m12s
CI / Clippy (push) Successful in 2m10s
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled

The neuron now self-derives and advertises limit{context,input,output}
per loaded model; cortex forwards it and stops consulting the
operator-declared catalogue limit (which can't track hot-swapped models
or live capacity). Operator-set `cost` still flows from the catalogue.

neuron:
- CandleHarness gains context_limit_cfg (from [harness.candle.context_limit]).
- LoadedHandle::derived_limit(): profile + live tightest-card free VRAM
  (single: query_vram; TP: query_vram_tightest_free_mb) + prefill-rate
  EMA (bootstrap until first sample) → derive_limit. None for arches
  without a context profile. No operator clamp here (advertise the honest
  derived value; the clamp is an enforcement-side backstop).
- list_models() fills ModelInfo.limit from derived_limit (was None).
- derive_limit treats free_tightest_mb == 0 (unknown/CPU sentinel) as
  "no VRAM ceiling" instead of collapsing to zero.

cortex:
- ModelEntry gains `limit`, copied from ModelInfo.limit by the poller.
- /v1/models: catalogue `limit` no longer flows (Pass 1 sets None);
  Pass 2 adopts the neuron's limit, taking the tightest across neurons
  via tightest_limit(). cost unchanged.
- model_limits.rs rewritten: catalogue limit (999999) is ignored; the
  neuron's ModelEntry.limit is advertised; cost still from catalogue.
- All ModelEntry literals updated with the new field.
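
The two halves described above, neuron-side derivation and cortex-side aggregation, can be sketched together. This is an illustration under stated assumptions, not the code from this commit: the `Profile` fields, the `kv_bytes_per_token` sizing rule, the `prefill_budget_s` parameter, the `PrefillEma` type, and both function signatures are hypothetical. Only the behaviours named in the message are taken from the source: the EMA falls back to a bootstrap value until its first sample, `free_tightest_mb == 0` means "no VRAM ceiling", and cortex advertises the tightest limit across neurons.

```rust
/// Advertised per-model budget in tokens; `input` is optional on the wire.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Limit {
    pub context: u64,
    pub input: Option<u64>,
    pub output: u64,
}

/// Hypothetical per-architecture context profile.
pub struct Profile {
    pub max_context: u64,
    pub max_output: u64,
    pub kv_bytes_per_token: u64,
}

/// Exponential moving average of the observed prefill rate (tokens/s).
pub struct PrefillEma {
    alpha: f64,
    value: Option<f64>,
}

impl PrefillEma {
    pub fn new(alpha: f64) -> Self {
        Self { alpha, value: None }
    }

    /// The first sample seeds the average; later samples are blended in.
    pub fn update(&mut self, sample: f64) {
        self.value = Some(match self.value {
            None => sample,
            Some(v) => self.alpha * sample + (1.0 - self.alpha) * v,
        });
    }

    /// Until a sample arrives, report the bootstrap rate.
    pub fn rate_or(&self, bootstrap: f64) -> f64 {
        self.value.unwrap_or(bootstrap)
    }
}

/// Neuron side: derive a limit from the profile, free VRAM, and prefill rate.
pub fn derive_limit(
    profile: &Profile,
    free_tightest_mb: u64,
    prefill_tok_per_s: f64,
    prefill_budget_s: f64,
) -> Limit {
    // 0 is the unknown/CPU sentinel: treat it as "no VRAM ceiling".
    let vram_ceiling = if free_tightest_mb == 0 {
        u64::MAX
    } else {
        free_tightest_mb.saturating_mul(1024 * 1024) / profile.kv_bytes_per_token.max(1)
    };
    let context = profile.max_context.min(vram_ceiling);
    let output = profile.max_output.min(context);
    // Assumed rule: the prompt must prefill within `prefill_budget_s` seconds.
    let prefill_ceiling = (prefill_tok_per_s * prefill_budget_s) as u64;
    let input = context.saturating_sub(output).min(prefill_ceiling);
    Limit { context, input: Some(input), output }
}

/// Cortex side: field-wise minimum over neurons that reported a limit.
pub fn tightest_limit<I>(limits: I) -> Option<Limit>
where
    I: IntoIterator<Item = Option<Limit>>,
{
    limits.into_iter().flatten().reduce(|a, b| Limit {
        context: a.context.min(b.context),
        input: match (a.input, b.input) {
            (Some(x), Some(y)) => Some(x.min(y)),
            (x, y) => x.or(y),
        },
        output: a.output.min(b.output),
    })
}
```

In this sketch a neuron that reports no limit is skipped rather than treated as zero, which mirrors the sentinel handling on the neuron side; whether the real `tightest_limit()` does the same is an assumption.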

fmt/clippy/test green; CUDA paths type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-17 14:10:20 +03:00

rob thijssen

8a636c687f

feat(cortex): per-model limit + cost on /v1/models; remove max_model_len

All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 37s
build-prerelease / Build neuron-blackwell (push) Successful in 1m36s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m33s
build-prerelease / Build neuron-ada (push) Successful in 2m2s
build-prerelease / Build neuron-ampere (push) Successful in 2m47s
build-prerelease / Build helexa-bench binary (push) Successful in 2m8s
build-prerelease / Build cortex binary (push) Successful in 2m35s
build-prerelease / Test (push) Successful in 5m13s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m43s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m42s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 54s

Resolves #62. opencode's helexa provider discovers a model's serving
budget from /v1/models and uses it to size context, trigger compaction,
and show spend with no hand-configuration. Each model entry now carries:

  - limit { context, input?, output }  — operator-declared in models.toml
  - cost  { input, output, cache_read?, cache_write? }  — USD per 1M tokens
  - tool_call / reasoning  — runtime-detected by the candle harness and
    OR-ed in from each serving neuron
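
As a worked illustration of the "USD per 1M tokens" unit, a client could turn the advertised `cost` into spend roughly as follows. The `Usage` struct, the `spend_usd` function, and the fallback for omitted cache rates (billing cached tokens at the input rate) are assumptions made for this sketch, not behaviour stated by the commit.

```rust
/// Per-model pricing in USD per 1M tokens; cache rates are optional.
pub struct Cost {
    pub input: f64,
    pub output: f64,
    pub cache_read: Option<f64>,
    pub cache_write: Option<f64>,
}

/// Hypothetical token counts for one request.
pub struct Usage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub cache_read_tokens: u64,
    pub cache_write_tokens: u64,
}

/// Spend in USD for one request.
pub fn spend_usd(cost: &Cost, usage: &Usage) -> f64 {
    const TOKENS_PER_UNIT: f64 = 1_000_000.0;
    // Assumption: an omitted cache rate falls back to the plain input rate.
    let cache_read = cost.cache_read.unwrap_or(cost.input);
    let cache_write = cost.cache_write.unwrap_or(cost.input);
    let total = usage.input_tokens as f64 * cost.input
        + usage.output_tokens as f64 * cost.output
        + usage.cache_read_tokens as f64 * cache_read
        + usage.cache_write_tokens as f64 * cache_write;
    total / TOKENS_PER_UNIT
}
```

For example, 2,000,000 input tokens at 0.50 and 1,000,000 output tokens at 1.50 come to 1.00 + 1.50 = 2.50 USD.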

Composition: the catalogue profile supplies limit/cost (Pass 1); the
poller carries the neuron's detected tool_call/reasoning into ModelEntry,
which the gateway unions onto the entry (Pass 2); aliases propagate every
field (Pass 4). Wire types extend ModelInfo / ModelProfile /
CortexModelEntry additively (serde default + skip_serializing_if), so
older neurons and clients are unaffected. helexa-bench's ModelInfo
constructor and the gateway test fixtures are updated for the new fields.
Adds tests/model_limits.rs asserting /v1/models surfaces limit + cost
(catalogue) and tool_call + reasoning (runtime), and that max_model_len
is gone.
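
The Pass 2 union of runtime-detected flags amounts to a logical OR across the neurons serving a model. A minimal sketch, assuming a hypothetical `Capabilities` struct and `union_capabilities` helper (neither name appears in the commit):

```rust
/// Runtime-detected flags reported by one neuron for one model.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Capabilities {
    pub tool_call: bool,
    pub reasoning: bool,
}

/// A flag is advertised if any serving neuron detected it.
pub fn union_capabilities(per_neuron: &[Capabilities]) -> Capabilities {
    per_neuron.iter().fold(Capabilities::default(), |acc, c| Capabilities {
        tool_call: acc.tool_call || c.tool_call,
        reasoning: acc.reasoning || c.reasoning,
    })
}
```

With no serving neurons the fold returns the default, so both flags stay false.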

Removes max_model_len. It was write-only with no consumer — opencode's
source references it nowhere and it is not an OpenAI /v1/models field —
and doubly misleading: vLLM's max_model_len means total sequence length,
but cortex populated it from NEURON_MAX_PROMPT_TOKENS, a prompt-only cap.
The limit{} contract replaces it. The neuron's max_prompt_tokens remains
the enforced prompt cap (neuron-side); cortex just stops re-advertising a
derived, mis-named copy. Closes #66 — its stale-max_model_len premise is
moot once the field is gone.

limit/cost are operator-declared (catalogue) per #62's design; auto-
deriving the advertised budget from each neuron's reported cap is a
tracked follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-17 09:26:55 +03:00

3 Commits