cortex/crates
rob thijssen c6022aa6b9
feat(stage-8b): Llama + Qwen3 MoE families on the candle harness
Broadens the single-GPU dense and quantized paths to cover three
non-Qwen3 architectures already shipped by candle-transformers.
Tensor parallelism (TP) for these is a separate stage (each family
would need its own tp_*.rs mirroring tp_qwen3.rs).

`ModelArch` gains four variants:
- LlamaDense (boxed; wraps Llama plus an inline Cache and the config
  needed to rebuild that cache, since
  candle_transformers::models::llama::Cache has no reset)
- LlamaQuantized (candle_transformers::models::quantized_llama)
- Qwen3MoeDense
  (candle_transformers::models::qwen3_moe::ModelForCausalLM)
- Qwen3MoeQuantized
  (candle_transformers::models::quantized_qwen3_moe::GGUFQWenMoE;
  takes an explicit compute dtype, F16 by default for best
  consumer-GPU throughput)
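
A minimal sketch of the rebuild-on-clear idea behind `LlamaDense`,
assuming stand-in `Config` and `Cache` types (the real ones live in
candle-transformers and are not reproduced here). `LlamaDenseSketch`,
`push_kv` and `cached_layers` are hypothetical names used only for
illustration:

```rust
// Stand-in for the Llama config; only what the cache needs to be rebuilt.
#[derive(Clone, Debug, PartialEq)]
pub struct Config {
    pub num_layers: usize,
}

// Stand-in for a KV cache that can be constructed but not reset in place.
#[derive(Debug)]
pub struct Cache {
    pub kvs: Vec<Option<Vec<f32>>>,
}

impl Cache {
    pub fn new(cfg: &Config) -> Self {
        Cache { kvs: vec![None; cfg.num_layers] }
    }
}

// Hypothetical wrapper: keeps the config next to the cache so that
// "clearing" can be done by building a fresh cache.
pub struct LlamaDenseSketch {
    cache: Cache,
    config: Config,
}

impl LlamaDenseSketch {
    pub fn new(config: Config) -> Self {
        let cache = Cache::new(&config);
        Self { cache, config }
    }

    // Simulates a forward pass storing a KV entry for one layer.
    pub fn push_kv(&mut self, layer: usize, kv: Vec<f32>) {
        self.cache.kvs[layer] = Some(kv);
    }

    pub fn cached_layers(&self) -> usize {
        self.cache.kvs.iter().filter(|e| e.is_some()).count()
    }

    // No in-place reset exists, so replace the cache wholesale.
    pub fn clear_kv_cache(&mut self) {
        self.cache = Cache::new(&self.config);
    }
}
```

After `clear_kv_cache`, `cached_layers()` reports zero again, which is
the behaviour a per-request reset needs.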

The dispatch is method-based now:
- `ModelArch::forward(&mut self, input, offset) -> Result<Tensor>`
  with a shared `squeeze_to_vocab` normalising shape differences
  (qwen3 returns [B,1,V]; quantized_qwen3 returns [B,V]; new families
  may differ again — the helper handles all of them).
- `ModelArch::clear_kv_cache(&mut self) -> Result<()>`. Llama needs
  a Cache rebuild because its Cache has no in-place reset; the new
  `LlamaDense` wrapper holds the bits needed to do it.
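
The method-based dispatch can be sketched roughly as follows. This is
a dependency-free illustration, not the code in this commit: `Tensor`
is a toy stand-in for candle's tensor, `ModelArchSketch` and its two
variants are hypothetical, and the target shape [B,V] is an assumption
about what `squeeze_to_vocab` normalises to:

```rust
// Toy tensor: a shape plus flat data, standing in for candle's Tensor.
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub shape: Vec<usize>,
    pub data: Vec<f32>,
}

// Normalise logits to [B, V] (assumed target shape).
// Accepts [B, V] unchanged and drops the unit axis of [B, 1, V].
pub fn squeeze_to_vocab(t: Tensor) -> Result<Tensor, String> {
    let new_shape = match t.shape.as_slice() {
        [b, v] => vec![*b, *v],
        [b, 1, v] => vec![*b, *v],
        other => return Err(format!("unexpected logits shape {:?}", other)),
    };
    Ok(Tensor { shape: new_shape, data: t.data })
}

// Hypothetical two-variant enum: one family emits [B,1,V], the other [B,V].
pub enum ModelArchSketch {
    ThreeDim { vocab: usize },
    TwoDim { vocab: usize },
}

impl ModelArchSketch {
    // Single entry point: callers never match on the variant themselves.
    pub fn forward(&mut self, input: &[u32], _offset: usize) -> Result<Tensor, String> {
        if input.is_empty() {
            return Err("empty input".to_string());
        }
        let raw = match self {
            Self::ThreeDim { vocab } => Tensor {
                shape: vec![1, 1, *vocab],
                data: vec![0.0; *vocab],
            },
            Self::TwoDim { vocab } => Tensor {
                shape: vec![1, *vocab],
                data: vec![0.0; *vocab],
            },
        };
        squeeze_to_vocab(raw)
    }
}
```

Both variants hand back the same shape, so code downstream of
`forward` does not need to know which family produced the logits.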

`run_inference` / `run_inference_streaming` collapse to a single
dispatch path: no more per-variant match arms in the hot loop, and
new architectures pick up streaming + non-streaming for free with
zero changes outside `ModelArch`.
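
One way the collapsed hot loop could look, as a sketch under stated
assumptions: a `Forward` trait stands in for the `ModelArch` methods,
decoding is greedy argmax, and `EchoModel`, `run` and `run_streaming`
are hypothetical names rather than the functions in this commit. The
non-streaming path is just the streaming path with a collecting
callback:

```rust
// Stand-in for the ModelArch::forward entry point.
pub trait Forward {
    fn forward(&mut self, input: &[u32], offset: usize) -> Result<Vec<f32>, String>;
}

// Index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, v) in logits.iter().enumerate() {
        if *v > logits[best] {
            best = i;
        }
    }
    best as u32
}

// Streaming path: one loop, no per-architecture branches.
pub fn run_streaming<M: Forward>(
    model: &mut M,
    prompt: &[u32],
    max_new: usize,
    mut on_token: impl FnMut(u32),
) -> Result<(), String> {
    let mut input = prompt.to_vec();
    let mut offset = 0;
    for _ in 0..max_new {
        let logits = model.forward(&input, offset)?;
        let next = argmax(&logits);
        on_token(next);
        offset += input.len();
        input = vec![next]; // later steps feed only the new token
    }
    Ok(())
}

// Non-streaming path: reuse the same loop and collect the tokens.
pub fn run<M: Forward>(model: &mut M, prompt: &[u32], max_new: usize) -> Result<Vec<u32>, String> {
    let mut out = Vec::new();
    run_streaming(model, prompt, max_new, |t| out.push(t))?;
    Ok(out)
}

// Toy model for illustration: predicts (last token + 1) mod vocab.
// Assumes vocab > 0.
pub struct EchoModel {
    pub vocab: usize,
}

impl Forward for EchoModel {
    fn forward(&mut self, input: &[u32], _offset: usize) -> Result<Vec<f32>, String> {
        let last = *input.last().ok_or_else(|| "empty input".to_string())?;
        let mut logits = vec![0.0f32; self.vocab];
        logits[(last as usize + 1) % self.vocab] = 1.0;
        Ok(logits)
    }
}
```

Any new model type only has to implement the forward entry point;
both `run` and `run_streaming` then work without modification.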

DENSE_SUPPORTED_MODEL_TYPES is now ["llama", "qwen3", "qwen3_moe"].
GGUF arch switch grows "qwen3moe" + "llama" branches (qwen3moe with
no underscore matches llama.cpp's general.architecture convention).
Stage 8a's diagnostic auto-reports the new supported set.
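
The naming split between the two lookups can be illustrated with a
small sketch. `GgufArch`, `gguf_arch` and `dense_supported` are
hypothetical names, and the presence of a "qwen3" GGUF branch is
assumed from the pre-existing Qwen3 support; only the string values
come from this commit:

```rust
// Hypothetical enum for the GGUF architecture switch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum GgufArch {
    Qwen3,
    Qwen3Moe,
    Llama,
}

// GGUF files name the architecture in `general.architecture`,
// where the MoE variant is spelled without an underscore.
pub fn gguf_arch(general_architecture: &str) -> Option<GgufArch> {
    match general_architecture {
        "qwen3" => Some(GgufArch::Qwen3), // assumed pre-existing branch
        "qwen3moe" => Some(GgufArch::Qwen3Moe),
        "llama" => Some(GgufArch::Llama),
        _ => None,
    }
}

// Dense checkpoints are matched on the config's model type,
// which uses the underscore spelling.
pub const DENSE_SUPPORTED_MODEL_TYPES: [&str; 3] = ["llama", "qwen3", "qwen3_moe"];

pub fn dense_supported(model_type: &str) -> bool {
    DENSE_SUPPORTED_MODEL_TYPES.contains(&model_type)
}
```

The two spellings are not interchangeable: `gguf_arch("qwen3_moe")`
returns `None`, and `dense_supported("qwen3moe")` returns `false`.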

The `LlamaDense` variant is boxed because the wrapper's inline Cache
+ Config makes it 544 bytes vs ~300 for everything else
(clippy::large_enum_variant).
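
The effect of boxing a large variant can be checked with
`std::mem::size_of`. The sketch below uses byte-array payloads sized
after the figures quoted above; the type names are hypothetical and
the payloads are not the real model structs:

```rust
use std::mem::size_of;

// Stand-ins sized after the numbers in the commit message.
pub struct BigWrapper {
    _payload: [u8; 544],
}

pub struct Small {
    _payload: [u8; 300],
}

// Every value of this enum is as large as its largest variant.
pub enum Unboxed {
    Big(BigWrapper),
    Small(Small),
}

// Boxing the large variant stores only a pointer inline, so the
// enum shrinks to roughly the size of the next-largest variant.
pub enum Boxed {
    Big(Box<BigWrapper>),
    Small(Small),
}

pub fn unboxed_size() -> usize {
    size_of::<Unboxed>()
}

pub fn boxed_size() -> usize {
    size_of::<Boxed>()
}
```

Exact sizes depend on alignment and the discriminant, so the sketch
compares the two enums rather than asserting specific byte counts.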

Verified: cargo test --workspace passes 66 tests; cargo clippy is
clean both for the CPU build and with `--features cuda` (the cuda
check ran inside the locally-built `neuron-build-local` container
with the math_functions.h patch applied).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:36:22 +03:00