10c151efa5c118e620188832ea25ec5de5528b44
10 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
1ebbe87651
|
feat(stage-8d-1): import mistralrs GDN CUDA kernels — build infra only
Stage 8d (new): port the Gated DeltaNet CUDA kernels from
EricLBuehler/mistral.rs to close the ~500x decode performance gap
we measured on Qwen3.6-27B TP-2 (~12s/token in our pure-candle path
vs ~37 T/s in mistralrs on the same hardware).
This commit lays the build infrastructure with zero behavioural
change. Subsequent commits (8d-2 .. 8d-5) wire each kernel into the
qwen3_5 architecture and TP variant.
Added:
- `crates/neuron/build.rs` — uses `cudaforge::KernelBuilder` to compile
every `src/cuda/*.cu` file into `libneuroncuda.a` under the `cuda`
feature, then links it + `cudart`. Mirrors mistralrs's
`mistralrs-core/build.rs` setup verbatim (same NVCC flag set, same
sm_<80 bf16 gate).
- `crates/neuron/src/cuda/gdn.cu` — five kernels ported verbatim from
upstream:
* `gated_delta_rule_recurrence` (V-tiled per-token decode)
* `chunked_gated_delta_rule_recurrence` (BT=64 chunked prefill)
* `causal_conv1d_update` (single-token conv decode)
* `causal_conv1d_full` (multi-token conv prefill)
* `fused_gdn_gating` (beta = sigmoid(b); g = -exp(A_log) *
softplus(a + dt_bias))
- `crates/neuron/src/cuda/gdn.rs` — Rust wrappers around the kernels,
cudarc::CudaSlice::device_ptr boilerplate identical to upstream.
- `crates/neuron/src/cuda/ffi.rs` — `extern "C"` decls (subset of
upstream's ffi.rs covering only the five GDN kernels; MoE / SSM /
top-k decls land here when we absorb those too).
- `crates/neuron/src/cuda/mod.rs` — re-exports + module docs.
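For orientation, the `fused_gdn_gating` formula quoted above can be written as a scalar CPU reference. This is a minimal sketch, not the ported kernel: the function name `gdn_gating_ref`, the one-value-per-head slice layout, and the `f32` dtype are assumptions made for illustration.

```rust
/// Numerically stable softplus: ln(1 + e^x). For large x the result is ~x.
fn softplus(x: f32) -> f32 {
    if x > 20.0 { x } else { x.exp().ln_1p() }
}

/// Logistic sigmoid.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// CPU reference for the gating formula from the commit message:
///   beta = sigmoid(b)
///   g    = -exp(A_log) * softplus(a + dt_bias)
/// Assumed layout (hypothetical): one scalar per head, for a single token.
fn gdn_gating_ref(
    b: &[f32],
    a: &[f32],
    a_log: &[f32],
    dt_bias: &[f32],
) -> (Vec<f32>, Vec<f32>) {
    assert!(b.len() == a.len() && a.len() == a_log.len() && a_log.len() == dt_bias.len());
    let beta: Vec<f32> = b.iter().map(|&x| sigmoid(x)).collect();
    let g: Vec<f32> = a
        .iter()
        .zip(a_log)
        .zip(dt_bias)
        .map(|((&a_i, &a_log_i), &dt_i)| -a_log_i.exp() * softplus(a_i + dt_i))
        .collect();
    (beta, g)
}
```

Because softplus is positive and `exp(A_log)` is positive, `g` is always negative. A reference like this could serve as a comparison target when the kernels are wired in by later commits.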
Cargo wiring: `cudaforge` added as an optional build-dep, activated
by the `cuda` feature. CPU build is unchanged (the `cuda/` module is
fully gated behind `#[cfg(feature = "cuda")]`). The `cuda` feature build inside the
patched container compiles `gdn.cu` (1 of 1 kernels) and links
cleanly.
Licensing: upstream files preserve their MIT origin via per-file
comment banners pointing to the mistralrs path. No behaviour-relevant
edits to the .cu kernels — local diff against upstream is just the
banner. The `.rs` wrappers and `ffi.rs` subset are also from upstream;
their structure (module path `crate::cuda::ffi::*`) matches upstream exactly,
so future kernel imports drop in unchanged.
CPU clippy + 32 lib tests pass; `cargo clippy --features cuda` clean
inside the runner container.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
96d8755245
|
fix(tp): add half dep + drop double-wrapped .w() on CudaDevice::alloc
Two follow-up cuda-only fixes surfaced by `cargo build --features cuda`
inside the cuda-13.0 runner container:
1. `half::{bf16, f16}` was an undeclared dep. Added `half = "2.5"`
(matching candle-core's pinned major) under the cuda feature flag.
2. `dev.alloc::<T>(n)` already returns `candle_core::Result` (it calls
`.w()` internally on the cudarc error). Calling `.w()?` on top of
that needs `From<candle_core::Error> for CudaError`, which doesn't
exist — collapse to `?`. Removed the now-unused
`cuda_backend::WrapErr` import.
Verified by `cargo build -p neuron --features cuda` and
`cargo clippy -p neuron --all-targets --features cuda -- -D warnings`
inside `git.lair.cafe/gongfoo/runner-cuda-13.0` with the local
glibc/CUDA-13.0 math_functions.h noexcept patch. CPU clippy/tests stay
green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
da068ded6d
|
Stage 7a-ii: real NCCL handshake behind the worker pool
Wires cudarc::nccl into the TP worker lifecycle introduced in 7a-i.
With --features cuda the leader and its workers now establish a live
NCCL communicator end-to-end; without the feature the same code paths
return Error{kind="cuda_feature_not_enabled"}, so a misconfigured
build fails visibly instead of silently doing nothing.
NCCL state machine (harness/tp/nccl_state.rs) is shared between the
worker process and the leader's pool:
- generate_comm_id_hex() mints an Id::new() on the leader.
- NcclState::init parses 256 hex chars → [c_char; 128] → Id::uninit,
opens a CudaContext on the configured device, calls Comm::from_rank
with the supplied (rank, world_size, id). NCCL blocks until every
rank has joined.
- NcclState::sanity_check runs one all_reduce(1u32, Sum); the leader
asserts every rank reports observed_sum == world_size.
- NCCL handles serialised under Mutex; unsafe impl Send/Sync gates
the Comm across spawn_blocking boundaries (NCCL is move-safe; only
concurrent op issuance is unsafe).
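The wire encoding described above (256 hex chars for a 128-byte id) can be sketched as a plain roundtrip. This is an illustrative sketch only: `id_to_hex` and `hex_to_id` are hypothetical names, and the actual code in `nccl_state.rs` may differ in error type and casing.

```rust
use std::os::raw::c_char;

/// Encode a 128-byte id as 256 lowercase hex characters.
fn id_to_hex(id: &[c_char; 128]) -> String {
    id.iter().map(|&c| format!("{:02x}", c as u8)).collect()
}

/// Decode 256 hex characters back into the 128-byte id.
/// Rejects wrong lengths and any non-hex character up front.
fn hex_to_id(hex: &str) -> Result<[c_char; 128], String> {
    if hex.len() != 256 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("expected 256 hex chars, got {} bytes", hex.len()));
    }
    let mut out = [0 as c_char; 128];
    for (i, slot) in out.iter_mut().enumerate() {
        // Safe to slice by byte offset: the input was checked to be ASCII hex.
        let byte = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16)
            .map_err(|e| format!("bad hex at offset {}: {}", 2 * i, e))?;
        *slot = byte as c_char;
    }
    Ok(out)
}
```

Going through `u8` keeps the sketch portable, since `c_char` is signed on some targets and unsigned on others.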
WorkerPool::init_nccl orchestrates the rendezvous:
1. Write Init { comm_id } to every worker's stdin (no await yet).
2. Leader rank 0 calls its own Comm::from_rank in spawn_blocking,
concurrently with workers.
3. NCCL handshake completes for all ranks simultaneously.
4. Leader collects InitOk responses.
WorkerPool::nccl_sanity_check follows the same pattern over
all_reduce, validating world_size == observed_sum on every rank.
Worker.send_only / Worker.recv_only split out from the previous
monolithic Worker.request so the leader can interleave its own NCCL
work with the worker calls — required because NCCL blocks during
init.
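The ordering constraint behind the send/recv split can be modelled without NCCL. In the sketch below, a `std::sync::Barrier` stands in for the blocking handshake and channels stand in for worker stdin/stdout; the function `rendezvous` and all names in it are hypothetical. If the leader waited for each worker's reply before joining the barrier itself, no rank would ever get past it.

```rust
use std::sync::{mpsc, Arc, Barrier};
use std::thread;

/// Model of the four-step rendezvous. Returns the worker ranks that
/// acknowledged init, sorted. The Barrier is a stand-in for the NCCL
/// handshake, which blocks until every rank has joined.
fn rendezvous(world_size: usize) -> Vec<usize> {
    assert!(world_size >= 1);
    let barrier = Arc::new(Barrier::new(world_size));
    let (ok_tx, ok_rx) = mpsc::channel::<usize>();
    let mut init_txs = Vec::new();
    let mut handles = Vec::new();

    for rank in 1..world_size {
        let (init_tx, init_rx) = mpsc::channel::<()>();
        let barrier = Arc::clone(&barrier);
        let ok_tx = ok_tx.clone();
        handles.push(thread::spawn(move || {
            init_rx.recv().unwrap(); // worker receives Init
            barrier.wait(); // blocks until all ranks, leader included, join
            ok_tx.send(rank).unwrap(); // worker replies InitOk
        }));
        init_txs.push(init_tx);
    }
    drop(ok_tx); // only worker clones remain, so ok_rx ends when they finish

    // Step 1: send Init to every worker without waiting for replies.
    for tx in &init_txs {
        tx.send(()).unwrap();
    }
    // Steps 2-3: the leader joins as rank 0; the handshake completes for all.
    barrier.wait();
    // Step 4: collect the InitOk responses.
    let mut acked: Vec<usize> = ok_rx.iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    acked.sort_unstable();
    acked
}
```

Calling `rendezvous(4)` yields the three worker ranks once all four participants have passed the barrier.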
Tests:
- 4 hex roundtrip unit tests for the wire encoding.
- The 7a-i "not implemented" expectation now reads
"cuda_feature_not_enabled" on the local dev box (no CUDA), or
accepts InitOk on a cuda-built test binary.
- New cuda-integration test in tp_worker_lifecycle_cuda.rs covers
the real init + sanity round-trip; gated on the cuda-integration
feature so default CI doesn't attempt an NCCL handshake.
Verifiable on beast (2× RTX 5090):
cargo test -p neuron --features cuda-integration \
--test tp_worker_lifecycle_cuda
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
84f5662df1
|
feat(neuron): OpenAI-compatible SSE streaming chat completions
Stage 4 of the candle-native pivot. /v1/chat/completions now switches
to text/event-stream when the request sets stream: true, emitting one
chat.completion.chunk per generated token followed by the OpenAI
[DONE] terminator.
Pipeline:
- chat_completion_stream creates a bounded mpsc::channel<ChatCompletionChunk>(32),
sends the leading role chunk, then spawns a blocking task that
acquires the per-model arch lock and runs the streaming generation
loop.
- run_inference_streaming tracks a cumulative decoded prefix so each
chunk's delta.content is the substring added since the last chunk —
safe across BPE byte-fallback boundaries that would otherwise split
multi-byte UTF-8 chars.
- The blocking task aborts cleanly if blocking_send fails (client
disconnected), so generation stops when the SSE consumer hangs up.
- Final chunk carries finish_reason ("stop" on EOS, "length" on
max_tokens). The handler appends data: [DONE] after the channel
closes.
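The cumulative-prefix idea can be illustrated on raw bytes. This is a simplified sketch under stated assumptions: tokens are modelled as byte slices rather than tokenizer ids, and `DeltaTracker` is a hypothetical name, not the type used by `run_inference_streaming`. It also does not handle bytes that are outright invalid UTF-8, only sequences that are incomplete so far.

```rust
/// Tracks how much of the cumulative decode has already been emitted.
struct DeltaTracker {
    bytes: Vec<u8>, // every byte decoded so far
    emitted: usize, // byte length already sent to the client
}

impl DeltaTracker {
    fn new() -> Self {
        Self { bytes: Vec::new(), emitted: 0 }
    }

    /// Append one token's bytes and return the newly completed text, if any.
    /// A token that ends mid-character yields None until the rest arrives.
    fn push(&mut self, token_bytes: &[u8]) -> Option<String> {
        self.bytes.extend_from_slice(token_bytes);
        // Longest prefix of the cumulative buffer that is valid UTF-8.
        let valid_len = match std::str::from_utf8(&self.bytes) {
            Ok(s) => s.len(),
            Err(e) => e.valid_up_to(),
        };
        if valid_len <= self.emitted {
            return None;
        }
        // Both ends are valid boundaries, so this slice is valid UTF-8.
        let delta = std::str::from_utf8(&self.bytes[self.emitted..valid_len])
            .expect("slice lies on UTF-8 boundaries")
            .to_owned();
        self.emitted = valid_len;
        Some(delta)
    }
}
```

Feeding the two bytes of "é" (0xC3, 0xA9) as separate tokens produces no delta for the first byte and the whole character for the second, which is the behaviour the commit describes for byte-fallback boundaries.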
The Stage 3 streaming 501 placeholder test is repurposed: with the
streaming path live, an unloaded model now hits the same 404 surface
as the non-streaming path (the model lookup happens first).
cortex-gateway's existing proxy is unchanged — it already forwards
SSE bytes verbatim from Phase 2 work, so the candle SSE format passes
through unmodified.
Neuron Cargo.toml gains futures + tokio-stream (both already in
workspace deps) for ReceiverStream and stream combinators.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
5c957d08ec
|
ci: add build-prerelease workflow for CUDA RPMs on rpm.lair.cafe
Adds a manually-triggered workflow that builds CUDA-flavoured neuron
binaries and a CPU cortex binary, packages them as Fedora RPMs, signs
them, and rsyncs to the unstable channel at
https://rpm.lair.cafe/fedora/43/x86_64/unstable/. Mirrors the build
pipeline used by grenade/mistralrs-package.
Pipeline:
- prepare: derive {version,short_sha,commit_date} from the checkout;
the prerelease Release stamp "0.1.YYYYMMDDgitSHORTSHA" sorts below
the eventual "1" stable release.
- build-cortex: cargo build --release -p cortex-cli on a rust runner.
- build-neuron: matrix over ada (sm_89) and blackwell (sm_120) on
cuda-13.0 runners; cargo build with features "cuda cudnn flash-attn"
and CUDA_COMPUTE_CAP set per flavour.
- package-{cortex,neuron}: rpmbuild on the rpm runner against the new
prebuilt-binary specs in rpm/.
- publish: import signing key, sign RPMs, rsync to oolon,
createrepo_c --update, then regenerate packages.json for the UI.
New specs are prebuilt-binary variants: they consume the artifact from
the build job rather than running cargo at rpmbuild time. Each
helexa-neuron-{flavour} package Conflicts with the other flavours and
with helexa-neuron (the future source-build stable package) so one
flavour is installed at a time on a given host.
neuron crate gains cudnn and flash-attn feature flags forwarding to
the corresponding candle features, so the CI build command compiles
those kernels into the binary.
sccache is intentionally NOT used in the prerelease jobs: CUDA compute
cap isn't in its cache key, so flavours would mis-hit each other. Each
prerelease build is a clean cargo build.
Required Gitea secrets (already in place for cortex.spec / COPR
workflow):
- RPM_SIGNING_KEY, RPM_SIGNING_KEY_ID
- RSYNC_SSH_KEY
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
729317d1ef
|
feat(neuron): OpenAI-compatible non-streaming chat completion
Stage 3 of the candle-native pivot. neuron now serves
POST /v1/chat/completions backed by candle's quantized_qwen3 forward
pass on a per-model serialised generation loop, returning the standard
OpenAI ChatCompletionResponse envelope.
Pipeline per request:
- Look up the LoadedModel by request.model (404 if absent).
- Apply the Qwen3 chat template across all messages.
- Tokenize, then spawn_blocking onto tokio's blocking pool to acquire
the per-model arch lock and run prefill + greedy/temperature/top-p
sampling via LogitsProcessor.
- Stop on <|im_end|>/<|endoftext|> EOS or max_tokens (finish_reason
"stop" vs "length").
- Decode with skip_special_tokens=true, build OpenAI response with
prompt/completion/total usage counts.
Supporting changes:
- HarnessRegistry now stores Arc<dyn Harness> and caches a typed
Arc<CandleHarness> so inference routes bypass dyn-Trait dispatch.
- LoadedModel.arch becomes Arc<Mutex<ModelArch>> so the lock guard can
be moved into spawn_blocking.
- NeuronState gains an Option<Arc<CandleHarness>> field for the new
inference route.
- Typed InferenceError lets the handler map ModelNotLoaded → 404 and
other failures → 500 without string-matching anyhow messages.
- stream=true returns 501 until Stage 4 wires up SSE.
- Two leftover mistral.rs string references in proxy.rs and cortex-cli
(missed during the Stage 1 sweep) are corrected here.
Three new default-feature tests cover the no-candle 503,
model-not-loaded 404, and stream=true 501 paths. The cuda-integration
test from Stage 2 still covers real load/unload; a streaming-feature
gated test exercising actual generation will arrive with Stage 4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
5c2bd1a1da
|
feat(neuron): wire candle harness load/unload via GGUF
Stage 2 of the candle-native pivot. Fleshes out CandleHarness with a
LoadedModel registry keyed by model_id, hf-hub-backed GGUF download,
and Qwen3 quantized weight construction via candle-transformers'
quantized_qwen3 module. unload_model drops the entry; Drop on the
candle ModelWeights frees device memory.
Device selection prefers CUDA (gated behind the new `cuda` feature),
falling back to CPU when CUDA is unavailable so default builds work on
non-GPU hosts. The candle CUDA toolchain isn't pulled in unless
`--features cuda` is passed, keeping CI green on CPU runners.
Config gains a [harness.candle] block with an optional hf_cache path.
HarnessRegistry::from_configs now takes HarnessSettings so per-harness
config flows through.
A gated tests/candle_lifecycle.rs exercises real load → list → unload
→ list-empty when run with `--features cuda-integration` against a
host with HF network access. The default-feature test in tests/api.rs
covers the wrong-harness rejection path without needing the network.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
6c238f4557
|
refactor: rename cortex-neuron binary and crate to neuron
Package name, lib name, and binary all now just "neuron" without the
cortex- prefix.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|||
|
26e5e7ead8
|
feat: implement mistral.rs harness and neuron model API
- MistralRsHarness: Harness trait impl wrapping mistral.rs HTTP API
(list/load/unload models, health check, start/stop via systemd)
- HarnessRegistry: maps harness name -> Box<dyn Harness>, built from
neuron.toml config
- Neuron API endpoints: GET /models, POST /models/load,
POST /models/unload, GET /models/:id/endpoint
- NeuronConfig: figment-based config loading from neuron.toml
- Integration test: full model lifecycle through mock mistral.rs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|||
|
6dc717ebcd
|
feat: add neuron daemon with GPU discovery and health endpoints
Replace cortex-agent stub with neuron (cortex-neuron binary).
cortex-core additions:
- discovery.rs: DeviceInfo, DiscoveryResponse, DeviceHealth,
HealthResponse
- harness.rs: Harness async trait, HarnessConfig, ModelSpec, ModelInfo
neuron crate (crates/neuron/):
- discovery.rs: nvidia-smi CSV parsing (pure functions) + system
discovery via uname/nvidia-smi/nvcc
- health.rs: cached GPU health polling every 5s
- api.rs: GET /discovery and GET /health axum handlers
- main.rs: CLI entrypoint with --port flag (default 9090)
- harness stubs for mistralrs (Phase 8) and llamacpp (Phase 11)
12 new tests (9 unit + 3 integration), 35 total.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|