4 Commits

Author SHA1 Message Date
e874c3483d fix(rpm): explicitly Provides user(name) to satisfy systemd unit Requires
Some checks failed
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
CI / Format, lint, build, test (push) Has been cancelled
Diagnosing the persistent "Nothing to do" on v0.1.10 surfaced that
removing %attr(,,name) from %files wasn't enough. systemd-rpm-macros
ships its own rpm dep generator (/usr/lib/rpm/systemd.req) that parses
User=/Group= directives from every .service file the package ships
and emits Requires: user(NAME)/group(NAME) accordingly.

Rpmbuild log from v0.1.10 shows these Requires are still emitted even
after the %attr removal. Meanwhile the sysusers provides-generator
emits group(NAME) in both unversioned and versioned forms, but only
a versioned user(NAME) = <base64> when the u-line has GECOS/home/shell
fields. The asymmetry leaves Requires: user(NAME) unresolvable.

Add explicit Provides: user(NAME) back to both specs, with a comment
documenting the actual cause (systemd unit parsing, not file attrs)
so the next person touching these specs doesn't repeat the mistake.

Why monsoon didn't hit this: it creates its user in %pre via
groupadd/useradd (not sysusers.d), so no Provides are generated at
all — matching the Requires: user(monsoon) by luck of the rpm solver
treating unknown symbols as soft-fails for that path. Ours went through
the sysusers Provides code path and hit the asymmetry instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:30:55 +03:00
2caaae018a ci: migrate rpm changelog generation to reusable action
Replace the local .gitea/scripts/generate-rpm-changelog.sh with the
shared composite action at https://git.lair.cafe/actions/rpm-changelog@v1.
Behaviour is identical — collect commits since the previous v* tag,
filter bump-version and merge noise, prepend a dated entry to the
spec — but the logic now lives in one place that other projects can
consume.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:23:45 +03:00
18d00001cf ci: auto-generate rpm changelog entry per release
On every tag push, build a %changelog entry from the git log since
the previous v* tag and prepend it to each spec. Stops the initial
entry from drifting further and catches bogus-date / stale-version
warnings automatically since the generated date always matches the
day the CI runs.

The generator drops "chore: bump version" commits (bot-authored,
noisy in user-facing changelogs) and merge commits. Author defaults
to the gitea-actions identity but can be overridden via
CHANGELOG_AUTHOR env var if a human release is desired.

Requires fetch-depth: 0 on checkout so git describe can see prior
tags and git log can reach them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:04:36 +03:00
ad1442c096 fix(rpm): correct weekday in changelog entry
April 15 2026 was a Wednesday, not Tuesday. rpmbuild validates the
day-of-week against the date and warns on mismatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 14:58:40 +03:00
45 changed files with 531 additions and 4300 deletions

View File

@@ -1,326 +0,0 @@
name: build-prerelease
# Manually-dispatched workflow that builds CUDA-flavoured neuron binaries
# (and a single cortex binary), packages each as a Fedora RPM, signs
# them, and publishes to the `unstable` channel at rpm.lair.cafe.
#
# Trigger from the Gitea UI: Actions → build-prerelease → Run workflow.
# Optionally provide a `ref` to build from a non-default branch.
#
# The published packages are versioned as e.g.
# helexa-neuron-blackwell-0.1.16-0.1.20260518gitabcdef0.fc43.x86_64
# so they sort BELOW the eventual 0.1.16-1 stable release.
on:
# Auto-build on every push to main so the unstable channel tracks
# head without a manual dispatch step.
push:
branches: [main]
# Manual dispatch still available to build from a non-main ref.
workflow_dispatch:
inputs:
ref:
description: "Git ref to build (branch / tag / commit). Defaults to the workflow's branch."
required: false
default: ""
concurrency:
# Coalesce on branch+event so successive pushes don't pile up; the
# latest push wins.
group: prerelease-build-${{ github.ref }}
cancel-in-progress: true
env:
CARGO_INCREMENTAL: "0"
jobs:
prepare:
name: Resolve version stamps
runs-on: rust
outputs:
version: ${{ steps.info.outputs.version }}
release: ${{ steps.info.outputs.release }}
short_sha: ${{ steps.info.outputs.short_sha }}
commit_date: ${{ steps.info.outputs.commit_date }}
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
fetch-depth: 0
- id: info
run: |
set -eux
VERSION=$(awk -F\" '/^version[[:space:]]*=/ { print $2; exit }' Cargo.toml)
SHORT_SHA=$(git rev-parse --short=7 HEAD)
COMMIT_DATE=$(git log -1 --format=%cd --date=format:%Y%m%d HEAD)
# Prerelease release stamp sorts before "1" (the stable release).
RELEASE="0.1.${COMMIT_DATE}git${SHORT_SHA}"
echo "version=${VERSION}" >> "$GITHUB_OUTPUT"
echo "release=${RELEASE}" >> "$GITHUB_OUTPUT"
echo "short_sha=${SHORT_SHA}" >> "$GITHUB_OUTPUT"
echo "commit_date=${COMMIT_DATE}" >> "$GITHUB_OUTPUT"
build-cortex:
name: Build cortex binary
needs: prepare
# runner-rust image already provides rust/cargo/clippy/rustfmt via
# dnf — no rustup install step needed.
runs-on: rust
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- name: Build cortex (release)
run: cargo build --release -p cortex-cli
- name: Stage binary
run: |
mkdir --parents artifacts
cp target/release/cortex artifacts/cortex
./artifacts/cortex --version || true
- uses: actions/upload-artifact@v3
with:
name: cortex-fc43
path: artifacts/cortex
retention-days: 1
build-neuron:
name: Build neuron-${{ matrix.flavour }}
needs: prepare
strategy:
fail-fast: false
matrix:
include:
- flavour: ampere
compute_cap: "86"
runner: cuda-13.0
cuda_home: /usr/local/cuda-13.0
build_jobs: 8
nvcc_threads: 4
cargo_features: "cuda cudnn flash-attn"
- flavour: ada
compute_cap: "89"
runner: cuda-13.0
cuda_home: /usr/local/cuda-13.0
build_jobs: 8
nvcc_threads: 4
cargo_features: "cuda cudnn flash-attn"
- flavour: blackwell
compute_cap: "120"
runner: cuda-13.0
cuda_home: /usr/local/cuda-13.0
build_jobs: 8
nvcc_threads: 4
cargo_features: "cuda cudnn flash-attn"
runs-on: ${{ matrix.runner }}
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- name: Build neuron with CUDA (${{ matrix.flavour }})
run: |
set -eux
export PATH="${{ matrix.cuda_home }}/bin:${PATH}"
export LD_LIBRARY_PATH="${{ matrix.cuda_home }}/targets/x86_64-linux/lib:${{ matrix.cuda_home }}/lib64:${LD_LIBRARY_PATH:-}"
export LIBRARY_PATH="${{ matrix.cuda_home }}/targets/x86_64-linux/lib:${{ matrix.cuda_home }}/lib64:${LIBRARY_PATH:-}"
cargo build --release -p neuron --features "${{ matrix.cargo_features }}"
env:
CUDA_COMPUTE_CAP: ${{ matrix.compute_cap }}
CARGO_BUILD_JOBS: ${{ matrix.build_jobs }}
NVCC_THREADS: ${{ matrix.nvcc_threads }}
- name: Stage binary
run: |
mkdir --parents artifacts
cp target/release/neuron artifacts/neuron-${{ matrix.flavour }}
file "artifacts/neuron-${{ matrix.flavour }}"
- uses: actions/upload-artifact@v3
with:
name: neuron-${{ matrix.flavour }}-fc43
path: artifacts/neuron-${{ matrix.flavour }}
retention-days: 1
package-cortex:
name: Package cortex RPM
needs: [prepare, build-cortex]
runs-on: rpm
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- uses: actions/download-artifact@v3
with:
name: cortex-fc43
path: artifacts/
- name: Build RPM
run: |
set -eux
rm -f ~/.rpmmacros
rpmdev-setuptree
cp artifacts/cortex ~/rpmbuild/SOURCES/
cp data/cortex.service ~/rpmbuild/SOURCES/
cp data/cortex-sysusers.conf ~/rpmbuild/SOURCES/
cp data/cortex-firewalld.xml ~/rpmbuild/SOURCES/
cp cortex.example.toml ~/rpmbuild/SOURCES/
cp models.example.toml ~/rpmbuild/SOURCES/
cp LICENSE ~/rpmbuild/SOURCES/
rpmbuild -bb rpm/cortex-prerelease.spec \
--define "cortex_version ${{ needs.prepare.outputs.version }}" \
--define "cortex_prerelease ${{ needs.prepare.outputs.release }}" \
--undefine dist \
--define "dist .fc43"
- uses: actions/upload-artifact@v3
with:
name: rpm-cortex-fc43
path: ~/rpmbuild/RPMS/x86_64/*.rpm
retention-days: 7
package-neuron:
name: Package helexa-neuron-${{ matrix.flavour }} RPM
needs: [prepare, build-neuron]
runs-on: rpm
strategy:
fail-fast: false
matrix:
include:
- flavour: ampere
- flavour: ada
- flavour: blackwell
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- uses: actions/download-artifact@v3
with:
name: neuron-${{ matrix.flavour }}-fc43
path: artifacts/
- name: Build RPM
run: |
set -eux
rm -f ~/.rpmmacros
rpmdev-setuptree
cp artifacts/neuron-${{ matrix.flavour }} ~/rpmbuild/SOURCES/
cp data/neuron.service ~/rpmbuild/SOURCES/
cp data/neuron-sysusers.conf ~/rpmbuild/SOURCES/
cp data/neuron-firewalld.xml ~/rpmbuild/SOURCES/
cp neuron.example.toml ~/rpmbuild/SOURCES/
cp LICENSE ~/rpmbuild/SOURCES/
rpmbuild -bb rpm/helexa-neuron-prerelease.spec \
--define "neuron_version ${{ needs.prepare.outputs.version }}" \
--define "neuron_flavour ${{ matrix.flavour }}" \
--define "neuron_prerelease ${{ needs.prepare.outputs.release }}" \
--undefine dist \
--define "dist .fc43"
- uses: actions/upload-artifact@v3
with:
name: rpm-neuron-${{ matrix.flavour }}-fc43
path: ~/rpmbuild/RPMS/x86_64/*.rpm
retention-days: 7
publish:
name: Publish to rpm.lair.cafe (unstable)
needs: [package-cortex, package-neuron]
runs-on: rpm
concurrency:
group: rpm-publish
cancel-in-progress: false
env:
RPM_REPO_HOST: oolon.kosherinata.internal
FEDORA_VERSION: "43"
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- name: Download all built RPMs
uses: actions/download-artifact@v3
with:
path: rpms/
pattern: rpm-*-fc43
- name: Flatten RPM artifacts
run: |
set -eux
find rpms/ -name '*.rpm' -exec mv --target-directory=rpms/ {} +
find rpms/ -mindepth 1 -type d -empty -delete
ls -la rpms/
- name: Check for sequoia-sq
run: |
if ! command -v sq &> /dev/null; then
echo "ERROR: sequoia-sq is not installed. Install with: sudo dnf install sequoia-sq"
exit 1
fi
- name: Import signing key
env:
# Pass secrets via env so values stay out of the rendered shell
# script (which Gitea includes in step logs). Template
# expansion of ${{ secrets.X }} inside `run:` writes the literal
# value into the script and depends on Gitea's log masker to
# scrub it — fragile for multi-line keys.
RPM_SIGNING_KEY: ${{ secrets.RPM_SIGNING_KEY }}
RPM_SIGNING_KEY_ID: ${{ secrets.RPM_SIGNING_KEY_ID }}
run: |
echo "$RPM_SIGNING_KEY" | gpg --batch --import
fpr=$(gpg --batch --with-colons --list-keys "$RPM_SIGNING_KEY_ID" | awk -F: '/^fpr:/ { print $10; exit }')
echo "${fpr}:6:" | gpg --batch --import-ownertrust
sed "s/@GPG_NAME@/$RPM_SIGNING_KEY_ID/" rpm/rpmmacros > ~/.rpmmacros
- name: Sign RPMs
run: |
set -eux
for rpm in rpms/*.rpm; do
echo "signing ${rpm}..."
rpm --addsign "${rpm}"
done
- name: Set up SSH for rsync
run: |
install --directory --mode 700 ~/.ssh
echo "${RSYNC_SSH_KEY}" | install --mode 600 /dev/stdin ~/.ssh/id_ed25519
env:
RSYNC_SSH_KEY: ${{ secrets.RSYNC_SSH_KEY }}
- name: Test SSH connectivity
run: |
ssh -o StrictHostKeyChecking=accept-new "gitea_ci@${RPM_REPO_HOST}" exit
- name: Ensure unstable repo directory exists
run: |
ssh "gitea_ci@${RPM_REPO_HOST}" \
"mkdir --parents /var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable"
- name: Sync RPMs to unstable repo
run: |
rsync \
--archive \
--verbose \
--chmod D755,F644 \
rpms/*.rpm \
"gitea_ci@${RPM_REPO_HOST}:/var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable/"
- name: Update unstable repo metadata
run: |
ssh "gitea_ci@${RPM_REPO_HOST}" \
"cd /var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable && createrepo_c --update ."
- name: Generate packages.json manifest
run: |
scp script/generate-packages-json.py "gitea_ci@${RPM_REPO_HOST}:/tmp/"
ssh "gitea_ci@${RPM_REPO_HOST}" \
"python3 /tmp/generate-packages-json.py \
--repodata-dir /var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable/repodata \
--output /var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable/packages.json \
--base-url https://rpm.lair.cafe/fedora/${FEDORA_VERSION}/x86_64/unstable"

View File

@@ -16,42 +16,53 @@ env:
SCCACHE_S3_USE_SSL: "false" SCCACHE_S3_USE_SSL: "false"
AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }} AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }} AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
# fmt, clippy, and test all run in parallel on the same `rust` runner
# and would otherwise share /root/.cache/act/<hash>/hostexecutor/target/,
# racing each other's cargo temp files (.tmpXXXXXX) and failing builds
# mid-compile. Give each job its own target directory so the invocations
# don't collide. sccache still backs the actual rustc cache, so the
# rebuild penalty is small.
CARGO_TARGET_DIR: target-${{ github.job }}
jobs: jobs:
fmt: check:
name: Format name: Format, lint, build, test
runs-on: rust runs-on: fedora
steps: steps:
- uses: actions/checkout@v4 - uses: actions/checkout@v4
- run: cargo fmt --check --all
clippy: - name: Cache cargo registry and target
name: Clippy uses: actions/cache@v4
runs-on: rust with:
steps: path: |
- uses: actions/checkout@v4 ~/.cargo/bin
- run: cargo clippy --workspace -- -D warnings ~/.cargo/registry/index
- run: sccache --show-stats ~/.cargo/registry/cache
~/.cargo/git/db
target
key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
restore-keys: |
${{ runner.os }}-cargo-
test: - name: Ensure sccache with S3 support
name: Test env:
runs-on: rust RUSTC_WRAPPER: ""
steps: run: |
- uses: actions/checkout@v4 if sccache --version 2>/dev/null && sccache --show-stats 2>/dev/null; then
- run: cargo test --workspace echo "sccache with S3 support already installed"
- run: sccache --show-stats else
cargo install sccache --features s3 --locked
fi
- name: Check formatting
run: cargo fmt --check --all
- name: Clippy
run: cargo clippy --workspace -- -D warnings
- name: Test
run: cargo test --workspace
- name: Show sccache stats
run: sccache --show-stats
srpm-cortex: srpm-cortex:
name: Build cortex SRPM name: Build cortex SRPM
runs-on: rpm runs-on: fedora
needs: [fmt, clippy, test] needs: check
if: startsWith(github.ref, 'refs/tags/v') if: startsWith(github.ref, 'refs/tags/v')
steps: steps:
- uses: actions/checkout@v4 - uses: actions/checkout@v4
@@ -110,8 +121,8 @@ jobs:
srpm-neuron: srpm-neuron:
name: Build neuron SRPM name: Build neuron SRPM
runs-on: rpm runs-on: fedora
needs: [fmt, clippy, test] needs: check
if: startsWith(github.ref, 'refs/tags/v') if: startsWith(github.ref, 'refs/tags/v')
steps: steps:
- uses: actions/checkout@v4 - uses: actions/checkout@v4
@@ -128,37 +139,37 @@ jobs:
run: | run: |
VERSION="${{ steps.version.outputs.VERSION }}" VERSION="${{ steps.version.outputs.VERSION }}"
sed -i '/\[workspace\.package\]/,/\[/{ s/^version = ".*"/version = "'"${VERSION}"'"/ }' Cargo.toml sed -i '/\[workspace\.package\]/,/\[/{ s/^version = ".*"/version = "'"${VERSION}"'"/ }' Cargo.toml
sed -i "s/^Version:.*/Version: ${VERSION}/" helexa-neuron.spec sed -i "s/^Version:.*/Version: ${VERSION}/" neuron.spec
- name: Generate changelog entry - name: Generate changelog entry
uses: https://git.lair.cafe/actions/rpm-changelog@v1 uses: https://git.lair.cafe/actions/rpm-changelog@v1
with: with:
spec: helexa-neuron.spec spec: neuron.spec
version: ${{ steps.version.outputs.VERSION }} version: ${{ steps.version.outputs.VERSION }}
- name: Generate source tarball - name: Generate source tarball
run: | run: |
set -ex set -ex
VERSION="${{ steps.version.outputs.VERSION }}" VERSION="${{ steps.version.outputs.VERSION }}"
tar czf /tmp/helexa-neuron-${VERSION}.tar.gz \ tar czf /tmp/neuron-${VERSION}.tar.gz \
--transform "s,^\.,helexa-neuron-${VERSION}," \ --transform "s,^\.,neuron-${VERSION}," \
--exclude='./target' \ --exclude='./target' \
--exclude='./.git' \ --exclude='./.git' \
--exclude='*.tar.gz' \ --exclude='*.tar.gz' \
--exclude='*.src.rpm' \ --exclude='*.src.rpm' \
. .
mv /tmp/helexa-neuron-${VERSION}.tar.gz . mv /tmp/neuron-${VERSION}.tar.gz .
- name: Vendor Rust dependencies - name: Vendor Rust dependencies
run: | run: |
VERSION="${{ steps.version.outputs.VERSION }}" VERSION="${{ steps.version.outputs.VERSION }}"
cargo vendor vendor/ cargo vendor vendor/
tar czf helexa-neuron-${VERSION}-vendor.tar.gz vendor/ tar czf neuron-${VERSION}-vendor.tar.gz vendor/
rm -rf vendor/ rm -rf vendor/
- name: Build SRPM - name: Build SRPM
run: | run: |
rpmbuild -bs helexa-neuron.spec \ rpmbuild -bs neuron.spec \
--define "_sourcedir $(pwd)" \ --define "_sourcedir $(pwd)" \
--define "_srcrpmdir $(pwd)" --define "_srcrpmdir $(pwd)"
@@ -170,7 +181,7 @@ jobs:
copr-cortex: copr-cortex:
name: Publish cortex to COPR name: Publish cortex to COPR
runs-on: fedora-43 runs-on: fedora
needs: srpm-cortex needs: srpm-cortex
steps: steps:
- name: Download SRPM - name: Download SRPM
@@ -181,13 +192,13 @@ jobs:
- name: Publish to COPR - name: Publish to COPR
uses: https://git.lair.cafe/actions/copr-publish@v1 uses: https://git.lair.cafe/actions/copr-publish@v1
with: with:
project: helexa/helexa project: helexa/cortex
srpm: "*.src.rpm" srpm: "*.src.rpm"
copr-config: ${{ secrets.COPR_CONFIG }} copr-config: ${{ secrets.COPR_CONFIG }}
copr-neuron: copr-neuron:
name: Publish neuron to COPR name: Publish neuron to COPR
runs-on: fedora-43 runs-on: fedora
needs: srpm-neuron needs: srpm-neuron
steps: steps:
- name: Download SRPM - name: Download SRPM
@@ -198,53 +209,31 @@ jobs:
- name: Publish to COPR - name: Publish to COPR
uses: https://git.lair.cafe/actions/copr-publish@v1 uses: https://git.lair.cafe/actions/copr-publish@v1
with: with:
project: helexa/helexa project: helexa/neuron
srpm: "*.src.rpm" srpm: "*.src.rpm"
copr-config: ${{ secrets.COPR_CONFIG }} copr-config: ${{ secrets.COPR_CONFIG }}
bump-version: bump-version:
name: Bump version in source name: Bump version in source
runs-on: rust runs-on: fedora
needs: [copr-cortex, copr-neuron] needs: [copr-cortex, copr-neuron]
steps: steps:
- uses: actions/checkout@v4 - uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Determine version - name: Stamp version and push
id: version
run: echo "VERSION=${GITHUB_REF#refs/tags/v}" >> "$GITHUB_OUTPUT"
- name: Stamp version
run: |
VERSION="${{ steps.version.outputs.VERSION }}"
sed -i '/\[workspace\.package\]/,/\[/{ s/^version = ".*"/version = "'"${VERSION}"'"/ }' Cargo.toml
sed -i "s/^Version:.*/Version: ${VERSION}/" cortex.spec
sed -i "s/^Version:.*/Version: ${VERSION}/" helexa-neuron.spec
cargo check --workspace 2>/dev/null || true
- name: Generate cortex changelog entry
uses: https://git.lair.cafe/actions/rpm-changelog@v1
with:
spec: cortex.spec
version: ${{ steps.version.outputs.VERSION }}
- name: Generate helexa-neuron changelog entry
uses: https://git.lair.cafe/actions/rpm-changelog@v1
with:
spec: helexa-neuron.spec
version: ${{ steps.version.outputs.VERSION }}
- name: Commit and push
env: env:
GITEA_TOKEN: ${{ secrets.GITEA_TOKEN }} GITEA_TOKEN: ${{ secrets.GITEA_TOKEN }}
run: | run: |
VERSION="${{ steps.version.outputs.VERSION }}" VERSION="${GITHUB_REF#refs/tags/v}"
sed -i '/\[workspace\.package\]/,/\[/{ s/^version = ".*"/version = "'"${VERSION}"'"/ }' Cargo.toml
sed -i "s/^Version:.*/Version: ${VERSION}/" cortex.spec
sed -i "s/^Version:.*/Version: ${VERSION}/" neuron.spec
cargo check --workspace 2>/dev/null || true
git config user.name "Gitea Actions" git config user.name "Gitea Actions"
git config user.email "actions@git.lair.cafe" git config user.email "actions@git.lair.cafe"
git add Cargo.toml Cargo.lock cortex.spec helexa-neuron.spec git add Cargo.toml Cargo.lock cortex.spec neuron.spec
if git diff --cached --quiet; then if git diff --cached --quiet; then
echo "Nothing to commit for ${VERSION}" echo "Version already at ${VERSION}"
else else
git commit -m "chore: bump version to ${VERSION}" git commit -m "chore: bump version to ${VERSION}"
git remote set-url origin "https://gitea-actions:${GITEA_TOKEN}@git.lair.cafe/helexa/cortex.git" git remote set-url origin "https://gitea-actions:${GITEA_TOKEN}@git.lair.cafe/helexa/cortex.git"

116
CLAUDE.md
View File

@@ -125,8 +125,7 @@ automatically. Clippy warnings must be resolved, not suppressed with
- One or more GPU nodes running mistral.rs on port 8080 - One or more GPU nodes running mistral.rs on port 8080
- Optionally a metrics-only node (no GPU) for Prometheus/Grafana - Optionally a metrics-only node (no GPU) for Prometheus/Grafana
- Each node runs `mistralrs serve` on port 8080 - Each node runs `mistralrs serve` on port 8080
- Gateway listens on port 31313 (API) and 31314 (metrics) - Gateway listens on port 8000 (API) and 9100 (metrics)
- neuron listens on port 13131 on each GPU host
- TLS terminated at gateway or via nginx; internal traffic is plaintext over WireGuard - TLS terminated at gateway or via nginx; internal traffic is plaintext over WireGuard
## Conventions ## Conventions
@@ -381,7 +380,7 @@ processes (one process per loaded model, each on its own port).
## neuron API ## neuron API
neuron exposes an HTTP API on port 13131 that cortex polls and calls. neuron exposes an HTTP API on port 9090 that cortex polls and calls.
``` ```
GET /discovery GET /discovery
@@ -425,8 +424,8 @@ endpoint. cortex.toml shrinks to:
```toml ```toml
[gateway] [gateway]
listen = "0.0.0.0:31313" listen = "0.0.0.0:8000"
metrics_listen = "0.0.0.0:31314" metrics_listen = "0.0.0.0:9100"
[eviction] [eviction]
strategy = "lru" strategy = "lru"
@@ -434,15 +433,15 @@ defrag_after_cycles = 50
[[neurons]] [[neurons]]
name = "beast" name = "beast"
endpoint = "http://beast.hanzalova.internal:13131" endpoint = "http://beast.hanzalova.internal:9090"
[[neurons]] [[neurons]]
name = "benjy" name = "benjy"
endpoint = "http://benjy.hanzalova.internal:13131" endpoint = "http://benjy.kosherinata.internal:9090"
[[neurons]] [[neurons]]
name = "quadbrat" name = "quadbrat"
endpoint = "http://quadbrat.hanzalova.internal:13131" endpoint = "http://quadbrat.hanzalova.internal:9090"
``` ```
On startup and periodically, cortex calls `GET /discovery` and On startup and periodically, cortex calls `GET /discovery` and
@@ -522,7 +521,7 @@ cortex/
│ │ └── metrics.rs # prometheus exporter (unchanged) │ │ └── metrics.rs # prometheus exporter (unchanged)
│ ├── neuron/ # node plane (replaces cortex-agent) │ ├── neuron/ # node plane (replaces cortex-agent)
│ │ └── src/ │ │ └── src/
│ │ ├── main.rs # binary entrypoint, axum server on :13131 │ │ ├── main.rs # binary entrypoint, axum server on :9090
│ │ ├── discovery.rs # nvidia-smi, device enumeration │ │ ├── discovery.rs # nvidia-smi, device enumeration
│ │ ├── health.rs # runtime GPU polling │ │ ├── health.rs # runtime GPU polling
│ │ ├── api.rs # HTTP handlers for /discovery, /models, etc. │ │ ├── api.rs # HTTP handlers for /discovery, /models, etc.
@@ -596,65 +595,70 @@ placement matching can be added incrementally.
Completed. Both packages have RPM specs, systemd units, and example configs. Completed. Both packages have RPM specs, systemd units, and example configs.
CI builds parallel SRPMs on tag push and publishes to separate COPR repos. CI builds parallel SRPMs on tag push and publishes to separate COPR repos.
- `cortex.spec` — installs the `cortex` binary. Package name keeps the - `cortex.spec` `helexa/cortex` COPR: binary, systemd unit, config files
short `cortex` because no Fedora package collides with it. - `neuron.spec``helexa/neuron` COPR: binary, systemd unit, config
- `helexa-neuron.spec` — installs the `neuron` binary under package name
`helexa-neuron`. Renamed from bare `neuron` to avoid collision with
Fedora's NEURON neural-simulation package
(https://src.fedoraproject.org/rpms/neuron); binary, systemd unit,
system user, and config dir all stay named `neuron` since those are
project-local contexts.
- `data/cortex.service`, `data/neuron.service` — systemd units - `data/cortex.service`, `data/neuron.service` — systemd units
- `cortex.example.toml`, `neuron.example.toml`, `models.example.toml` - `cortex.example.toml`, `neuron.example.toml`, `models.example.toml`
- CI: parallel `srpm-cortex` + `srpm-neuron` jobs, then parallel COPR - CI: parallel `srpm-cortex` + `srpm-neuron` jobs, then parallel COPR publish
publish to a single project `helexa/helexa` hosting both packages.
Install: Install:
```sh ```sh
dnf copr enable helexa/helexa dnf copr enable helexa/cortex && dnf install cortex # gateway host
dnf install cortex # gateway host dnf copr enable helexa/neuron && dnf install neuron # GPU nodes
dnf install helexa-neuron # GPU nodes
``` ```
## 2026-05-18 addendum: candle-native pivot ### Phase 11: llama.cpp harness stub
Phases 11 (llama.cpp harness) and 12 (mistral.rs COPR) below are **Goal:** Prove the harness abstraction works with a second engine.
**superseded**. The project no longer treats mistral.rs or llama.cpp as
dependencies — both are conceptually out of scope. neuron becomes a
candle-native inference daemon, with `Harness` retained as an
internal seam for adding future engines (vision/audio/diffusion) but
its only implementation being in-process candle.
The full staged plan for this pivot lives at **Steps:**
`~/.claude/plans/create-a-more-aggressive-calm-naur.md`. Summary: 1. `crates/neuron/src/harness/llamacpp.rs` — implement the `Harness`
trait for llama.cpp's `llama-server`.
- `start()` — launch `llama-server` with the correct model path,
`--port`, `--n-gpu-layers`, `--tensor-split` args. Track the
child process.
- `stop()` — send SIGTERM to the child process.
- `list_models()` — llama-server serves one model per process, so
return a single-element list.
- `load_model()` — start a new llama-server process for this model.
- `unload_model()` — stop the process.
- `inference_endpoint()` — return `http://localhost:{assigned_port}`.
2. Port allocation: neuron assigns ports from a range (e.g. 8100-8199)
to llama-server instances.
3. Register in `HarnessRegistry` when configured:
```toml
[[harnesses]]
name = "llamacpp"
binary = "/usr/local/bin/llama-server"
port_range = [8100, 8199]
```
4. Tests: mock llama-server (simple HTTP server returning canned
responses), test load/unload/endpoint lifecycle.
- **Stage 1 (this commit):** delete `mistralrs.rs` and `llamacpp.rs`, **Done when:** A model with `harness = "llamacpp"` in `models.toml` can
scaffold inert `CandleHarness`, drop `endpoint`/`systemd_unit` from be loaded and served through cortex. Tests pass with mock llama-server.
`HarnessConfig`, default no-op `start`/`stop` on the `Harness` trait.
- **Stages 24:** wire up candle model load/unload (quantized Qwen3
first), add OpenAI-compatible inference endpoint in neuron, then SSE
streaming.
- **Stages 56:** load-on-activation (default models in config) and
unload-on-deactivation (graceful shutdown).
- **Stages 78:** multi-GPU tensor parallelism and broader model/quant
coverage.
Sections of this document that describe mistral.rs HTTP behaviour ### Phase 12 (lower priority): mistral.rs COPR packaging
("mistral.rs API gotchas") are retained as historical context for
Phases 110 — they document what was true while the project depended
on mistral.rs. They do not describe current behaviour.
--- **Goal:** Fedora RPMs for mistral.rs built against specific CUDA versions.
### Phase 11 (superseded): llama.cpp harness stub **Steps:**
1. `mistralrs-cuda.spec` — RPM spec that clones a pinned mistral.rs git
tag, builds with `--features cuda`, links against the system CUDA
toolkit. Produces `mistralrs-cuda13-server` (CUDA 13.x / sm_120) and
`mistralrs-cuda12-server` (CUDA 12.x / sm_89). Install binary to
`/usr/local/bin/mistralrs`.
2. COPR build config: enable the NVIDIA CUDA repo as a build dependency.
Pin the CUDA toolkit version in `BuildRequires`.
3. Gitea Actions or manual workflow: bump the mistral.rs tag in the spec,
trigger COPR rebuild.
4. neuron's mistralrs harness config references which binary/package
provides the mistral.rs binary. neuron could warn at startup if the
installed mistral.rs CUDA version doesn't match the discovered driver.
~~Originally planned as a second engine to prove the harness **Done when:** `dnf install mistralrs-cuda13-server` on beast provides a
abstraction.~~ Replaced by the candle harness work in the 2026-05-18 working `mistralrs` binary built for Blackwell GPUs. `dnf install
addendum above. llama.cpp's any-model/any-hardware breadth is no mistralrs-cuda12-server` on benjy provides one built for Ada GPUs.
longer in scope for helexa.
### Phase 12 (superseded): mistral.rs COPR packaging This is a separate repo/spec — not part of the cortex workspace — but
tightly coupled operationally. Track it as a sibling project.
~~Originally planned to ship CUDA-versioned mistral.rs RPMs.~~ Replaced
by the candle harness work in the 2026-05-18 addendum above. With
mistral.rs out of the dependency tree, there is nothing to package.

1611
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -8,7 +8,7 @@ members = [
] ]
[workspace.package] [workspace.package]
version = "0.1.16" version = "0.1.8"
edition = "2024" edition = "2024"
license = "GPL-3.0-or-later" license = "GPL-3.0-or-later"
repository = "https://git.lair.cafe/helexa/cortex" repository = "https://git.lair.cafe/helexa/cortex"
@@ -27,7 +27,7 @@ serde = { version = "1", features = ["derive"] }
serde_json = "1" serde_json = "1"
toml = "0.8" toml = "0.8"
# http client (for proxying to neuron backends) # http client (for proxying to mistralrs backends)
reqwest = { version = "0.12", features = ["json", "stream"] } reqwest = { version = "0.12", features = ["json", "stream"] }
# observability # observability

105
README.md
View File

@@ -1,23 +1,22 @@
# cortex # cortex
A Rust reverse-proxy and fleet management layer for multi-node GPU inference A Rust reverse-proxy and fleet management layer for multi-node
clusters. Cortex sits in front of one or more `neuron` daemons (each running [mistral.rs](https://github.com/EricLBuehler/mistral.rs) inference clusters.
candle-based inference on a local GPU host) and presents a unified OpenAI +
Anthropic compatible API surface.
## Problem ## Problem
Running local LLMs across multiple GPU nodes (different VRAM tiers, different Running local LLMs across multiple GPU nodes (different VRAM tiers, different
model affinities) requires a unified API surface that: model affinities) requires a unified API surface that:
- Presents a **single `/v1/models` catalogue** merging every model that can be - Presents a **single `/v1/models` catalogue** merging every model across every
served by any neuron in the fleet. node.
- **Routes requests** to the correct node based on where a model is loaded - **Routes requests** to the correct node based on where a model is loaded (or
(or can be loaded), handling cold-load and eviction transparently. *can* be loaded).
- Manages **model lifecycle** load on demand, unload cold models, pin - Manages **model lifecycle** — unload cold models, reload on demand, pin
critical ones — by calling each neuron's `/models/{load,unload}` API. critical ones — using the mistral.rs
`/v1/models/{unload,reload,status}` HTTP API (PR #1828+).
- Translates between **OpenAI and Anthropic** request/response envelopes so - Translates between **OpenAI and Anthropic** request/response envelopes so
every client speaks whichever dialect it prefers. every client in the homelab speaks whichever dialect it prefers.
- Captures **per-request metrics** (tokens, tok/s, TTFT, latency) and exposes - Captures **per-request metrics** (tokens, tok/s, TTFT, latency) and exposes
them as Prometheus counters/histograms. them as Prometheus counters/histograms.
@@ -31,17 +30,18 @@ model affinities) requires a unified API surface that:
└────────────────┴──────┬───────┴───────────────┘ └────────────────┴──────┬───────┴───────────────┘
┌──────────▼──────────┐ ┌──────────▼──────────┐
cortex │ cortex │
│ (cortex-gateway) │ (cortex-gateway)
│ │ │ │
│ Router · Metrics │ │ Router · Metrics │
│ Evictor · Translate│ │ Evictor · Translate│
└──┬──────┬────────┬──┘ └──┬──────┬────────┬──┘
│ │ │ │ │ │
┌──────────▼┐ ┌──▼─────┐ ┌▼──────────┐ ┌──────────▼┐ ┌──▼─────┐ ┌▼──────────┐
neuron │ │ neuron │ │ neuron gpu-large │ │gpu-med │ │ gpu-small
:13131 │ │ :13131 │ │ :13131 mistralrs │ │mistral │ │ mistralrs
candle │ │ candle │ │ candle serve │ │rs serve│ │ serve
│ :8080 │ │ :8080 │ │ :8080 │
└───────────┘ └────────┘ └───────────┘ └───────────┘ └────────┘ └───────────┘
private network (.internal) private network (.internal)
``` ```
@@ -50,48 +50,70 @@ model affinities) requires a unified API surface that:
| Crate | Purpose | | Crate | Purpose |
|---|---| |---|---|
| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types | | `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic request/response envelopes |
| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, poller, metrics exporter | | `cortex-gateway` | Axum HTTP server: proxy, router, evictor, metrics exporter |
| `neuron` | Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API | | `cortex-agent` | Per-node sidecar: polls local mistralrs, reports to gateway, handles restart/defrag |
| `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) | | `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) |
## Node setup ## Node setup
Each GPU node runs `neuron` (listening on `:13131`). Neuron uses Each GPU node runs `mistralrs serve` with a multi-model config. Models are
huggingface/candle for in-process inference — there is no external declared but start **unloaded** — mistral.rs lazy-loads on first request and
inference subprocess to manage. the gateway can explicitly unload/reload via the HTTP API.
The neuron RPM (`helexa-neuron`) ships a systemd unit: Example node systemd unit:
```sh ```ini
dnf copr enable helexa/helexa # /etc/systemd/system/mistralrs.service
dnf install helexa-neuron [Unit]
systemctl enable --now neuron Description=mistral.rs inference server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/mistralrs serve \
--from-config /etc/mistralrs/config.toml \
--port 8080
Restart=on-failure
RestartSec=5
Environment=CUDA_VISIBLE_DEVICES=0,1
[Install]
WantedBy=multi-user.target
``` ```
## Gateway config ## Gateway config
```toml ```toml
# /etc/cortex/cortex.toml # cortex.toml
[gateway] [gateway]
listen = "0.0.0.0:31313" listen = "0.0.0.0:8000"
metrics_listen = "0.0.0.0:31314" metrics_listen = "0.0.0.0:9100"
[eviction] [eviction]
strategy = "lru" # lru | priority strategy = "lru" # lru | priority
defrag_after_cycles = 50 defrag_after_cycles = 50
[[neurons]] [[nodes]]
name = "beast" name = "gpu-large"
endpoint = "http://beast.internal:13131" endpoint = "http://gpu-large.internal:8080"
vram_mb = 49_152 # e.g. 2x RTX 4090
pinned = ["your-org/large-model"]
[[neurons]] [[nodes]]
name = "benjy" name = "gpu-medium"
endpoint = "http://benjy.internal:13131" endpoint = "http://gpu-medium.internal:8080"
vram_mb = 24_576 # e.g. RTX 4090
pinned = ["your-org/medium-model"]
[[nodes]]
name = "gpu-small"
endpoint = "http://gpu-small.internal:8080"
vram_mb = 12_288 # e.g. RTX 3060
pinned = ["your-org/embedding-model"]
``` ```
Model placement profiles live in `models.toml` — see `models.example.toml`.
## Building ## Building
```sh ```sh
@@ -109,20 +131,19 @@ cargo clippy --workspace -- -D warnings # warnings are errors
cargo test --workspace # all tests must pass cargo test --workspace # all tests must pass
``` ```
Tagged releases (`v*`) additionally build SRPMs for both `cortex` and Tagged releases (`v*`) additionally build an SRPM and publish to COPR.
`helexa-neuron` and publish to COPR.
## Running ## Running
```sh ```sh
# start the gateway # start the gateway
cortex serve --config /etc/cortex/cortex.toml cortex serve --config cortex.toml
# check fleet status # check fleet status
cortex status cortex status
# list all models across nodes # list all models across nodes
curl http://localhost:31313/v1/models curl http://localhost:8000/v1/models
``` ```
## License ## License

View File

@@ -1,30 +0,0 @@
# Helexa fleet manifest.
#
# Drives rolling deploys via script/deploy.sh and serves as the source
# of truth for which hosts run cortex vs neuron, and which CUDA
# compute-capability flavour each neuron host needs.
#
# Flavour ↔ NVIDIA generation ↔ compute cap:
# ampere sm_86 (RTX 30 series — e.g. 3060)
# ada sm_89 (RTX 40 series — e.g. 4090)
# blackwell sm_120 (RTX 50 series — e.g. 5090)
#
# The flavour determines which RPM is installed on a given neuron host:
# helexa-neuron-<flavour>. Only one flavour may be installed at a time
# (the packages Conflict: with each other).
cortex:
host: hanzalova.internal
neurons:
- host: beast.hanzalova.internal
flavour: blackwell
gpu: "2x RTX 5090"
- host: benjy.hanzalova.internal
flavour: ada
gpu: "RTX 4090"
- host: quadbrat.hanzalova.internal
flavour: ampere
gpu: "RTX 3060"

View File

@@ -3,22 +3,22 @@
# Copy to cortex.toml and adjust for your environment. # Copy to cortex.toml and adjust for your environment.
# #
# Environment variable overrides use CORTEX_ prefix with __ separators: # Environment variable overrides use CORTEX_ prefix with __ separators:
# CORTEX_GATEWAY__LISTEN=0.0.0.0:31313 # CORTEX_GATEWAY__LISTEN=0.0.0.0:9000
[gateway] [gateway]
listen = "0.0.0.0:31313" listen = "0.0.0.0:8000"
metrics_listen = "0.0.0.0:31314" metrics_listen = "0.0.0.0:9100"
[eviction] [eviction]
strategy = "lru" strategy = "lru"
# Restart neurons after this many load/unload cycles to defragment VRAM. # Restart mistralrs after this many load/unload cycles to defragment VRAM.
# Set to 0 to disable. # Set to 0 to disable.
defrag_after_cycles = 50 defrag_after_cycles = 50
# -- Nodes --------------------------------------------------------------- # -- Nodes ---------------------------------------------------------------
# Each [[nodes]] entry declares a neuron daemon in the fleet. # Each [[nodes]] entry declares a mistral.rs instance in the fleet.
# Models are discovered by polling the neuron's /models endpoint. # Models are discovered by polling the node's /v1/models endpoint.
# Pinned models (see models.toml) are never evicted. # Pinned models are never evicted.
[[nodes]] [[nodes]]
name = "gpu-large" name = "gpu-large"

View File

@@ -1,5 +1,5 @@
Name: cortex Name: cortex
Version: 0.1.16 Version: 0.1.8
Release: 1%{?dist} Release: 1%{?dist}
Summary: Inference gateway for multi-node GPU clusters Summary: Inference gateway for multi-node GPU clusters
@@ -21,7 +21,6 @@ BuildRequires: systemd-rpm-macros
Requires(pre): shadow-utils Requires(pre): shadow-utils
Requires: systemd Requires: systemd
Requires: firewalld-filesystem
# systemd-rpm-macros ships a unit dep generator that parses User=/Group= # systemd-rpm-macros ships a unit dep generator that parses User=/Group=
# from our .service file and emits Requires: user(cortex)/group(cortex). # from our .service file and emits Requires: user(cortex)/group(cortex).
@@ -57,7 +56,6 @@ cargo build --release -p cortex-cli
install -Dm755 target/release/cortex %{buildroot}%{_bindir}/cortex install -Dm755 target/release/cortex %{buildroot}%{_bindir}/cortex
install -Dm644 data/cortex.service %{buildroot}%{_unitdir}/cortex.service install -Dm644 data/cortex.service %{buildroot}%{_unitdir}/cortex.service
install -Dm644 data/cortex-sysusers.conf %{buildroot}%{_sysusersdir}/cortex.conf install -Dm644 data/cortex-sysusers.conf %{buildroot}%{_sysusersdir}/cortex.conf
install -Dm644 data/cortex-firewalld.xml %{buildroot}%{_prefix}/lib/firewalld/services/cortex.xml
install -dm755 %{buildroot}%{_sysconfdir}/cortex install -dm755 %{buildroot}%{_sysconfdir}/cortex
install -Dm644 cortex.example.toml %{buildroot}%{_sysconfdir}/cortex/cortex.toml install -Dm644 cortex.example.toml %{buildroot}%{_sysconfdir}/cortex/cortex.toml
install -Dm644 models.example.toml %{buildroot}%{_sysconfdir}/cortex/models.toml install -Dm644 models.example.toml %{buildroot}%{_sysconfdir}/cortex/models.toml
@@ -80,21 +78,10 @@ install -Dm644 models.example.toml %{buildroot}%{_sysconfdir}/cortex/models.toml
%{_bindir}/cortex %{_bindir}/cortex
%{_unitdir}/cortex.service %{_unitdir}/cortex.service
%{_sysusersdir}/cortex.conf %{_sysusersdir}/cortex.conf
%{_prefix}/lib/firewalld/services/cortex.xml
%dir %{_sysconfdir}/cortex %dir %{_sysconfdir}/cortex
%config(noreplace) %{_sysconfdir}/cortex/cortex.toml %config(noreplace) %{_sysconfdir}/cortex/cortex.toml
%config(noreplace) %{_sysconfdir}/cortex/models.toml %config(noreplace) %{_sysconfdir}/cortex/models.toml
%changelog %changelog
* Thu Apr 16 2026 Gitea Actions <actions@git.lair.cafe> - 0.1.16-1
- chore: ignore local deploy script
- chore: move default ports out of common-collision ranges
- ci: drop actions/cache for cargo registry and target
* Thu Apr 16 2026 Gitea Actions <actions@git.lair.cafe> - 0.1.14-1
- ci: publish both packages to a single helexa/helexa COPR project
- fix(rpm): rename neuron package to helexa-neuron
- ci: commit generated %changelog entries back to main
* Wed Apr 15 2026 Rob Thijssen <grenade@rob.tn> - 0.1.0-1 * Wed Apr 15 2026 Rob Thijssen <grenade@rob.tn> - 0.1.0-1
- Initial package - Initial package

View File

@@ -5,7 +5,7 @@ use tracing_subscriber::EnvFilter;
#[derive(Parser)] #[derive(Parser)]
#[command(name = "cortex")] #[command(name = "cortex")]
#[command(about = "Unified inference gateway for multi-node GPU clusters")] #[command(about = "Unified inference gateway for multi-node mistral.rs clusters")]
#[command(version)] #[command(version)]
struct Cli { struct Cli {
#[command(subcommand)] #[command(subcommand)]
@@ -23,7 +23,7 @@ enum Commands {
/// Print the fleet status (models, nodes, health). /// Print the fleet status (models, nodes, health).
Status { Status {
/// Gateway API endpoint to query. /// Gateway API endpoint to query.
#[arg(short, long, default_value = "http://localhost:31313")] #[arg(short, long, default_value = "http://localhost:8000")]
endpoint: String, endpoint: String,
}, },
} }

View File

@@ -2,7 +2,7 @@
//! //!
//! These mirror the `/v1/messages` format used by the Anthropic API. //! These mirror the `/v1/messages` format used by the Anthropic API.
//! The gateway accepts these, translates to OpenAI format, proxies to //! The gateway accepts these, translates to OpenAI format, proxies to
//! the inference backend (neuron), then translates the response back. //! mistral.rs, then translates the response back.
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use serde_json::Value; use serde_json::Value;

View File

@@ -22,9 +22,9 @@ fn default_models_path() -> String {
#[derive(Debug, Clone, Serialize, Deserialize)] #[derive(Debug, Clone, Serialize, Deserialize)]
pub struct GatewaySettings { pub struct GatewaySettings {
/// Address to listen on for API requests (e.g. "0.0.0.0:31313") /// Address to listen on for API requests (e.g. "0.0.0.0:8000")
pub listen: String, pub listen: String,
/// Address to listen on for Prometheus metrics (e.g. "0.0.0.0:31314") /// Address to listen on for Prometheus metrics (e.g. "0.0.0.0:9100")
pub metrics_listen: String, pub metrics_listen: String,
} }
@@ -50,7 +50,7 @@ pub enum EvictionStrategy {
pub struct NeuronEndpoint { pub struct NeuronEndpoint {
/// Human-readable node name (e.g. "beast") /// Human-readable node name (e.g. "beast")
pub name: String, pub name: String,
/// Base URL of the neuron daemon (e.g. "http://beast.internal:13131") /// Base URL of the neuron daemon (e.g. "http://beast.internal:9090")
pub endpoint: String, pub endpoint: String,
} }
@@ -70,8 +70,8 @@ impl Default for GatewayConfig {
fn default() -> Self { fn default() -> Self {
Self { Self {
gateway: GatewaySettings { gateway: GatewaySettings {
listen: "0.0.0.0:31313".into(), listen: "0.0.0.0:8000".into(),
metrics_listen: "0.0.0.0:31314".into(), metrics_listen: "0.0.0.0:9100".into(),
}, },
eviction: EvictionSettings { eviction: EvictionSettings {
strategy: EvictionStrategy::Lru, strategy: EvictionStrategy::Lru,

View File

@@ -9,13 +9,13 @@ use async_trait::async_trait;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
/// Configuration for a harness instance on a neuron. /// Configuration for a harness instance on a neuron.
///
/// All current harnesses are in-process (candle); per-harness tuning
/// (cache paths, device policies, etc.) lives in dedicated config
/// blocks rather than on this struct.
#[derive(Debug, Clone, Serialize, Deserialize)] #[derive(Debug, Clone, Serialize, Deserialize)]
pub struct HarnessConfig { pub struct HarnessConfig {
pub name: String, pub name: String,
/// Base URL of the harness (e.g. "http://localhost:8080" for mistral.rs).
pub endpoint: Option<String>,
/// Systemd unit name, if the harness is managed via systemd.
pub systemd_unit: Option<String>,
} }
/// Health status of a harness process. /// Health status of a harness process.
@@ -47,24 +47,16 @@ pub struct ModelInfo {
} }
/// What an inference harness must do, from neuron's perspective. /// What an inference harness must do, from neuron's perspective.
///
/// All current harnesses are in-process — they share neuron's address
/// space and lifecycle. `start`/`stop` therefore default to no-ops; a
/// future process-supervising harness would override them.
#[async_trait] #[async_trait]
pub trait Harness: Send + Sync { pub trait Harness: Send + Sync {
/// Human-readable name (e.g. "candle"). /// Human-readable name (e.g. "mistralrs", "llamacpp", "comfyui").
fn name(&self) -> &str; fn name(&self) -> &str;
/// Start the harness. Default no-op for in-process harnesses. /// Start the harness process if it is not already running.
async fn start(&self, _config: &HarnessConfig) -> Result<()> { async fn start(&self, config: &HarnessConfig) -> Result<()>;
Ok(())
}
/// Stop the harness. Default no-op for in-process harnesses. /// Stop the harness process gracefully.
async fn stop(&self) -> Result<()> { async fn stop(&self) -> Result<()>;
Ok(())
}
/// Health check. Returns the harness process status. /// Health check. Returns the harness process status.
async fn health(&self) -> HarnessHealth; async fn health(&self) -> HarnessHealth;

View File

@@ -6,7 +6,7 @@ use std::collections::HashMap;
#[derive(Debug, Clone)] #[derive(Debug, Clone)]
pub struct NodeState { pub struct NodeState {
pub name: String, pub name: String,
/// Base URL of the neuron daemon (e.g. "http://beast.internal:13131"). /// Base URL of the neuron daemon (e.g. "http://beast.internal:9090").
pub endpoint: String, pub endpoint: String,
pub healthy: bool, pub healthy: bool,
pub models: HashMap<String, ModelEntry>, pub models: HashMap<String, ModelEntry>,

View File

@@ -3,7 +3,7 @@
//! These are a subset sufficient for chat completions (streaming + non-streaming). //! These are a subset sufficient for chat completions (streaming + non-streaming).
//! Fields not relevant to proxying are captured as `serde_json::Value` via //! Fields not relevant to proxying are captured as `serde_json::Value` via
//! `#[serde(flatten)]` so we forward them without needing to enumerate every //! `#[serde(flatten)]` so we forward them without needing to enumerate every
//! extension field a backend might support. //! extension field mistral.rs supports.
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use serde_json::Value; use serde_json::Value;
@@ -22,7 +22,7 @@ pub struct ChatCompletionRequest {
pub max_tokens: Option<u64>, pub max_tokens: Option<u64>,
#[serde(skip_serializing_if = "Option::is_none")] #[serde(skip_serializing_if = "Option::is_none")]
pub stream: Option<bool>, pub stream: Option<bool>,
/// All other fields (tools, response_format, backend extensions, etc.) /// All other fields (tools, response_format, mistral.rs extensions, etc.)
#[serde(flatten)] #[serde(flatten)]
pub extra: Value, pub extra: Value,
} }

View File

@@ -1,4 +1,4 @@
//! Streaming HTTP reverse proxy to neuron backends. //! Streaming HTTP reverse proxy to mistral.rs backends.
//! //!
//! For streaming requests, SSE chunks are forwarded as they arrive. //! For streaming requests, SSE chunks are forwarded as they arrive.
//! The proxy captures timing information for metrics but does not //! The proxy captures timing information for metrics but does not

View File

@@ -22,7 +22,6 @@ use tokio::net::TcpListener;
/// - GET /models/:id/endpoint (returns the inference URL) /// - GET /models/:id/endpoint (returns the inference URL)
/// - POST /models/unload (accepts unload requests) /// - POST /models/unload (accepts unload requests)
/// - GET /v1/chat/completions + POST /v1/chat/completions (inference) /// - GET /v1/chat/completions + POST /v1/chat/completions (inference)
///
/// Returns the neuron base URL. /// Returns the neuron base URL.
pub async fn spawn_mock_neuron() -> String { pub async fn spawn_mock_neuron() -> String {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap(); let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
@@ -55,7 +54,7 @@ pub async fn spawn_mock_neuron() -> String {
async fn mock_neuron_list_models() -> Json<Value> { async fn mock_neuron_list_models() -> Json<Value> {
Json(json!([ Json(json!([
{"id": "test-model", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": 8000} {"id": "test-model", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": 8000}
])) ]))
} }

View File

@@ -12,8 +12,8 @@ use std::sync::Arc;
async fn test_poller_discovers_models() { async fn test_poller_discovers_models() {
// Mock neuron reports 2 models via /models endpoint (neuron format). // Mock neuron reports 2 models via /models endpoint (neuron format).
let mock_url = common::spawn_mock_neuron_with_models(json!([ let mock_url = common::spawn_mock_neuron_with_models(json!([
{"id": "model-a", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": 8000}, {"id": "model-a", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": 8000},
{"id": "model-b", "harness": "candle", "status": "unloaded", "devices": [], "vram_used_mb": null} {"id": "model-b", "harness": "mistralrs", "status": "unloaded", "devices": [], "vram_used_mb": null}
])) ]))
.await; .await;
@@ -63,8 +63,8 @@ async fn test_poller_discovers_models() {
#[tokio::test] #[tokio::test]
async fn test_poller_updates_gateway_models_endpoint() { async fn test_poller_updates_gateway_models_endpoint() {
let mock_url = common::spawn_mock_neuron_with_models(json!([ let mock_url = common::spawn_mock_neuron_with_models(json!([
{"id": "model-x", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null}, {"id": "model-x", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null},
{"id": "model-y", "harness": "candle", "status": "loaded", "devices": [1], "vram_used_mb": null} {"id": "model-y", "harness": "mistralrs", "status": "loaded", "devices": [1], "vram_used_mb": null}
])) ]))
.await; .await;
@@ -152,8 +152,8 @@ async fn test_poller_marks_unreachable_node_unhealthy() {
#[tokio::test] #[tokio::test]
async fn test_poller_removes_stale_models() { async fn test_poller_removes_stale_models() {
let mock_url = common::spawn_mock_neuron_with_models(json!([ let mock_url = common::spawn_mock_neuron_with_models(json!([
{"id": "keep-me", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null}, {"id": "keep-me", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null},
{"id": "drop-me", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null} {"id": "drop-me", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null}
])) ]))
.await; .await;
@@ -183,7 +183,7 @@ async fn test_poller_removes_stale_models() {
// New mock with only one model. // New mock with only one model.
let new_mock_url = common::spawn_mock_neuron_with_models(json!([ let new_mock_url = common::spawn_mock_neuron_with_models(json!([
{"id": "keep-me", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null} {"id": "keep-me", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null}
])) ]))
.await; .await;

View File

@@ -51,18 +51,18 @@ async fn test_streaming_sse_passthrough() {
} }
assert!( assert!(
chunks.len() > chunk_count, chunks.len() >= chunk_count + 1,
"expected more than {} chunks (got {}): {:?}", "expected at least {} chunks (got {}): {:?}",
chunk_count, chunk_count + 1,
chunks.len(), chunks.len(),
chunks, chunks,
); );
assert_eq!(chunks.last().unwrap(), "[DONE]"); assert_eq!(chunks.last().unwrap(), "[DONE]");
for (i, chunk) in chunks.iter().enumerate().take(chunk_count) { for i in 0..chunk_count {
let chunk_json: serde_json::Value = let chunk_json: serde_json::Value =
serde_json::from_str(chunk).expect("chunk should be valid JSON"); serde_json::from_str(&chunks[i]).expect("chunk should be valid JSON");
assert_eq!( assert_eq!(
chunk_json["choices"][0]["delta"]["content"], chunk_json["choices"][0]["delta"]["content"],
format!("token{i}") format!("token{i}")

View File

@@ -12,30 +12,6 @@ path = "src/lib.rs"
name = "neuron" name = "neuron"
path = "src/main.rs" path = "src/main.rs"
[features]
default = []
# Enables CUDA acceleration in candle. Without this feature, candle
# compiles for CPU only and Device::new_cuda calls fall back to CPU.
cuda = [
"candle-core/cuda",
"candle-nn/cuda",
"candle-transformers/cuda",
]
# Use cuDNN for convolution / attention kernels. Requires CUDA.
cudnn = [
"cuda",
"candle-core/cudnn",
"candle-nn/cudnn",
"candle-transformers/cudnn",
]
# FlashAttention kernels. Requires CUDA.
flash-attn = [
"cuda",
"candle-transformers/flash-attn",
]
# Reserved for GPU-only integration tests in later stages.
cuda-integration = ["cuda"]
[dependencies] [dependencies]
cortex-core.workspace = true cortex-core.workspace = true
tokio.workspace = true tokio.workspace = true
@@ -48,21 +24,9 @@ tracing-subscriber.workspace = true
anyhow.workspace = true anyhow.workspace = true
async-trait.workspace = true async-trait.workspace = true
clap.workspace = true clap.workspace = true
thiserror.workspace = true
futures.workspace = true
tokio-stream.workspace = true
figment.workspace = true figment.workspace = true
toml.workspace = true toml.workspace = true
# candle for in-process inference. CUDA support is gated behind the
# crate's `cuda` feature (default off) so the workspace builds on
# non-CUDA hosts and CI runners.
candle-core = "0.10.2"
candle-nn = "0.10.2"
candle-transformers = "0.10.2"
tokenizers = { version = "0.22", default-features = false, features = ["onig"] }
hf-hub = { version = "0.4", features = ["tokio"] }
[dev-dependencies] [dev-dependencies]
tokio = { workspace = true, features = ["test-util"] } tokio = { workspace = true, features = ["test-util"] }
reqwest.workspace = true reqwest.workspace = true

View File

@@ -1,33 +1,23 @@
//! HTTP API handlers for the neuron daemon. //! HTTP API handlers for the neuron daemon.
use crate::harness::HarnessRegistry; use crate::harness::HarnessRegistry;
use crate::harness::candle::{CandleHarness, InferenceError};
use crate::health::HealthCache; use crate::health::HealthCache;
use axum::Router; use axum::Router;
use axum::extract::{Path, State}; use axum::extract::{Path, State};
use axum::http::StatusCode; use axum::http::StatusCode;
use axum::response::sse::{Event, KeepAlive, Sse};
use axum::response::{IntoResponse, Json}; use axum::response::{IntoResponse, Json};
use axum::routing::{get, post}; use axum::routing::{get, post};
use cortex_core::discovery::{DiscoveryResponse, HealthResponse}; use cortex_core::discovery::{DiscoveryResponse, HealthResponse};
use cortex_core::harness::ModelSpec; use cortex_core::harness::ModelSpec;
use cortex_core::openai::ChatCompletionRequest;
use futures::stream::{self, StreamExt};
use serde_json::{Value, json}; use serde_json::{Value, json};
use std::convert::Infallible;
use std::sync::Arc; use std::sync::Arc;
use tokio::sync::RwLock; use tokio::sync::RwLock;
use tokio_stream::wrappers::ReceiverStream;
/// Shared state for the neuron HTTP server. /// Shared state for the neuron HTTP server.
pub struct NeuronState { pub struct NeuronState {
pub discovery: DiscoveryResponse, pub discovery: DiscoveryResponse,
pub health_cache: Arc<HealthCache>, pub health_cache: Arc<HealthCache>,
pub registry: RwLock<HarnessRegistry>, pub registry: RwLock<HarnessRegistry>,
/// Typed handle to the candle harness for inference routes. Cached at
/// startup so `/v1/chat/completions` doesn't have to hold the registry
/// read lock or perform dyn-Trait dispatch per request.
pub candle: Option<Arc<CandleHarness>>,
} }
/// Build the neuron API router. /// Build the neuron API router.
@@ -39,7 +29,6 @@ pub fn neuron_routes() -> Router<Arc<NeuronState>> {
.route("/models/load", post(load_model)) .route("/models/load", post(load_model))
.route("/models/unload", post(unload_model)) .route("/models/unload", post(unload_model))
.route("/models/{model_id}/endpoint", get(model_endpoint)) .route("/models/{model_id}/endpoint", get(model_endpoint))
.route("/v1/chat/completions", post(chat_completions))
} }
async fn discovery_handler(State(state): State<Arc<NeuronState>>) -> Json<DiscoveryResponse> { async fn discovery_handler(State(state): State<Arc<NeuronState>>) -> Json<DiscoveryResponse> {
@@ -56,7 +45,7 @@ async fn list_models(State(state): State<Arc<NeuronState>>) -> impl IntoResponse
Ok(models) => Json(json!(models)).into_response(), Ok(models) => Json(json!(models)).into_response(),
Err(e) => ( Err(e) => (
StatusCode::INTERNAL_SERVER_ERROR, StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({"error": format!("{e:#}")})), Json(json!({"error": e.to_string()})),
) )
.into_response(), .into_response(),
} }
@@ -69,22 +58,11 @@ async fn load_model(
let registry = state.registry.read().await; let registry = state.registry.read().await;
match registry.load_model(&spec).await { match registry.load_model(&spec).await {
Ok(()) => Json(json!({"status": "loaded"})).into_response(), Ok(()) => Json(json!({"status": "loaded"})).into_response(),
Err(e) => { Err(e) => (
// Log the full anyhow chain server-side so journalctl shows StatusCode::BAD_REQUEST,
// the underlying failure (hf-hub timeout, permission denied, Json(json!({"error": e.to_string()})),
// disk full, etc.) without needing to inspect the HTTP )
// response body separately. .into_response(),
tracing::warn!(
model = %spec.model_id,
error = %format!("{e:#}"),
"load_model failed"
);
(
StatusCode::BAD_REQUEST,
Json(json!({"error": format!("{e:#}")})),
)
.into_response()
}
} }
} }
@@ -106,11 +84,7 @@ async fn unload_model(
let registry = state.registry.read().await; let registry = state.registry.read().await;
match registry.unload_model(&model_id).await { match registry.unload_model(&model_id).await {
Ok(()) => Json(json!({"status": "unloaded"})).into_response(), Ok(()) => Json(json!({"status": "unloaded"})).into_response(),
Err(e) => ( Err(e) => (StatusCode::NOT_FOUND, Json(json!({"error": e.to_string()}))).into_response(),
StatusCode::NOT_FOUND,
Json(json!({"error": format!("{e:#}")})),
)
.into_response(),
} }
} }
@@ -128,61 +102,3 @@ async fn model_endpoint(
.into_response(), .into_response(),
} }
} }
/// OpenAI-compatible chat completions. Dispatches to streaming SSE when
/// `stream: true` is set on the request; otherwise returns a single
/// `ChatCompletionResponse`.
async fn chat_completions(
State(state): State<Arc<NeuronState>>,
Json(req): Json<ChatCompletionRequest>,
) -> impl IntoResponse {
let Some(candle) = state.candle.as_ref().map(Arc::clone) else {
return (
StatusCode::SERVICE_UNAVAILABLE,
Json(json!({"error": "candle harness not enabled on this neuron"})),
)
.into_response();
};
if req.stream.unwrap_or(false) {
match candle.chat_completion_stream(req).await {
Ok(rx) => {
// Each chunk → one SSE `data: {json}` line. After the
// channel closes, append the OpenAI [DONE] terminator.
let body_stream = ReceiverStream::new(rx).map(|chunk| {
let body = serde_json::to_string(&chunk).unwrap_or_default();
Ok::<_, Infallible>(Event::default().data(body))
});
let done_stream =
stream::once(async { Ok::<_, Infallible>(Event::default().data("[DONE]")) });
Sse::new(body_stream.chain(done_stream))
.keep_alive(KeepAlive::default())
.into_response()
}
Err(InferenceError::ModelNotLoaded(id)) => (
StatusCode::NOT_FOUND,
Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
)
.into_response(),
Err(InferenceError::Other(e)) => (
StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({"error": format!("{e:#}")})),
)
.into_response(),
}
} else {
match candle.chat_completion(req).await {
Ok(resp) => Json(resp).into_response(),
Err(InferenceError::ModelNotLoaded(id)) => (
StatusCode::NOT_FOUND,
Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
)
.into_response(),
Err(InferenceError::Other(e)) => (
StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({"error": format!("{e:#}")})),
)
.into_response(),
}
}
}

View File

@@ -1,12 +1,12 @@
//! Neuron configuration loaded from neuron.toml. //! Neuron configuration loaded from neuron.toml.
use cortex_core::harness::{HarnessConfig, ModelSpec}; use cortex_core::harness::HarnessConfig;
use figment::{ use figment::{
Figment, Figment,
providers::{Env, Format, Toml}, providers::{Env, Format, Toml},
}; };
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use std::path::{Path, PathBuf}; use std::path::Path;
#[derive(Debug, Clone, Serialize, Deserialize)] #[derive(Debug, Clone, Serialize, Deserialize)]
pub struct NeuronConfig { pub struct NeuronConfig {
@@ -14,35 +14,10 @@ pub struct NeuronConfig {
pub port: u16, pub port: u16,
#[serde(default)] #[serde(default)]
pub harnesses: Vec<HarnessConfig>, pub harnesses: Vec<HarnessConfig>,
/// Per-harness configuration. Currently only `candle` is recognised.
#[serde(default)]
pub harness: HarnessSettings,
/// Models to auto-load when the neuron service activates. Each entry
/// is loaded sequentially before the HTTP listener binds. A failure
/// on any single entry logs a warning and proceeds — broken entries
/// don't prevent the rest of the fleet from starting.
#[serde(default)]
pub default_models: Vec<ModelSpec>,
}
/// Settings for individual harness implementations. Each harness owns
/// its own sub-table so users only configure the harnesses they enable.
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct HarnessSettings {
#[serde(default)]
pub candle: CandleHarnessConfig,
}
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct CandleHarnessConfig {
/// HuggingFace cache directory for model weights.
/// When unset, defers to hf-hub's default (~/.cache/huggingface).
#[serde(default)]
pub hf_cache: Option<PathBuf>,
} }
fn default_port() -> u16 { fn default_port() -> u16 {
13131 9090
} }
impl NeuronConfig { impl NeuronConfig {
@@ -58,10 +33,8 @@ impl NeuronConfig {
impl Default for NeuronConfig { impl Default for NeuronConfig {
fn default() -> Self { fn default() -> Self {
Self { Self {
port: 13131, port: 9090,
harnesses: vec![], harnesses: vec![],
harness: HarnessSettings::default(),
default_models: vec![],
} }
} }
} }

View File

@@ -1,687 +0,0 @@
//! Candle harness — in-process inference using huggingface/candle.
//!
//! This is the sole `Harness` implementation. Inference runs inside
//! the neuron process; there is no external subprocess.
//!
//! - Stage 2 wired GGUF (Qwen3 only) load/unload via `quantized_qwen3`.
//! - Stage 3 (this) adds `chat_completion` — a non-streaming OpenAI
//! compatible chat completion routed to the loaded model's forward
//! pass on a per-model serialised generation loop.
use anyhow::{Context, Result};
use async_trait::async_trait;
use candle_core::quantized::gguf_file;
use candle_core::{Device, Tensor};
use candle_transformers::generation::{LogitsProcessor, Sampling};
use candle_transformers::models::quantized_qwen3::ModelWeights as QuantizedQwen3Weights;
use cortex_core::harness::{Harness, HarnessHealth, ModelInfo, ModelSpec};
use cortex_core::openai::{
ChatCompletionChoice, ChatCompletionChunk, ChatCompletionRequest, ChatCompletionResponse,
ChatMessage, ChunkChoice, MessageContent, Usage,
};
use serde_json::json;
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};
use tokenizers::Tokenizer;
use tokio::sync::{Mutex, RwLock, mpsc};
/// In-process candle harness. Owns the loaded model registry.
pub struct CandleHarness {
models: Arc<RwLock<HashMap<String, Arc<LoadedModel>>>>,
hf_cache: Option<PathBuf>,
bind_url: String,
}
/// A loaded model with its tokenizer, device placement, and architecture-
/// specific weights. The `arch` is `Arc<Mutex<>>` so the lock guard can be
/// moved into `spawn_blocking` for synchronous candle forward passes.
pub struct LoadedModel {
pub model_id: String,
pub arch: Arc<Mutex<ModelArch>>,
pub tokenizer: Tokenizer,
pub device: Device,
pub quant: Option<String>,
pub devices: Vec<u32>,
}
/// Architecture-specific weights. Stage 3 still supports only Qwen3
/// quantized; Stage 8 broadens this to additional families and
/// non-quantized variants.
pub enum ModelArch {
Qwen3Quantized(QuantizedQwen3Weights),
}
impl CandleHarness {
pub fn new(bind_url: String, hf_cache: Option<PathBuf>) -> Self {
Self {
models: Arc::new(RwLock::new(HashMap::new())),
hf_cache,
bind_url,
}
}
/// Pick a candle `Device` for the requested indices. Without the
/// `cuda` feature, or if CUDA initialisation fails, falls back to CPU.
fn pick_device(devices: &[u32]) -> Result<Device> {
let _idx = devices.first().copied().unwrap_or(0) as usize;
#[cfg(feature = "cuda")]
{
match Device::new_cuda(_idx) {
Ok(d) => return Ok(d),
Err(e) => tracing::warn!(
device = _idx,
error = %e,
"CUDA device unavailable, falling back to CPU"
),
}
}
Ok(Device::Cpu)
}
/// Resolve a model spec to local GGUF and tokenizer file paths via
/// hf-hub. Downloads on first use; subsequent calls are cached.
async fn resolve_files(&self, spec: &ModelSpec) -> Result<(PathBuf, PathBuf)> {
let mut builder = hf_hub::api::tokio::ApiBuilder::new();
if let Some(cache) = &self.hf_cache {
builder = builder.with_cache_dir(cache.clone());
}
let api = builder.build().context("build hf-hub API")?;
let repo = api.model(spec.model_id.clone());
let info = repo
.info()
.await
.with_context(|| format!("fetch HF repo info for {}", spec.model_id))?;
let quant = spec.quant.as_deref().unwrap_or("");
let quant_lc = quant.to_lowercase();
let gguf_filename = info
.siblings
.iter()
.map(|s| s.rfilename.as_str())
.filter(|name| name.to_lowercase().ends_with(".gguf"))
.find(|name| quant_lc.is_empty() || name.to_lowercase().contains(&quant_lc))
.ok_or_else(|| {
anyhow::anyhow!(
"no GGUF file matching quant {:?} in repo {}",
spec.quant,
spec.model_id
)
})?
.to_string();
tracing::info!(
model = %spec.model_id,
file = %gguf_filename,
"resolving GGUF (may be cached)"
);
let gguf_path = repo
.get(&gguf_filename)
.await
.with_context(|| format!("fetch GGUF {gguf_filename}"))?;
// GGUF-only HF repos (unsloth/Qwen3-*-GGUF, Qwen/Qwen3-*-GGUF,
// etc.) ship the .gguf file but not tokenizer.json — the
// tokenizer.json lives in the base non-GGUF repo. Derive the
// base repo id by stripping a `-GGUF` / `-gguf` suffix; if
// there's no such suffix the same repo is used (works for
// non-GGUF model_ids).
let tokenizer_repo_id = spec
.model_id
.strip_suffix("-GGUF")
.or_else(|| spec.model_id.strip_suffix("-gguf"))
.unwrap_or(spec.model_id.as_str())
.to_string();
let tokenizer_repo = if tokenizer_repo_id == spec.model_id {
repo
} else {
tracing::debug!(
from = %spec.model_id,
to = %tokenizer_repo_id,
"tokenizer.json sourced from base repo (GGUF suffix stripped)"
);
api.model(tokenizer_repo_id.clone())
};
let tokenizer_path = tokenizer_repo
.get("tokenizer.json")
.await
.with_context(|| format!("fetch tokenizer.json from {tokenizer_repo_id}"))?;
Ok((gguf_path, tokenizer_path))
}
/// Run a non-streaming chat completion against a loaded model.
///
/// Returns a typed `InferenceError` when the model isn't loaded so the
/// handler can map to an appropriate HTTP status without string-matching.
pub async fn chat_completion(
&self,
request: ChatCompletionRequest,
) -> Result<ChatCompletionResponse, InferenceError> {
let loaded = {
let models = self.models.read().await;
models.get(&request.model).cloned()
};
let loaded = loaded.ok_or_else(|| InferenceError::ModelNotLoaded(request.model.clone()))?;
let prompt = format_qwen3_prompt(&request.messages);
let encoding = loaded
.tokenizer
.encode(prompt.as_str(), true)
.map_err(|e| InferenceError::Other(anyhow::anyhow!("tokenize: {e}")))?;
let prompt_tokens: Vec<u32> = encoding.get_ids().to_vec();
let prompt_len = prompt_tokens.len();
let temperature = request.temperature.unwrap_or(0.7);
let top_p = request.top_p;
let max_new = request.max_tokens.unwrap_or(512) as usize;
let seed = unix_subsec_nanos();
let eos_id = loaded
.tokenizer
.token_to_id("<|im_end|>")
.or_else(|| loaded.tokenizer.token_to_id("<|endoftext|>"));
let arch_arc = Arc::clone(&loaded.arch);
let device = loaded.device.clone();
let model_id = request.model.clone();
let (generated_ids, finish_reason) =
tokio::task::spawn_blocking(move || -> Result<(Vec<u32>, String)> {
let mut guard = arch_arc.blocking_lock();
run_inference(
&mut guard,
&device,
&prompt_tokens,
max_new,
temperature,
top_p,
seed,
eos_id,
)
})
.await
.map_err(|e| InferenceError::Other(anyhow::anyhow!("inference task panicked: {e}")))?
.map_err(InferenceError::Other)?;
let completion_text = loaded
.tokenizer
.decode(&generated_ids, true)
.map_err(|e| InferenceError::Other(anyhow::anyhow!("detokenize: {e}")))?;
let usage = Usage {
prompt_tokens: prompt_len as u64,
completion_tokens: generated_ids.len() as u64,
total_tokens: (prompt_len + generated_ids.len()) as u64,
};
Ok(ChatCompletionResponse {
id: format!("chatcmpl-{:x}", unix_subsec_nanos()),
object: "chat.completion".into(),
created: unix_now_secs(),
model: model_id,
choices: vec![ChatCompletionChoice {
index: 0,
message: ChatMessage {
role: "assistant".into(),
content: MessageContent::Text(completion_text),
extra: serde_json::Value::Object(Default::default()),
},
finish_reason: Some(finish_reason),
extra: serde_json::Value::Object(Default::default()),
}],
usage: Some(usage),
extra: serde_json::Value::Object(Default::default()),
})
}
/// Run a streaming chat completion against a loaded model.
///
/// Returns an `mpsc::Receiver` that yields `ChatCompletionChunk`s in
/// OpenAI SSE format. The first chunk carries the assistant role;
/// subsequent chunks carry incremental `content` deltas; the final
/// chunk carries `finish_reason`. The handler is responsible for
/// wrapping these into an SSE response and appending the `[DONE]`
/// terminator.
///
/// Token-by-token decoding tracks the cumulative decoded prefix so
/// BPE byte-fallback boundaries don't split a UTF-8 char across
/// chunks.
pub async fn chat_completion_stream(
&self,
request: ChatCompletionRequest,
) -> Result<mpsc::Receiver<ChatCompletionChunk>, InferenceError> {
let loaded = {
let models = self.models.read().await;
models.get(&request.model).cloned()
};
let loaded = loaded.ok_or_else(|| InferenceError::ModelNotLoaded(request.model.clone()))?;
let prompt = format_qwen3_prompt(&request.messages);
let encoding = loaded
.tokenizer
.encode(prompt.as_str(), true)
.map_err(|e| InferenceError::Other(anyhow::anyhow!("tokenize: {e}")))?;
let prompt_tokens: Vec<u32> = encoding.get_ids().to_vec();
let temperature = request.temperature.unwrap_or(0.7);
let top_p = request.top_p;
let max_new = request.max_tokens.unwrap_or(512) as usize;
let seed = unix_subsec_nanos();
let eos_id = loaded
.tokenizer
.token_to_id("<|im_end|>")
.or_else(|| loaded.tokenizer.token_to_id("<|endoftext|>"));
let arch_arc = Arc::clone(&loaded.arch);
let device = loaded.device.clone();
let tokenizer = loaded.tokenizer.clone();
let model_id = request.model.clone();
let id = format!("chatcmpl-{:x}", unix_subsec_nanos());
let created = unix_now_secs();
// Bounded channel so the producer (blocking inference) is back-
// pressured by the consumer (SSE writer). 32 is generous —
// tokens arrive one at a time and the SSE writer is async.
let (tx, rx) = mpsc::channel::<ChatCompletionChunk>(32);
// Lead chunk: announce the assistant role per OpenAI streaming
// conventions. Tools that auto-detect a streaming reply expect
// this before any content delta.
let role_chunk = ChatCompletionChunk {
id: id.clone(),
object: "chat.completion.chunk".into(),
created,
model: model_id.clone(),
choices: vec![ChunkChoice {
index: 0,
delta: json!({"role": "assistant"}),
finish_reason: None,
extra: serde_json::Value::Object(Default::default()),
}],
usage: None,
extra: serde_json::Value::Object(Default::default()),
};
// If sending the role chunk fails the receiver is already gone;
// bail before kicking off the heavy blocking work.
tx.send(role_chunk)
.await
.map_err(|_| InferenceError::Other(anyhow::anyhow!("client disconnected")))?;
tokio::task::spawn_blocking(move || {
let mut guard = arch_arc.blocking_lock();
if let Err(e) = run_inference_streaming(
&mut guard,
&device,
&tokenizer,
&prompt_tokens,
max_new,
temperature,
top_p,
seed,
eos_id,
&id,
created,
&model_id,
&tx,
) {
tracing::warn!(model = %model_id, error = %e, "streaming inference failed");
}
});
Ok(rx)
}
}
#[async_trait]
impl Harness for CandleHarness {
fn name(&self) -> &str {
"candle"
}
async fn health(&self) -> HarnessHealth {
HarnessHealth {
name: "candle".into(),
running: true,
uptime_secs: None,
}
}
async fn list_models(&self) -> Result<Vec<ModelInfo>> {
let models = self.models.read().await;
Ok(models
.values()
.map(|m| ModelInfo {
id: m.model_id.clone(),
harness: "candle".into(),
status: "loaded".into(),
devices: m.devices.clone(),
vram_used_mb: None,
})
.collect())
}
async fn load_model(&self, spec: &ModelSpec) -> Result<()> {
if spec.harness != "candle" {
anyhow::bail!("expected harness=candle, got harness={}", spec.harness);
}
{
let models = self.models.read().await;
if models.contains_key(&spec.model_id) {
anyhow::bail!("model '{}' already loaded", spec.model_id);
}
}
let devices = spec.devices.clone().unwrap_or_else(|| vec![0]);
let device = Self::pick_device(&devices)?;
let (gguf_path, tokenizer_path) = self.resolve_files(spec).await?;
let tokenizer = Tokenizer::from_file(&tokenizer_path)
.map_err(|e| anyhow::anyhow!("load tokenizer: {e}"))?;
// File I/O + GGUF parsing + tensor materialisation are CPU-bound,
// so run them on a blocking task to avoid stalling the runtime.
let device_for_load = device.clone();
let gguf_path_for_load = gguf_path.clone();
let model_id_for_log = spec.model_id.clone();
let arch = tokio::task::spawn_blocking(move || -> Result<ModelArch> {
tracing::info!(model = %model_id_for_log, path = ?gguf_path_for_load, "loading GGUF");
let mut file = std::fs::File::open(&gguf_path_for_load).context("open GGUF file")?;
let content = gguf_file::Content::read(&mut file)
.map_err(|e| anyhow::anyhow!("parse GGUF: {e}"))?;
let architecture = content
.metadata
.get("general.architecture")
.and_then(|v| v.to_string().ok().cloned())
.unwrap_or_default();
tracing::info!(architecture = %architecture, "GGUF architecture");
match architecture.as_str() {
"qwen3" => {
let weights =
QuantizedQwen3Weights::from_gguf(content, &mut file, &device_for_load)
.map_err(|e| anyhow::anyhow!("from_gguf qwen3: {e}"))?;
Ok(ModelArch::Qwen3Quantized(weights))
}
other => anyhow::bail!(
"unsupported GGUF architecture '{other}'; Stage 3 only supports qwen3"
),
}
})
.await
.context("blocking load task panicked")??;
let loaded = Arc::new(LoadedModel {
model_id: spec.model_id.clone(),
arch: Arc::new(Mutex::new(arch)),
tokenizer,
device,
quant: spec.quant.clone(),
devices,
});
let mut models = self.models.write().await;
models.insert(spec.model_id.clone(), loaded);
tracing::info!(model = %spec.model_id, "model loaded");
Ok(())
}
async fn unload_model(&self, model_id: &str) -> Result<()> {
let mut models = self.models.write().await;
if models.remove(model_id).is_none() {
anyhow::bail!("model '{model_id}' not loaded");
}
tracing::info!(model = %model_id, "model unloaded");
Ok(())
}
async fn inference_endpoint(&self, model_id: &str) -> Option<String> {
let models = self.models.read().await;
models.contains_key(model_id).then(|| self.bind_url.clone())
}
}
/// Errors returned by `CandleHarness::chat_completion`. The
/// `ModelNotLoaded` variant lets the HTTP handler map cleanly to 404
/// without string-matching on anyhow messages.
#[derive(Debug, thiserror::Error)]
pub enum InferenceError {
#[error("model '{0}' not loaded on this neuron")]
ModelNotLoaded(String),
#[error(transparent)]
Other(#[from] anyhow::Error),
}
/// Apply the Qwen3 chat template:
///
/// ```text
/// <|im_start|>{role}\n{content}<|im_end|>\n
/// ...
/// <|im_start|>assistant\n
/// ```
///
/// The trailing `<|im_start|>assistant\n` cues the model to begin a turn.
/// Non-text content parts (vision blocks) are joined as text only; full
/// multimodal handling is out of scope for Stage 3.
fn format_qwen3_prompt(messages: &[ChatMessage]) -> String {
let mut prompt = String::new();
for msg in messages {
let content = match &msg.content {
MessageContent::Text(s) => s.clone(),
MessageContent::Parts(parts) => parts
.iter()
.filter_map(|p| p.get("text").and_then(|v| v.as_str()))
.collect::<Vec<_>>()
.join(""),
};
prompt.push_str("<|im_start|>");
prompt.push_str(&msg.role);
prompt.push('\n');
prompt.push_str(&content);
prompt.push_str("<|im_end|>\n");
}
prompt.push_str("<|im_start|>assistant\n");
prompt
}
#[allow(clippy::too_many_arguments)]
fn run_inference(
arch: &mut ModelArch,
device: &Device,
prompt_tokens: &[u32],
max_new: usize,
temperature: f64,
top_p: Option<f64>,
seed: u64,
eos_id: Option<u32>,
) -> Result<(Vec<u32>, String)> {
let mut logits_processor = {
let sampling = if temperature <= 0.0 {
Sampling::ArgMax
} else {
match top_p {
Some(p) => Sampling::TopP { p, temperature },
None => Sampling::All { temperature },
}
};
LogitsProcessor::from_sampling(seed, sampling)
};
let mut generated: Vec<u32> = Vec::new();
let mut next_token = match arch {
ModelArch::Qwen3Quantized(model) => {
model.clear_kv_cache();
let input = Tensor::new(prompt_tokens, device)?.unsqueeze(0)?;
let logits = model.forward(&input, 0)?;
let logits = logits.squeeze(0)?;
logits_processor.sample(&logits)?
}
};
if Some(next_token) == eos_id {
return Ok((generated, "stop".into()));
}
generated.push(next_token);
for index in 0..max_new.saturating_sub(1) {
next_token = match arch {
ModelArch::Qwen3Quantized(model) => {
let input = Tensor::new(&[next_token], device)?.unsqueeze(0)?;
let logits = model.forward(&input, prompt_tokens.len() + index)?;
let logits = logits.squeeze(0)?;
logits_processor.sample(&logits)?
}
};
if Some(next_token) == eos_id {
return Ok((generated, "stop".into()));
}
generated.push(next_token);
}
Ok((generated, "length".into()))
}
/// Streaming counterpart to `run_inference`. Emits chunks via `tx` as
/// tokens are generated and exits on EOS, max_new, or receiver drop.
///
/// Detokenization tracks the cumulative decoded prefix so each chunk's
/// `content` delta is the substring appended since the last chunk —
/// safe across BPE byte-fallback boundaries.
#[allow(clippy::too_many_arguments)]
fn run_inference_streaming(
arch: &mut ModelArch,
device: &Device,
tokenizer: &Tokenizer,
prompt_tokens: &[u32],
max_new: usize,
temperature: f64,
top_p: Option<f64>,
seed: u64,
eos_id: Option<u32>,
id: &str,
created: u64,
model_id: &str,
tx: &mpsc::Sender<ChatCompletionChunk>,
) -> Result<()> {
let mut logits_processor = {
let sampling = if temperature <= 0.0 {
Sampling::ArgMax
} else {
match top_p {
Some(p) => Sampling::TopP { p, temperature },
None => Sampling::All { temperature },
}
};
LogitsProcessor::from_sampling(seed, sampling)
};
let mut all_tokens: Vec<u32> = Vec::new();
let mut decoded_prefix = String::new();
let mut finish_reason = "length".to_string();
let mut next_token = match arch {
ModelArch::Qwen3Quantized(model) => {
model.clear_kv_cache();
let input = Tensor::new(prompt_tokens, device)?.unsqueeze(0)?;
let logits = model.forward(&input, 0)?;
let logits = logits.squeeze(0)?;
logits_processor.sample(&logits)?
}
};
let emit_token = |all_tokens: &[u32], decoded_prefix: &mut String| -> Result<bool> {
let full = tokenizer
.decode(all_tokens, true)
.map_err(|e| anyhow::anyhow!("decode: {e}"))?;
if full.len() > decoded_prefix.len() {
let delta = full[decoded_prefix.len()..].to_string();
*decoded_prefix = full;
let chunk = ChatCompletionChunk {
id: id.into(),
object: "chat.completion.chunk".into(),
created,
model: model_id.into(),
choices: vec![ChunkChoice {
index: 0,
delta: json!({ "content": delta }),
finish_reason: None,
extra: serde_json::Value::Object(Default::default()),
}],
usage: None,
extra: serde_json::Value::Object(Default::default()),
};
// blocking_send returns Err if the consumer hung up — signal
// the caller to stop generating.
if tx.blocking_send(chunk).is_err() {
return Ok(false);
}
}
Ok(true)
};
if Some(next_token) == eos_id {
finish_reason = "stop".into();
} else {
all_tokens.push(next_token);
if !emit_token(&all_tokens, &mut decoded_prefix)? {
return Ok(());
}
for index in 0..max_new.saturating_sub(1) {
next_token = match arch {
ModelArch::Qwen3Quantized(model) => {
let input = Tensor::new(&[next_token], device)?.unsqueeze(0)?;
let logits = model.forward(&input, prompt_tokens.len() + index)?;
let logits = logits.squeeze(0)?;
logits_processor.sample(&logits)?
}
};
if Some(next_token) == eos_id {
finish_reason = "stop".into();
break;
}
all_tokens.push(next_token);
if !emit_token(&all_tokens, &mut decoded_prefix)? {
return Ok(());
}
}
}
let final_chunk = ChatCompletionChunk {
id: id.into(),
object: "chat.completion.chunk".into(),
created,
model: model_id.into(),
choices: vec![ChunkChoice {
index: 0,
delta: serde_json::Value::Object(Default::default()),
finish_reason: Some(finish_reason),
extra: serde_json::Value::Object(Default::default()),
}],
usage: None,
extra: serde_json::Value::Object(Default::default()),
};
let _ = tx.blocking_send(final_chunk);
Ok(())
}
fn unix_now_secs() -> u64 {
SystemTime::now()
.duration_since(UNIX_EPOCH)
.map(|d| d.as_secs())
.unwrap_or(0)
}
fn unix_subsec_nanos() -> u64 {
SystemTime::now()
.duration_since(UNIX_EPOCH)
.map(|d| d.as_nanos() as u64)
.unwrap_or(0)
}

View File

@@ -0,0 +1 @@
// llama.cpp harness implementation — Phase 11.

View File

@@ -0,0 +1,163 @@
//! mistral.rs harness implementation.
//!
//! Wraps the mistral.rs HTTP API for model lifecycle management
//! and optionally manages the process via systemd.
use anyhow::Result;
use async_trait::async_trait;
use cortex_core::harness::{Harness, HarnessConfig, HarnessHealth, ModelInfo, ModelSpec};
use reqwest::Client;
use serde::Deserialize;
pub struct MistralRsHarness {
endpoint: String,
systemd_unit: Option<String>,
client: Client,
}
impl MistralRsHarness {
pub fn new(endpoint: String, systemd_unit: Option<String>) -> Self {
Self {
endpoint,
systemd_unit,
client: Client::builder()
.timeout(std::time::Duration::from_secs(30))
.build()
.expect("failed to build HTTP client"),
}
}
}
/// Response from mistral.rs `GET /v1/models`.
#[derive(Debug, Deserialize)]
struct ModelsResponse {
data: Vec<ModelEntry>,
}
#[derive(Debug, Deserialize)]
struct ModelEntry {
id: String,
#[serde(default)]
status: Option<String>,
}
#[async_trait]
impl Harness for MistralRsHarness {
fn name(&self) -> &str {
"mistralrs"
}
async fn start(&self, _config: &HarnessConfig) -> Result<()> {
let Some(unit) = &self.systemd_unit else {
anyhow::bail!("no systemd unit configured for mistralrs harness");
};
let output = tokio::process::Command::new("systemctl")
.args(["start", unit])
.output()
.await?;
if !output.status.success() {
let stderr = String::from_utf8_lossy(&output.stderr);
anyhow::bail!("systemctl start {unit} failed: {stderr}");
}
// Wait for the health endpoint to respond (up to 30s).
let url = format!("{}/health", self.endpoint);
for _ in 0..30 {
tokio::time::sleep(std::time::Duration::from_secs(1)).await;
if self.client.get(&url).send().await.is_ok() {
tracing::info!(unit, "mistralrs started and healthy");
return Ok(());
}
}
anyhow::bail!("mistralrs started but health endpoint did not respond within 30s");
}
async fn stop(&self) -> Result<()> {
let Some(unit) = &self.systemd_unit else {
anyhow::bail!("no systemd unit configured for mistralrs harness");
};
let output = tokio::process::Command::new("systemctl")
.args(["stop", unit])
.output()
.await?;
if !output.status.success() {
let stderr = String::from_utf8_lossy(&output.stderr);
anyhow::bail!("systemctl stop {unit} failed: {stderr}");
}
Ok(())
}
async fn health(&self) -> HarnessHealth {
let url = format!("{}/health", self.endpoint);
let running = self.client.get(&url).send().await.is_ok();
HarnessHealth {
name: "mistralrs".into(),
running,
uptime_secs: None,
}
}
async fn list_models(&self) -> Result<Vec<ModelInfo>> {
let url = format!("{}/v1/models", self.endpoint);
let resp = self.client.get(&url).send().await?;
if !resp.status().is_success() {
anyhow::bail!("GET /v1/models returned {}", resp.status());
}
let models_resp: ModelsResponse = resp.json().await?;
Ok(models_resp
.data
.into_iter()
.map(|m| ModelInfo {
id: m.id,
harness: "mistralrs".into(),
status: m.status.unwrap_or_else(|| "loaded".into()),
devices: vec![],
vram_used_mb: None,
})
.collect())
}
async fn load_model(&self, spec: &ModelSpec) -> Result<()> {
let url = format!("{}/v1/models/reload", self.endpoint);
let resp = self
.client
.post(&url)
.json(&serde_json::json!({ "model_id": spec.model_id }))
.send()
.await?;
if !resp.status().is_success() {
let body = resp.text().await.unwrap_or_default();
anyhow::bail!("POST /v1/models/reload failed: {body}");
}
Ok(())
}
async fn unload_model(&self, model_id: &str) -> Result<()> {
let url = format!("{}/v1/models/unload", self.endpoint);
let resp = self
.client
.post(&url)
.json(&serde_json::json!({ "model_id": model_id }))
.send()
.await?;
if !resp.status().is_success() {
let body = resp.text().await.unwrap_or_default();
anyhow::bail!("POST /v1/models/unload failed: {body}");
}
Ok(())
}
async fn inference_endpoint(&self, _model_id: &str) -> Option<String> {
// mistral.rs routes internally by model name in the request body,
// so the inference endpoint is always the base URL.
Some(self.endpoint.clone())
}
}

View File

@@ -1,22 +1,15 @@
//! Harness registry — maps harness names to trait implementations. //! Harness registry — maps harness names to trait implementations.
pub mod candle; pub mod llamacpp;
pub mod mistralrs;
use anyhow::Result; use anyhow::Result;
use cortex_core::harness::{Harness, HarnessConfig, ModelInfo, ModelSpec}; use cortex_core::harness::{Harness, HarnessConfig, ModelInfo, ModelSpec};
use std::collections::HashMap; use std::collections::HashMap;
use std::sync::Arc;
/// Registry of available harness implementations. /// Registry of available harness implementations.
///
/// Holds an `Arc<dyn Harness>` per harness for generic lifecycle dispatch
/// (load/unload/list_models). When a candle harness is registered, a typed
/// `Arc<CandleHarness>` is also cached so inference routes can bypass the
/// dyn-Trait dispatch and reach harness-specific methods (chat completion,
/// streaming, etc.).
pub struct HarnessRegistry { pub struct HarnessRegistry {
harnesses: HashMap<String, Arc<dyn Harness>>, harnesses: HashMap<String, Box<dyn Harness>>,
candle: Option<Arc<candle::CandleHarness>>,
} }
impl Default for HarnessRegistry { impl Default for HarnessRegistry {
@@ -29,11 +22,10 @@ impl HarnessRegistry {
pub fn new() -> Self { pub fn new() -> Self {
Self { Self {
harnesses: HashMap::new(), harnesses: HashMap::new(),
candle: None,
} }
} }
pub fn register(&mut self, harness: Arc<dyn Harness>) { pub fn register(&mut self, harness: Box<dyn Harness>) {
self.harnesses.insert(harness.name().to_string(), harness); self.harnesses.insert(harness.name().to_string(), harness);
} }
@@ -42,12 +34,6 @@ impl HarnessRegistry {
self.harnesses.keys().cloned().collect() self.harnesses.keys().cloned().collect()
} }
/// Typed handle to the candle harness, if registered. Used by inference
/// routes that need methods beyond the `Harness` trait surface.
pub fn candle(&self) -> Option<Arc<candle::CandleHarness>> {
self.candle.clone()
}
/// List models from all registered harnesses. /// List models from all registered harnesses.
pub async fn list_all_models(&self) -> Result<Vec<ModelInfo>> { pub async fn list_all_models(&self) -> Result<Vec<ModelInfo>> {
let mut all = Vec::new(); let mut all = Vec::new();
@@ -95,25 +81,19 @@ impl HarnessRegistry {
} }
/// Build a registry from harness configs. /// Build a registry from harness configs.
/// pub fn from_configs(configs: &[HarnessConfig]) -> Self {
/// `bind_url` is the URL where this neuron serves inference (its own
/// listen address). In-process harnesses (currently the only kind)
/// return this URL from `inference_endpoint`.
pub fn from_configs(
configs: &[HarnessConfig],
bind_url: &str,
settings: &crate::config::HarnessSettings,
) -> Self {
let mut registry = Self::new(); let mut registry = Self::new();
for config in configs { for config in configs {
match config.name.as_str() { match config.name.as_str() {
"candle" => { "mistralrs" => {
let harness = Arc::new(candle::CandleHarness::new( if let Some(endpoint) = &config.endpoint {
bind_url.to_string(), registry.register(Box::new(mistralrs::MistralRsHarness::new(
settings.candle.hf_cache.clone(), endpoint.clone(),
)); config.systemd_unit.clone(),
registry.candle = Some(Arc::clone(&harness)); )));
registry.harnesses.insert("candle".into(), harness); } else {
tracing::warn!("mistralrs harness missing endpoint, skipping");
}
} }
other => { other => {
tracing::warn!(harness = other, "unknown harness type, skipping"); tracing::warn!(harness = other, "unknown harness type, skipping");

View File

@@ -3,4 +3,3 @@ pub mod config;
pub mod discovery; pub mod discovery;
pub mod harness; pub mod harness;
pub mod health; pub mod health;
pub mod startup;

View File

@@ -1,6 +1,6 @@
use anyhow::Result; use anyhow::Result;
use clap::Parser; use clap::Parser;
use neuron::{api, config::NeuronConfig, discovery, harness::HarnessRegistry, health, startup}; use neuron::{api, config::NeuronConfig, discovery, harness::HarnessRegistry, health};
use std::sync::Arc; use std::sync::Arc;
use std::time::Instant; use std::time::Instant;
use tokio::sync::RwLock; use tokio::sync::RwLock;
@@ -37,7 +37,6 @@ async fn main() -> Result<()> {
}); });
let port = args.port.unwrap_or(cfg.port); let port = args.port.unwrap_or(cfg.port);
let bind_url = format!("http://localhost:{port}");
let start_time = Instant::now(); let start_time = Instant::now();
tracing::info!("running hardware discovery"); tracing::info!("running hardware discovery");
@@ -48,18 +47,9 @@ async fn main() -> Result<()> {
"discovery complete" "discovery complete"
); );
// Build harness registry from config. In-process harnesses (candle) // Build harness registry from config.
// need to know neuron's own bind URL so they can return it from let registry = HarnessRegistry::from_configs(&cfg.harnesses);
// inference_endpoint.
let registry = HarnessRegistry::from_configs(&cfg.harnesses, &bind_url, &cfg.harness);
discovery_result.harnesses = registry.names(); discovery_result.harnesses = registry.names();
let candle = registry.candle();
// Activation: load default models before binding the listener.
// Each load may take tens of seconds to several minutes depending
// on model size and HF cache state — keep TimeoutStartSec in the
// systemd unit generous enough to cover the slowest entry.
startup::load_default_models(&registry, &cfg.default_models).await;
let health_cache = Arc::new(health::HealthCache::new()); let health_cache = Arc::new(health::HealthCache::new());
health_cache health_cache
@@ -75,24 +65,13 @@ async fn main() -> Result<()> {
discovery: discovery_result, discovery: discovery_result,
health_cache, health_cache,
registry: RwLock::new(registry), registry: RwLock::new(registry),
candle,
}); });
let app = api::neuron_routes().with_state(Arc::clone(&state)); let app = api::neuron_routes().with_state(state);
let addr: std::net::SocketAddr = format!("0.0.0.0:{port}").parse()?; let addr: std::net::SocketAddr = format!("0.0.0.0:{port}").parse()?;
tracing::info!("neuron listening on {addr}"); tracing::info!("neuron listening on {addr}");
let listener = tokio::net::TcpListener::bind(addr).await?; let listener = tokio::net::TcpListener::bind(addr).await?;
axum::serve(listener, app) axum::serve(listener, app).await?;
.with_graceful_shutdown(startup::shutdown_signal())
.await?;
// Deactivation: serve has returned (graceful shutdown signal
// received and connections drained). Release CUDA contexts / VRAM
// by unloading every model before exiting; systemd's TimeoutStopSec
// bounds how long this phase may take.
let registry = state.registry.read().await;
startup::unload_all_models(&registry).await;
tracing::info!("shutdown complete");
Ok(()) Ok(())
} }

View File

@@ -1,97 +0,0 @@
//! Activation- and deactivation-time orchestration.
//!
//! Wired from `main.rs` around the HTTP listener — activation runs
//! before bind, deactivation runs after axum returns from its
//! graceful-shutdown future. Kept in its own module so the logic is
//! unit-testable without spinning up a full neuron process.
use crate::harness::HarnessRegistry;
use cortex_core::harness::ModelSpec;
use std::time::Instant;
use tokio::signal;
/// Load each spec sequentially against the registry, treating
/// individual failures as warnings rather than fatal errors.
///
/// VRAM contention makes parallel loads risky; the sequential path is
/// boring but correct. The function logs elapsed time per load so an
/// operator can see which model is hogging activation.
pub async fn load_default_models(registry: &HarnessRegistry, specs: &[ModelSpec]) {
if specs.is_empty() {
return;
}
tracing::info!(count = specs.len(), "loading default models");
for spec in specs {
let start = Instant::now();
match registry.load_model(spec).await {
Ok(()) => tracing::info!(
model = %spec.model_id,
elapsed_ms = start.elapsed().as_millis() as u64,
"loaded default model"
),
Err(e) => tracing::warn!(
model = %spec.model_id,
error = %e,
elapsed_ms = start.elapsed().as_millis() as u64,
"failed to load default model, continuing"
),
}
}
}
/// Future that resolves on SIGINT (Ctrl-C) or SIGTERM (systemd stop).
///
/// Wired into `axum::serve(...).with_graceful_shutdown(shutdown_signal())`
/// so the HTTP listener stops accepting new connections, lets in-flight
/// requests drain, and then yields control back to main for cleanup.
pub async fn shutdown_signal() {
let ctrl_c = async {
signal::ctrl_c().await.ok();
};
let terminate = async {
signal::unix::signal(signal::unix::SignalKind::terminate())
.expect("install SIGTERM handler")
.recv()
.await;
};
tokio::select! {
_ = ctrl_c => tracing::info!("received SIGINT, shutting down"),
_ = terminate => tracing::info!("received SIGTERM, shutting down"),
}
}
/// Unload every model currently registered. Called from `main.rs` after
/// axum's graceful shutdown future resolves, so CUDA contexts and VRAM
/// are released before the process exits rather than left to the OS to
/// reclaim. Per-model failures are logged and skipped — keep cleanup
/// going even when one harness is unhealthy.
pub async fn unload_all_models(registry: &HarnessRegistry) {
let listed = match registry.list_all_models().await {
Ok(m) => m,
Err(e) => {
tracing::warn!(error = %e, "failed to list models during shutdown");
return;
}
};
if listed.is_empty() {
return;
}
tracing::info!(count = listed.len(), "unloading models for shutdown");
for model in listed {
let start = Instant::now();
match registry.unload_model(&model.id).await {
Ok(()) => tracing::info!(
model = %model.id,
elapsed_ms = start.elapsed().as_millis() as u64,
"unloaded"
),
Err(e) => tracing::warn!(
model = %model.id,
error = %e,
"unload failed during shutdown"
),
}
}
}

View File

@@ -1,56 +0,0 @@
//! Activation-time behaviour: load_default_models continues past
//! individual failures so a single broken catalogue entry doesn't
//! prevent the rest of the fleet from starting.
use cortex_core::harness::{HarnessConfig, ModelSpec};
use neuron::config::HarnessSettings;
use neuron::harness::HarnessRegistry;
use neuron::startup;
#[tokio::test]
async fn test_load_default_models_skips_unknown_harness() {
let registry = HarnessRegistry::from_configs(
&[HarnessConfig {
name: "candle".into(),
}],
"http://localhost:0",
&HarnessSettings::default(),
);
// Both entries fail synchronously inside the registry — no network
// call escapes (the harness lookup mismatches before hf-hub is
// touched). The function should still return cleanly.
let specs = vec![
ModelSpec {
model_id: "model-a".into(),
harness: "no-such-harness".into(),
quant: None,
tensor_parallel: None,
devices: None,
},
ModelSpec {
model_id: "model-b".into(),
harness: "no-such-harness".into(),
quant: None,
tensor_parallel: None,
devices: None,
},
];
startup::load_default_models(&registry, &specs).await;
let listed = registry
.list_all_models()
.await
.expect("list_all_models should succeed");
assert!(
listed.is_empty(),
"no models should be loaded after failed entries"
);
}
#[tokio::test]
async fn test_load_default_models_empty_is_noop() {
let registry = HarnessRegistry::new();
startup::load_default_models(&registry, &[]).await;
}

View File

@@ -14,7 +14,6 @@ async fn spawn_neuron(discovery: DiscoveryResponse) -> String {
discovery, discovery,
health_cache, health_cache,
registry: RwLock::new(registry), registry: RwLock::new(registry),
candle: None,
}); });
let app = api::neuron_routes().with_state(state); let app = api::neuron_routes().with_state(state);
@@ -136,30 +135,56 @@ async fn test_models_empty_registry() {
assert!(body.as_array().unwrap().is_empty()); assert!(body.as_array().unwrap().is_empty());
} }
/// Verify the candle harness registers, list is empty by default, and a /// Spawn a mock mistral.rs backend and a neuron with the mistralrs harness
/// load attempt for an obviously-bogus model id returns a 4xx error /// pointing at it, then test the full model lifecycle through neuron's API.
/// without crashing the daemon. Real load/unload exercising actual GGUF
/// download is covered by `tests/candle_lifecycle.rs` (cuda-integration).
#[tokio::test] #[tokio::test]
async fn test_candle_harness_registers_and_rejects_bogus_model() { async fn test_models_via_mistralrs_harness() {
use axum::routing::{get, post};
use axum::{Json, Router};
use cortex_core::harness::HarnessConfig; use cortex_core::harness::HarnessConfig;
use neuron::config::HarnessSettings; use serde_json::Value;
let registry = HarnessRegistry::from_configs( // Mock mistral.rs backend.
&[HarnessConfig { let mock_app = Router::new()
name: "candle".into(), .route(
}], "/v1/models",
"http://localhost:13131", get(|| async {
&HarnessSettings::default(), Json(json!({
); "data": [
{"id": "test-model", "status": "loaded"},
{"id": "other-model", "status": "unloaded"}
]
}))
}),
)
.route(
"/v1/models/unload",
post(|Json(_body): Json<Value>| async { Json(json!({"status": "ok"})) }),
)
.route(
"/v1/models/reload",
post(|Json(_body): Json<Value>| async { Json(json!({"status": "ok"})) }),
);
let mock_listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
let mock_addr = mock_listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(mock_listener, mock_app).await.unwrap();
});
let mock_url = format!("http://{mock_addr}");
// Build neuron with mistralrs harness pointing at mock.
let registry = HarnessRegistry::from_configs(&[HarnessConfig {
name: "mistralrs".into(),
endpoint: Some(mock_url.clone()),
systemd_unit: None,
}]);
let candle = registry.candle();
let health_cache = Arc::new(HealthCache::new()); let health_cache = Arc::new(HealthCache::new());
let state = Arc::new(NeuronState { let state = Arc::new(NeuronState {
discovery: fake_discovery(), discovery: fake_discovery(),
health_cache, health_cache,
registry: RwLock::new(registry), registry: RwLock::new(registry),
candle,
}); });
let app = api::neuron_routes().with_state(state); let app = api::neuron_routes().with_state(state);
@@ -172,6 +197,7 @@ async fn test_candle_harness_registers_and_rejects_bogus_model() {
let client = reqwest::Client::new(); let client = reqwest::Client::new();
// GET /models — should return models from mock mistralrs.
let resp = client let resp = client
.get(format!("{neuron_url}/models")) .get(format!("{neuron_url}/models"))
.send() .send()
@@ -179,140 +205,45 @@ async fn test_candle_harness_registers_and_rejects_bogus_model() {
.unwrap(); .unwrap();
assert_eq!(resp.status(), 200); assert_eq!(resp.status(), 200);
let models: Vec<serde_json::Value> = resp.json().await.unwrap(); let models: Vec<serde_json::Value> = resp.json().await.unwrap();
assert!(models.is_empty()); assert_eq!(models.len(), 2);
assert_eq!(models[0]["id"], "test-model");
assert_eq!(models[0]["harness"], "mistralrs");
assert_eq!(models[0]["status"], "loaded");
assert_eq!(models[1]["id"], "other-model");
assert_eq!(models[1]["status"], "unloaded");
// Sending a wrong-harness spec should be rejected synchronously // GET /models/test-model/endpoint — should return mock URL.
// without touching the network or the model registry. let resp = client
.get(format!("{neuron_url}/models/test-model/endpoint"))
.send()
.await
.unwrap();
assert_eq!(resp.status(), 200);
let body: serde_json::Value = resp.json().await.unwrap();
assert_eq!(body["url"], mock_url);
// POST /models/unload — should succeed.
let resp = client
.post(format!("{neuron_url}/models/unload"))
.json(&json!({"model_id": "test-model"}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), 200);
let body: serde_json::Value = resp.json().await.unwrap();
assert_eq!(body["status"], "unloaded");
// POST /models/load — should succeed.
let resp = client let resp = client
.post(format!("{neuron_url}/models/load")) .post(format!("{neuron_url}/models/load"))
.json(&json!({"model_id": "definitely/not-real", "harness": "not-candle"}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), 400);
// Registry still empty.
let resp = client
.get(format!("{neuron_url}/models"))
.send()
.await
.unwrap();
let models: Vec<serde_json::Value> = resp.json().await.unwrap();
assert!(models.is_empty());
}
/// `/v1/chat/completions` returns 503 when no candle harness is registered.
#[tokio::test]
async fn test_chat_completions_no_candle_harness() {
let registry = HarnessRegistry::new();
let health_cache = Arc::new(HealthCache::new());
let state = Arc::new(NeuronState {
discovery: fake_discovery(),
health_cache,
registry: RwLock::new(registry),
candle: None,
});
let app = api::neuron_routes().with_state(state);
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let url = format!("http://{addr}");
let resp = reqwest::Client::new()
.post(format!("{url}/v1/chat/completions"))
.json(&json!({ .json(&json!({
"model": "anything", "model_id": "test-model",
"messages": [{"role": "user", "content": "hi"}] "harness": "mistralrs"
})) }))
.send() .send()
.await .await
.unwrap(); .unwrap();
assert_eq!(resp.status(), 503); assert_eq!(resp.status(), 200);
} let body: serde_json::Value = resp.json().await.unwrap();
assert_eq!(body["status"], "loaded");
/// `/v1/chat/completions` returns 404 when the requested model isn't loaded.
#[tokio::test]
async fn test_chat_completions_model_not_loaded() {
use cortex_core::harness::HarnessConfig;
use neuron::config::HarnessSettings;
let registry = HarnessRegistry::from_configs(
&[HarnessConfig {
name: "candle".into(),
}],
"http://localhost:0",
&HarnessSettings::default(),
);
let candle = registry.candle();
let health_cache = Arc::new(HealthCache::new());
let state = Arc::new(NeuronState {
discovery: fake_discovery(),
health_cache,
registry: RwLock::new(registry),
candle,
});
let app = api::neuron_routes().with_state(state);
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let url = format!("http://{addr}");
let resp = reqwest::Client::new()
.post(format!("{url}/v1/chat/completions"))
.json(&json!({
"model": "definitely/not-loaded",
"messages": [{"role": "user", "content": "hi"}]
}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), 404);
}
/// `/v1/chat/completions` with `stream: true` returns 404 when the
/// model isn't loaded — same surface as the non-streaming path. The
/// streaming code only kicks in once the model lookup succeeds.
#[tokio::test]
async fn test_chat_completions_streaming_model_not_loaded() {
use cortex_core::harness::HarnessConfig;
use neuron::config::HarnessSettings;
let registry = HarnessRegistry::from_configs(
&[HarnessConfig {
name: "candle".into(),
}],
"http://localhost:0",
&HarnessSettings::default(),
);
let candle = registry.candle();
let health_cache = Arc::new(HealthCache::new());
let state = Arc::new(NeuronState {
discovery: fake_discovery(),
health_cache,
registry: RwLock::new(registry),
candle,
});
let app = api::neuron_routes().with_state(state);
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let url = format!("http://{addr}");
let resp = reqwest::Client::new()
.post(format!("{url}/v1/chat/completions"))
.json(&json!({
"model": "definitely/not-loaded",
"messages": [{"role": "user", "content": "hi"}],
"stream": true
}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), 404);
} }

View File

@@ -1,87 +0,0 @@
//! Real model load/unload lifecycle through the candle harness.
//!
//! Gated behind the `cuda-integration` feature because it downloads a
//! real (small) GGUF from HuggingFace and materialises tensors on the
//! configured device. Run on a host with network access and either a
//! CUDA GPU (when built with `--features cuda`) or enough CPU RAM to
//! hold the model.
//!
//! Usage:
//! cargo test -p neuron --features cuda-integration --test candle_lifecycle
//!
//! Optional environment variables:
//! NEURON_TEST_MODEL_ID — HuggingFace repo to load (default: a small
//! public Qwen3 GGUF repo).
//! NEURON_TEST_QUANT — quant substring matched against GGUF
//! filenames (default: "Q4_K_M").
//! HF_HOME — HuggingFace cache directory.
#![cfg(feature = "cuda-integration")]
use cortex_core::harness::{HarnessConfig, ModelSpec};
use neuron::config::HarnessSettings;
use neuron::harness::HarnessRegistry;
use std::path::PathBuf;
#[tokio::test]
async fn test_candle_qwen3_load_unload_lifecycle() {
let _ = tracing_subscriber::fmt()
.with_test_writer()
.with_env_filter("info,neuron=debug")
.try_init();
let model_id = std::env::var("NEURON_TEST_MODEL_ID")
.unwrap_or_else(|_| "Qwen/Qwen3-0.6B-GGUF".to_string());
let quant = std::env::var("NEURON_TEST_QUANT").unwrap_or_else(|_| "Q4_K_M".to_string());
let mut settings = HarnessSettings::default();
if let Ok(home) = std::env::var("HF_HOME") {
settings.candle.hf_cache = Some(PathBuf::from(home));
}
let registry = HarnessRegistry::from_configs(
&[HarnessConfig {
name: "candle".into(),
}],
"http://localhost:13131",
&settings,
);
let spec = ModelSpec {
model_id: model_id.clone(),
harness: "candle".into(),
quant: Some(quant),
tensor_parallel: None,
devices: Some(vec![0]),
};
registry
.load_model(&spec)
.await
.expect("load_model should succeed");
let models = registry.list_all_models().await.expect("list_all_models");
assert_eq!(models.len(), 1, "expected exactly one loaded model");
assert_eq!(models[0].id, model_id);
assert_eq!(models[0].harness, "candle");
assert_eq!(models[0].status, "loaded");
let url = registry.inference_endpoint(&model_id).await;
assert_eq!(url, Some("http://localhost:13131".into()));
// Re-loading the same model should be rejected.
let again = registry.load_model(&spec).await;
assert!(again.is_err(), "second load should error");
registry
.unload_model(&model_id)
.await
.expect("unload_model should succeed");
let models = registry.list_all_models().await.expect("list_all_models");
assert!(models.is_empty(), "registry should be empty after unload");
// Unloading a model that isn't loaded should error.
let err = registry.unload_model(&model_id).await;
assert!(err.is_err(), "unload of missing model should error");
}

View File

@@ -1,32 +0,0 @@
//! Deactivation behaviour: unload_all_models tolerates an empty
//! registry and continues past per-model unload failures.
use cortex_core::harness::HarnessConfig;
use neuron::config::HarnessSettings;
use neuron::harness::HarnessRegistry;
use neuron::startup;
#[tokio::test]
async fn test_unload_all_models_empty_registry_is_noop() {
let registry = HarnessRegistry::new();
startup::unload_all_models(&registry).await;
}
#[tokio::test]
async fn test_unload_all_models_with_no_loaded_models() {
let registry = HarnessRegistry::from_configs(
&[HarnessConfig {
name: "candle".into(),
}],
"http://localhost:0",
&HarnessSettings::default(),
);
startup::unload_all_models(&registry).await;
let listed = registry
.list_all_models()
.await
.expect("list_all_models should still succeed after shutdown cleanup");
assert!(listed.is_empty());
}

View File

@@ -1,7 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<service>
<short>cortex</short>
<description>Cortex — inference gateway for multi-node GPU clusters</description>
<port protocol="tcp" port="31313"/>
<port protocol="tcp" port="31314"/>
</service>

View File

@@ -1,6 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<service>
<short>helexa-neuron</short>
<description>Neuron — per-node GPU discovery and harness daemon for cortex</description>
<port protocol="tcp" port="13131"/>
</service>

View File

@@ -10,22 +10,6 @@ Restart=on-failure
RestartSec=5 RestartSec=5
User=neuron User=neuron
Group=neuron Group=neuron
# /var/lib/neuron is the neuron user's $HOME — hf-hub writes its
# default cache there (~/.cache/huggingface/hub). Without this directive
# systemd doesn't create the directory and hf-hub downloads fail with
# "fetch GGUF <file>: failed to create cache dir".
StateDirectory=neuron
StateDirectoryMode=0755
# Loading default_models from neuron.toml happens before the HTTP
# listener binds; large models can take many minutes to download and
# materialise on first activation. systemd's default TimeoutStartSec
# (90s) is far too short; allow 30 minutes.
TimeoutStartSec=1800s
# On stop, neuron drains in-flight requests then unloads every model
# to release CUDA contexts cleanly. Allow generous time for big-model
# unloads; systemd will SIGKILL after this bound.
TimeoutStopSec=120s
KillSignal=SIGTERM
[Install] [Install]
WantedBy=multi-user.target WantedBy=multi-user.target

View File

@@ -6,7 +6,7 @@
[[models]] [[models]]
id = "your-org/large-model" id = "your-org/large-model"
harness = "candle" harness = "mistralrs"
quant = "Q4_K_M" quant = "Q4_K_M"
vram_mb = 19000 vram_mb = 19000
min_devices = 2 min_devices = 2
@@ -15,7 +15,7 @@ pinned_on = ["gpu-large"]
[[models]] [[models]]
id = "your-org/medium-model" id = "your-org/medium-model"
harness = "candle" harness = "mistralrs"
quant = "Q6_K" quant = "Q6_K"
vram_mb = 12000 vram_mb = 12000
min_devices = 1 min_devices = 1
@@ -23,7 +23,7 @@ pinned_on = ["gpu-medium"]
[[models]] [[models]]
id = "your-org/embedding-model" id = "your-org/embedding-model"
harness = "candle" harness = "mistralrs"
quant = "Q8_0" quant = "Q8_0"
vram_mb = 8000 vram_mb = 8000
min_devices = 1 min_devices = 1

View File

@@ -3,38 +3,14 @@
# Copy to /etc/neuron/neuron.toml and adjust for your environment. # Copy to /etc/neuron/neuron.toml and adjust for your environment.
# #
# Environment variable overrides use NEURON_ prefix with __ separators: # Environment variable overrides use NEURON_ prefix with __ separators:
# NEURON_PORT=13131 # NEURON_PORT=9090
port = 13131 port = 9090
# -- Harnesses --------------------------------------------------------------- # -- Harnesses ---------------------------------------------------------------
# Each [[harnesses]] entry enables an inference engine. Currently only # Each [[harnesses]] entry declares an inference engine managed by neuron.
# "candle" is supported — it runs in-process and uses huggingface/candle
# for inference on local CUDA devices (or CPU when CUDA is unavailable).
[[harnesses]] [[harnesses]]
name = "candle" name = "mistralrs"
endpoint = "http://localhost:8080"
# -- Candle harness settings ------------------------------------------------- systemd_unit = "mistralrs.service"
# Optional tuning for the candle harness.
[harness.candle]
# HuggingFace cache directory for model weights. When unset, hf-hub's
# default (~/.cache/huggingface) is used.
# hf_cache = "/var/lib/neuron/hf-cache"
# -- Default models ----------------------------------------------------------
# Models listed here are loaded automatically when the neuron service
# activates. Loading is sequential — a slow or failing entry doesn't
# block the rest of the fleet, but it does push out the time before
# neuron starts serving HTTP, so keep the list short. Operators can
# load additional models on demand via POST /models/load.
#
# Make sure data/neuron.service's TimeoutStartSec is generous enough to
# cover the slowest entry's first-time download + materialisation.
# [[default_models]]
# model_id = "Qwen/Qwen3-0.6B-GGUF"
# harness = "candle"
# quant = "Q4_K_M"
# devices = [0]

View File

@@ -1,10 +1,7 @@
Name: helexa-neuron Name: neuron
Version: 0.1.16 Version: 0.1.8
Release: 1%{?dist} Release: 1%{?dist}
Summary: Per-node GPU discovery and harness management daemon for cortex Summary: Per-node GPU discovery and harness management daemon for cortex
# Package name disambiguates from Fedora's existing "neuron" package
# (NEURON neural simulation environment from Yale). Binary, systemd
# unit, and system user are still called "neuron" for brevity.
License: GPL-3.0-or-later License: GPL-3.0-or-later
URL: https://git.lair.cafe/helexa/cortex URL: https://git.lair.cafe/helexa/cortex
@@ -24,7 +21,6 @@ BuildRequires: systemd-rpm-macros
Requires(pre): shadow-utils Requires(pre): shadow-utils
Requires: systemd Requires: systemd
Requires: firewalld-filesystem
# systemd-rpm-macros ships a unit dep generator that parses User=/Group= # systemd-rpm-macros ships a unit dep generator that parses User=/Group=
# from our .service file and emits Requires: user(neuron)/group(neuron). # from our .service file and emits Requires: user(neuron)/group(neuron).
@@ -37,9 +33,8 @@ Provides: user(neuron)
%description %description
Neuron is a per-node daemon for cortex inference clusters. It discovers Neuron is a per-node daemon for cortex inference clusters. It discovers
local GPU hardware via nvidia-smi, runs in-process inference via local GPU hardware via nvidia-smi, manages inference harnesses (mistral.rs,
huggingface/candle, and exposes an HTTP API for model lifecycle llama.cpp), and exposes an HTTP API for model lifecycle management.
management (load, unload, list, inference endpoint).
%prep %prep
%autosetup %autosetup
@@ -60,7 +55,6 @@ cargo build --release -p neuron
install -Dm755 target/release/neuron %{buildroot}%{_bindir}/neuron install -Dm755 target/release/neuron %{buildroot}%{_bindir}/neuron
install -Dm644 data/neuron.service %{buildroot}%{_unitdir}/neuron.service install -Dm644 data/neuron.service %{buildroot}%{_unitdir}/neuron.service
install -Dm644 data/neuron-sysusers.conf %{buildroot}%{_sysusersdir}/neuron.conf install -Dm644 data/neuron-sysusers.conf %{buildroot}%{_sysusersdir}/neuron.conf
install -Dm644 data/neuron-firewalld.xml %{buildroot}%{_prefix}/lib/firewalld/services/helexa-neuron.xml
install -dm755 %{buildroot}%{_sysconfdir}/neuron install -dm755 %{buildroot}%{_sysconfdir}/neuron
install -Dm644 neuron.example.toml %{buildroot}%{_sysconfdir}/neuron/neuron.toml install -Dm644 neuron.example.toml %{buildroot}%{_sysconfdir}/neuron/neuron.toml
@@ -82,20 +76,9 @@ install -Dm644 neuron.example.toml %{buildroot}%{_sysconfdir}/neuron/neuron.toml
%{_bindir}/neuron %{_bindir}/neuron
%{_unitdir}/neuron.service %{_unitdir}/neuron.service
%{_sysusersdir}/neuron.conf %{_sysusersdir}/neuron.conf
%{_prefix}/lib/firewalld/services/helexa-neuron.xml
%dir %{_sysconfdir}/neuron %dir %{_sysconfdir}/neuron
%config(noreplace) %{_sysconfdir}/neuron/neuron.toml %config(noreplace) %{_sysconfdir}/neuron/neuron.toml
%changelog %changelog
* Thu Apr 16 2026 Gitea Actions <actions@git.lair.cafe> - 0.1.16-1
- chore: ignore local deploy script
- chore: move default ports out of common-collision ranges
- ci: drop actions/cache for cargo registry and target
* Thu Apr 16 2026 Gitea Actions <actions@git.lair.cafe> - 0.1.14-1
- ci: publish both packages to a single helexa/helexa COPR project
- fix(rpm): rename neuron package to helexa-neuron
- ci: commit generated %changelog entries back to main
* Wed Apr 15 2026 Rob Thijssen <grenade@rob.tn> - 0.1.0-1 * Wed Apr 15 2026 Rob Thijssen <grenade@rob.tn> - 0.1.0-1
- Initial package - Initial package

View File

@@ -1,102 +0,0 @@
# Prebuilt-binary spec for cortex.
#
# Unlike cortex.spec (which builds from source via cargo), this spec
# wraps a pre-built `cortex` binary produced by an upstream CI job and
# packages it for rpm.lair.cafe. The %build phase is a no-op.
#
# Required defines at rpmbuild time:
# cortex_version e.g. "0.1.16"
# cortex_prerelease e.g. "0.1.20260518gitabcdef0" (used as Release)
%global _build_id_links none
%global debug_package %{nil}
%global __strip /usr/bin/true
%{!?cortex_version: %global cortex_version 0.0.0}
%if 0%{?cortex_prerelease:1}
%global cortex_release %{cortex_prerelease}
%else
%global cortex_release 1
%endif
Name: cortex
Version: %{cortex_version}
Release: %{cortex_release}%{?dist}
Summary: Inference gateway for multi-node GPU clusters (prebuilt)
License: GPL-3.0-or-later
URL: https://git.lair.cafe/helexa/cortex
Source0: cortex
Source1: cortex.service
Source2: cortex-sysusers.conf
Source3: cortex-firewalld.xml
Source4: cortex.example.toml
Source5: models.example.toml
Source6: LICENSE
ExclusiveArch: x86_64
Requires(pre): shadow-utils
Requires: systemd
Requires: firewalld-filesystem
Provides: user(cortex)
%description
Cortex is a Rust reverse-proxy that sits in front of multiple neuron
inference daemons and presents a unified OpenAI and Anthropic
compatible API surface.
This package wraps a binary built upstream in CI; the source-build
spec (cortex.spec) remains available for stable releases.
%prep
cp %{SOURCE0} ./cortex
cp %{SOURCE1} .
cp %{SOURCE2} .
cp %{SOURCE3} .
cp %{SOURCE4} .
cp %{SOURCE5} .
cp %{SOURCE6} .
%build
# Already built in the upstream CI build job.
%install
install -Dm755 cortex %{buildroot}%{_bindir}/cortex
install -Dm644 cortex.service %{buildroot}%{_unitdir}/cortex.service
install -Dm644 cortex-sysusers.conf %{buildroot}%{_sysusersdir}/cortex.conf
install -Dm644 cortex-firewalld.xml %{buildroot}%{_prefix}/lib/firewalld/services/cortex.xml
install -dm755 %{buildroot}%{_sysconfdir}/cortex
install -Dm644 cortex.example.toml %{buildroot}%{_sysconfdir}/cortex/cortex.toml
install -Dm644 models.example.toml %{buildroot}%{_sysconfdir}/cortex/models.toml
%pre
getent group cortex >/dev/null || groupadd -r cortex
getent passwd cortex >/dev/null || \
useradd -r -g cortex -d /var/lib/cortex -s /sbin/nologin \
-c "Cortex inference gateway" cortex
%post
%systemd_post cortex.service
%preun
%systemd_preun cortex.service
%postun
%systemd_postun_with_restart cortex.service
%files
%license LICENSE
%{_bindir}/cortex
%{_unitdir}/cortex.service
%{_sysusersdir}/cortex.conf
%{_prefix}/lib/firewalld/services/cortex.xml
%dir %{_sysconfdir}/cortex
%config(noreplace) %{_sysconfdir}/cortex/cortex.toml
%config(noreplace) %{_sysconfdir}/cortex/models.toml
%changelog
* Mon May 18 2026 Gitea Actions <actions@git.lair.cafe> - %{cortex_version}-%{cortex_release}
- Prerelease build from upstream CI binary.

View File

@@ -1,122 +0,0 @@
# Prebuilt-binary spec for helexa-neuron flavoured by CUDA compute capability.
#
# Unlike helexa-neuron.spec (which builds from source via cargo), this
# spec wraps a pre-built `neuron-{flavour}` binary produced by an
# upstream CI job and packages it for rpm.lair.cafe. The %build phase
# is a no-op.
#
# Required defines at rpmbuild time:
# neuron_version e.g. "0.1.16"
# neuron_flavour e.g. "ada", "blackwell" — matches the CI build
# matrix's compute_cap label.
# neuron_prerelease e.g. "0.1.20260518gitabcdef0" (used as Release)
#
# One flavour can be installed at a time on a given host; flavour
# packages Conflict with each other.
%global _build_id_links none
%global debug_package %{nil}
%global __strip /usr/bin/true
%{!?neuron_version: %global neuron_version 0.0.0}
%{!?neuron_flavour: %global neuron_flavour blackwell}
%if 0%{?neuron_prerelease:1}
%global neuron_release %{neuron_prerelease}
%else
%global neuron_release 1
%endif
Name: helexa-neuron-%{neuron_flavour}
Version: %{neuron_version}
Release: %{neuron_release}%{?dist}
Summary: Per-node GPU inference daemon (candle, %{neuron_flavour} flavour)
License: GPL-3.0-or-later
URL: https://git.lair.cafe/helexa/cortex
Source0: neuron-%{neuron_flavour}
Source1: neuron.service
Source2: neuron-sysusers.conf
Source3: neuron-firewalld.xml
Source4: neuron.example.toml
Source5: LICENSE
ExclusiveArch: x86_64
# Binary links against the CUDA runtime, cuDNN, NCCL, etc. Suppress
# auto-detected exact soname deps — users may have CUDA from various
# sources (rpmfusion, nvidia-direct) at different compatible versions;
# a runtime dlopen failure surfaces a clearer error than rpm dep
# resolution would.
%global __requires_exclude ^lib(cuda|cudart|cudnn|cublas|cublasLt|curand|nvrtc|nccl)
Requires(pre): shadow-utils
Requires: systemd
Requires: firewalld-filesystem
Provides: helexa-neuron = %{neuron_version}-%{neuron_release}
Provides: user(neuron)
# Mutual exclusion across flavours and the source-build variant.
Conflicts: helexa-neuron
Conflicts: helexa-neuron-ada
Conflicts: helexa-neuron-ampere
Conflicts: helexa-neuron-blackwell
# (The Conflicts: with self is filtered by rpm at install time.)
%description
Neuron is the per-node daemon for cortex inference clusters. It
discovers local GPU hardware via nvidia-smi, runs in-process
inference via huggingface/candle, and exposes an HTTP API for model
lifecycle management (load, unload, list, inference endpoint).
This is the %{neuron_flavour} flavour, built for that CUDA compute
capability. Install the flavour matching the GPUs on this host.
%prep
cp %{SOURCE0} ./neuron
cp %{SOURCE1} .
cp %{SOURCE2} .
cp %{SOURCE3} .
cp %{SOURCE4} .
cp %{SOURCE5} .
%build
# Already built in the upstream CI build job (with --features cuda).
%install
install -Dm755 neuron %{buildroot}%{_bindir}/neuron
install -Dm644 neuron.service %{buildroot}%{_unitdir}/neuron.service
install -Dm644 neuron-sysusers.conf %{buildroot}%{_sysusersdir}/neuron.conf
install -Dm644 neuron-firewalld.xml %{buildroot}%{_prefix}/lib/firewalld/services/helexa-neuron.xml
install -dm755 %{buildroot}%{_sysconfdir}/neuron
install -Dm644 neuron.example.toml %{buildroot}%{_sysconfdir}/neuron/neuron.toml
%pre
getent group neuron >/dev/null || groupadd -r neuron
getent passwd neuron >/dev/null || \
useradd -r -g neuron -d /var/lib/neuron -s /sbin/nologin \
-G video,render \
-c "Neuron GPU node daemon" neuron
%post
%systemd_post neuron.service
%preun
%systemd_preun neuron.service
%postun
%systemd_postun_with_restart neuron.service
%files
%license LICENSE
%{_bindir}/neuron
%{_unitdir}/neuron.service
%{_sysusersdir}/neuron.conf
%{_prefix}/lib/firewalld/services/helexa-neuron.xml
%dir %{_sysconfdir}/neuron
%config(noreplace) %{_sysconfdir}/neuron/neuron.toml
%changelog
* Mon May 18 2026 Gitea Actions <actions@git.lair.cafe> - %{neuron_version}-%{neuron_release}
- Prerelease build from upstream CI binary (%{neuron_flavour} flavour).

View File

@@ -1 +0,0 @@
%_openpgp_sign_id @GPG_NAME@

View File

@@ -1,195 +0,0 @@
#!/bin/env bash
#
# Rolling deploy across the helexa fleet, driven by asset/manifest.yml.
# Installs / upgrades cortex on the gateway host and the appropriate
# helexa-neuron-<flavour> package on each neuron host, then restarts
# their services.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
MANIFEST="${REPO_DIR}/asset/manifest.yml"
if [[ ! -f "${MANIFEST}" ]]; then
echo "fatal: manifest not found at ${MANIFEST}" >&2
exit 1
fi
# Parse the manifest with yq. NOTE: this expects the pip-installed yq
# (a jq wrapper using jq syntax) — `pip install yq`. The Fedora rpm
# `yq` is mikefarah/yq and uses different (yaml-native) syntax; if a
# host has that one instead these queries will fail.
cortex_host=$(yq -r '.cortex.host' "${MANIFEST}")
# Emit one TAB-separated 'host\tflavour' line per neuron.
mapfile -t neuron_entries < <(
yq -r '.neurons[] | .host + "\t" + .flavour' "${MANIFEST}"
)
# Return the installed package's "version-release" string, or
# "(not installed)" when rpm reports the package as absent. Capture
# rpm's output into a variable so its "package X is not installed"
# stdout message (rpm writes that to stdout, not stderr, when -q fails)
# doesn't leak into the result.
installed_nvr() {
local host="$1" pkg="$2"
local nvr
if nvr=$(ssh "${host}" "rpm -q --qf '%{version}-%{release}' ${pkg} 2>/dev/null"); then
echo "${nvr}"
else
echo "(not installed)"
fi
}
# Ensure the rpm.lair.cafe unstable repo is configured AND enabled on
# the remote host.
#
# The upstream .repo file at https://rpm.lair.cafe/lair-cafe-unstable.repo
# ships with `enabled=0` so a host that just fetched it won't start
# pulling unstable packages by accident. We have to explicitly flip
# enabled=1 via `dnf config-manager setopt`. Both addrepo and setopt
# are idempotent.
#
# Non-fatal — if either step fails the subsequent `dnf install` will
# surface a clearer diagnostic on its own.
ensure_lair_repo() {
local host="$1"
if ! ssh "${host}" "test -f /etc/yum.repos.d/lair-cafe-unstable.repo" 2>/dev/null; then
echo "[${host}] adding rpm.lair.cafe unstable repo"
if ! ssh "${host}" sudo dnf config-manager addrepo \
--from-repofile=https://rpm.lair.cafe/lair-cafe-unstable.repo \
>/dev/null 2>&1; then
echo "[${host}] WARNING: failed to add lair.cafe repo file (proceeding anyway)"
return 0
fi
fi
# The .repo file ships enabled=0; flip it on. Cheap, idempotent.
if ! ssh "${host}" sudo dnf config-manager setopt \
lair-cafe-unstable.enabled=1 >/dev/null 2>&1; then
echo "[${host}] WARNING: failed to enable lair-cafe-unstable (proceeding anyway)"
fi
}
# True when the named package needs to be installed or upgraded on the
# remote host — either it's not present, or a newer version exists in
# the repo. False only when the installed version is current.
#
# `dnf check-update <pkg>` returns 0 when the package isn't installed
# at all (there's nothing to update), so we have to probe with rpm -q
# first to distinguish "absent" from "current". Other dnf failures
# collapse into "needs update" so the subsequent install step surfaces
# the real diagnostic rather than this check swallowing it.
needs_update() {
local host="$1" pkg="$2"
# Not installed → needs work.
if ! ssh "${host}" "rpm -q ${pkg}" >/dev/null 2>&1; then
return 0
fi
# Installed; ask dnf whether the repo has something newer.
if ssh "${host}" sudo dnf check-update --refresh -q "${pkg}" >/dev/null 2>&1; then
return 1
else
return 0
fi
}
# ---------------------------------------------------------------------------
# cortex (gateway)
# ---------------------------------------------------------------------------
ensure_lair_repo "${cortex_host}"
cortex_nvr=$(installed_nvr "${cortex_host}" cortex)
if needs_update "${cortex_host}" cortex; then
echo "[${cortex_host}] cortex update available (current: ${cortex_nvr})"
# Stop the service only if the unit file exists — fresh installs
# don't have it, and `systemctl stop` on a missing unit returns
# non-zero, which would otherwise short-circuit the install branch
# under set -e.
if ssh "${cortex_host}" "[ ! -f /usr/lib/systemd/system/cortex.service ] || sudo systemctl stop cortex.service"; then
echo "[${cortex_host}] stopped cortex service"
if dnf_output=$(ssh "${cortex_host}" sudo dnf install --refresh --allowerasing -y cortex 2>&1); then
cortex_nvr=$(installed_nvr "${cortex_host}" cortex)
echo "[${cortex_host}] installed/upgraded cortex to ${cortex_nvr}"
else
echo "[${cortex_host}] failed to install/upgrade cortex:"
echo "${dnf_output}" | sed "s/^/[${cortex_host}] /"
fi
else
echo "[${cortex_host}] failed to stop cortex service"
fi
else
echo "[${cortex_host}] cortex is up to date (${cortex_nvr})"
ssh "${cortex_host}" sudo systemctl stop cortex.service || true
fi
# Sync cortex.toml whether the package was upgraded or not — the config
# can change without a package bump.
if rsync \
--archive \
--compress \
--rsync-path 'sudo rsync' \
--chown root:root \
--chmod 644 \
"${REPO_DIR}/cortex.toml" \
"${cortex_host}:/etc/cortex/cortex.toml"; then
echo "[${cortex_host}] sync'd cortex.toml"
else
echo "[${cortex_host}] failed to sync cortex.toml"
fi
ssh "${cortex_host}" sudo systemctl daemon-reload
if ssh "${cortex_host}" systemctl is-active --quiet cortex.service; then
echo "[${cortex_host}] cortex service is active"
elif ssh "${cortex_host}" sudo systemctl start cortex.service; then
echo "[${cortex_host}] started cortex service"
else
echo "[${cortex_host}] failed to start cortex service"
fi
# ---------------------------------------------------------------------------
# neuron (per-host, flavour from manifest)
# ---------------------------------------------------------------------------
for entry in "${neuron_entries[@]}"; do
IFS=$'\t' read -r neuron_host neuron_flavour <<< "${entry}"
package="helexa-neuron-${neuron_flavour}"
ensure_lair_repo "${neuron_host}"
neuron_nvr=$(installed_nvr "${neuron_host}" "${package}")
if needs_update "${neuron_host}" "${package}"; then
echo "[${neuron_host}] ${package} update available (current: ${neuron_nvr})"
if ssh "${neuron_host}" "[ ! -f /usr/lib/systemd/system/neuron.service ] || sudo systemctl stop neuron.service"; then
echo "[${neuron_host}] stopped neuron service"
# --allowerasing lets dnf swap out a previously-installed
# bare helexa-neuron or a different flavour without manual
# intervention. The Conflicts: clauses in the spec ensure
# only one flavour is ever resident.
if dnf_output=$(ssh "${neuron_host}" sudo dnf install --refresh --allowerasing -y "${package}" 2>&1); then
neuron_nvr=$(installed_nvr "${neuron_host}" "${package}")
echo "[${neuron_host}] installed/upgraded ${package} to ${neuron_nvr}"
# Ensure firewalld allows neuron port
ssh "${neuron_host}" "sudo firewall-cmd --query-service=helexa-neuron --quiet 2>/dev/null || sudo firewall-cmd --add-service=helexa-neuron --permanent && sudo firewall-cmd --reload" 2>/dev/null || true
if ssh "${neuron_host}" "sudo systemctl daemon-reload && sudo systemctl start neuron.service"; then
echo "[${neuron_host}] started neuron service"
else
echo "[${neuron_host}] failed to start neuron service"
fi
else
echo "[${neuron_host}] failed to install ${package}:"
echo "${dnf_output}" | sed "s/^/[${neuron_host}] /"
fi
else
echo "[${neuron_host}] failed to stop neuron service"
fi
else
echo "[${neuron_host}] ${package} is up to date (${neuron_nvr})"
if ssh "${neuron_host}" systemctl is-active --quiet neuron.service; then
echo "[${neuron_host}] neuron service is active"
elif ssh "${neuron_host}" sudo systemctl start neuron.service; then
echo "[${neuron_host}] started neuron service"
else
echo "[${neuron_host}] failed to start neuron service"
fi
fi
done

View File

@@ -1,154 +0,0 @@
#!/usr/bin/env python3
"""Parse RPM repodata and emit a packages.json manifest for the UI."""
import argparse
import gzip
import json
import os
import subprocess
import sys
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
RPM_NS = "http://linux.duke.edu/metadata/common"
OTHER_NS = "http://linux.duke.edu/metadata/other"
REPO_NS = "http://linux.duke.edu/metadata/repo"
def find_repodata_file(repodata_dir, data_type):
"""Read repomd.xml and return the path to a specific data type's file."""
repomd_path = os.path.join(repodata_dir, "repomd.xml")
tree = ET.parse(repomd_path)
root = tree.getroot()
for data in root.findall(f"{{{REPO_NS}}}data"):
if data.get("type") == data_type:
location = data.find(f"{{{REPO_NS}}}location")
if location is not None:
href = location.get("href", "")
return os.path.join(os.path.dirname(repodata_dir), href)
return None
def open_compressed(path):
"""Open a gzip or zstd compressed file for reading."""
if path.endswith(".zst"):
result = subprocess.run(
["zstdcat", path], capture_output=True, check=True
)
import io
return io.BytesIO(result.stdout)
else:
return gzip.open(path, "rb")
def parse_primary(repodata_dir):
"""Parse primary.xml.{gz,zst} and return package metadata."""
path = find_repodata_file(repodata_dir, "primary")
if not path:
print("error: primary metadata not found in repomd.xml", file=sys.stderr)
sys.exit(1)
packages = {}
with open_compressed(path) as f:
tree = ET.parse(f)
for pkg in tree.getroot().findall(f"{{{RPM_NS}}}package"):
if pkg.get("type") != "rpm":
continue
name = pkg.findtext(f"{{{RPM_NS}}}name", "")
version_el = pkg.find(f"{{{RPM_NS}}}version")
ver = version_el.get("ver", "") if version_el is not None else ""
rel = version_el.get("rel", "") if version_el is not None else ""
arch = pkg.findtext(f"{{{RPM_NS}}}arch", "")
size_el = pkg.find(f"{{{RPM_NS}}}size")
size = int(size_el.get("package", "0")) if size_el is not None else 0
time_el = pkg.find(f"{{{RPM_NS}}}time")
build_time = int(time_el.get("build", "0")) if time_el is not None else 0
location_el = pkg.find(f"{{{RPM_NS}}}location")
filename = os.path.basename(location_el.get("href", "")) if location_el is not None else ""
key = f"{name}-{ver}-{rel}"
packages[key] = {
"name": name,
"version": ver,
"release": rel,
"arch": arch,
"summary": pkg.findtext(f"{{{RPM_NS}}}summary", ""),
"size": size,
"buildTime": build_time,
"rpmFilename": filename,
"changelog": [],
}
return packages
def parse_other(repodata_dir, packages):
"""Parse other.xml.gz and attach changelog entries to packages."""
path = find_repodata_file(repodata_dir, "other")
if not path:
return
with open_compressed(path) as f:
tree = ET.parse(f)
for pkg in tree.getroot().findall(f"{{{OTHER_NS}}}package"):
name = pkg.get("name", "")
version_el = pkg.find(f"{{{OTHER_NS}}}version")
ver = version_el.get("ver", "") if version_el is not None else ""
rel = version_el.get("rel", "") if version_el is not None else ""
key = f"{name}-{ver}-{rel}"
if key not in packages:
continue
for entry in pkg.findall(f"{{{OTHER_NS}}}changelog"):
packages[key]["changelog"].append({
"author": entry.get("author", ""),
"date": int(entry.get("date", "0")),
"text": (entry.text or "").strip(),
})
def main():
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--repodata-dir",
required=True,
help="path to the repodata/ directory",
)
parser.add_argument(
"--output",
required=True,
help="path to write packages.json",
)
parser.add_argument(
"--base-url",
required=True,
help="public base URL for the repo (e.g. https://rpm.lair.cafe/fedora/43/x86_64)",
)
args = parser.parse_args()
packages = parse_primary(args.repodata_dir)
parse_other(args.repodata_dir, packages)
manifest = {
"generated": datetime.now(timezone.utc).isoformat(),
"baseUrl": args.base_url,
"packages": list(packages.values()),
}
with open(args.output, "w") as f:
json.dump(manifest, f, indent=2)
print(f"wrote {len(packages)} packages to {args.output}")
if __name__ == "__main__":
main()

View File

@@ -1,141 +0,0 @@
#!/bin/env bash
#
# End-to-end smoke test for a deployed neuron.
#
# Confirms the daemon is reachable, loads a small public Qwen3 GGUF,
# fires a reasoning probe at /v1/chat/completions, and prints the
# answer. Used to validate the candle harness on a real GPU host
# before trusting it for production traffic, and as a regression test
# after pushing new neuron builds.
#
# Usage:
# script/validate-neuron.sh [host] [model_id] [quant]
#
# Defaults:
# host = beast.hanzalova.internal
# model_id = unsloth/Qwen3-0.6B-GGUF (official Qwen3-*-GGUF repos
# ship Q8_0 only; unsloth's mirror ships the full Q-spectrum
# including Q4_K_M)
# quant = Q4_K_M
set -euo pipefail
HOST="${1:-beast.hanzalova.internal}"
MODEL_ID="${2:-unsloth/Qwen3-0.6B-GGUF}"
QUANT="${3:-Q4_K_M}"
PORT="${NEURON_PORT:-13131}"
BASE="http://${HOST}:${PORT}"
# Reasoning probe — concrete, low-temperature answer that small models
# can still get right. "Paris" is a strong signal of basic competence
# beyond gibberish.
PROBE_PROMPT='What is the capital of France? Respond with the city name only, no punctuation.'
EXPECT_SUBSTR='Paris'
MAX_TOKENS=32
# /models/load is synchronous — neuron blocks the response until the
# hf-hub download + GGUF parse + tensor materialisation is done. A
# fresh 0.6B-Q4_K_M is ~400 MB; on a slow link or cold cache that's
# easily a minute. Pick a generous ceiling.
LOAD_TIMEOUT=600
INFER_TIMEOUT=120
say() { printf '[%s] %s\n' "${HOST}" "$*"; }
die() { say "FAIL: $*"; exit 1; }
probe_health() {
curl --silent --fail --max-time 5 "${BASE}/health" >/dev/null \
|| die "neuron not reachable at ${BASE}/health"
}
list_loaded_ids() {
curl --silent --fail "${BASE}/models" | yq -r '.[].id'
}
is_loaded() {
list_loaded_ids 2>/dev/null | grep -Fxq "${MODEL_ID}"
}
trigger_load() {
say "POST /models/load ${MODEL_ID} (quant=${QUANT}, device=[0])"
say " (synchronous; may take a minute on first run while HF downloads)"
local payload
payload=$(cat <<EOF
{
"model_id": "${MODEL_ID}",
"harness": "candle",
"quant": "${QUANT}",
"devices": [0]
}
EOF
)
# --write-out captures the response code on a separate line so we
# can surface a real diagnostic instead of relying on --fail.
local resp http_code body
resp=$(curl --silent --show-error --max-time "${LOAD_TIMEOUT}" \
--write-out '\n__HTTP__%{http_code}' \
-X POST "${BASE}/models/load" \
-H 'content-type: application/json' \
--data "${payload}") || die "curl /models/load failed: $?"
http_code=$(echo "${resp}" | grep -oP '(?<=__HTTP__)\d+$' | tail -1)
body=$(echo "${resp}" | sed '$ s/__HTTP__.*$//')
if [[ "${http_code}" != "200" ]]; then
die "load returned HTTP ${http_code}: ${body}"
fi
say "load returned ${http_code}: ${body}"
}
run_probe() {
say "POST /v1/chat/completions (probe: ${PROBE_PROMPT})"
local payload
payload=$(yq -n -c \
--arg model "${MODEL_ID}" \
--arg content "${PROBE_PROMPT}" \
--argjson tokens "${MAX_TOKENS}" \
'{
model: $model,
messages: [{role: "user", content: $content}],
temperature: 0.1,
max_tokens: $tokens
}')
local resp http_code body
resp=$(curl --silent --show-error --max-time "${INFER_TIMEOUT}" \
--write-out '\n__HTTP__%{http_code}' \
-X POST "${BASE}/v1/chat/completions" \
-H 'content-type: application/json' \
--data "${payload}") || die "curl /v1/chat/completions failed: $?"
http_code=$(echo "${resp}" | grep -oP '(?<=__HTTP__)\d+$' | tail -1)
body=$(echo "${resp}" | sed '$ s/__HTTP__.*$//')
if [[ "${http_code}" != "200" ]]; then
die "inference returned HTTP ${http_code}: ${body}"
fi
echo "${body}"
}
say "validating neuron at ${BASE}"
probe_health
say "/health OK"
if is_loaded; then
say "${MODEL_ID} already loaded"
else
trigger_load
fi
raw=$(run_probe)
echo "---"
echo "${raw}" | yq -r '.'
echo "---"
content=$(echo "${raw}" | yq -r '.choices[0].message.content // empty')
if [[ -z "${content}" ]]; then
die "no content in chat completion response"
fi
say "assistant said: ${content}"
if echo "${content}" | grep -qiF "${EXPECT_SUBSTR}"; then
say "PASS — response contains expected substring '${EXPECT_SUBSTR}'"
exit 0
else
die "response did not contain '${EXPECT_SUBSTR}'"
fi