fix(rpm): explicitly Provides user(name) to satisfy systemd unit Requires

Diagnosing the persistent "Nothing to do" on v0.1.10 surfaced that removing %attr(,,name) from %files wasn't enough. systemd-rpm-macros ships its own rpm dep generator (/usr/lib/rpm/systemd.req) that parses User=/Group= directives from every .service file the package ships and emits Requires: user(NAME)/group(NAME) accordingly. Rpmbuild log from v0.1.10 shows these Requires are still emitted even after the %attr removal. Meanwhile the sysusers provides-generator emits group(NAME) in both unversioned and versioned forms, but only a versioned user(NAME) = <base64> when the u-line has GECOS/home/shell fields. The asymmetry leaves Requires: user(NAME) unresolvable. Add explicit Provides: user(NAME) back to both specs, with a comment documenting the actual cause (systemd unit parsing, not file attrs) so the next person touching these specs doesn't repeat the mistake. Why monsoon didn't hit this: it creates its user in %pre via groupadd/useradd (not sysusers.d), so no Provides are generated at all — matching the Requires: user(monsoon) by luck of the rpm solver treating unknown symbols as soft-fails for that path. Ours went through the sysusers Provides code path and hit the asymmetry instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: migrate rpm changelog generation to reusable action
2026-04-16 15:30:55 +03:00 · 2026-04-16 15:23:45 +03:00 · 2026-04-16 15:04:36 +03:00 · 2026-04-16 14:58:40 +03:00
45 changed files with 531 additions and 4300 deletions
--- a/.gitea/workflows/build-prerelease.yml
+++ b/.gitea/workflows/build-prerelease.yml
@@ -1,326 +0,0 @@
 name: build-prerelease
 # Manually-dispatched workflow that builds CUDA-flavoured neuron binaries
 # (and a single cortex binary), packages each as a Fedora RPM, signs
 # them, and publishes to the `unstable` channel at rpm.lair.cafe.
 #
 # Trigger from the Gitea UI: Actions → build-prerelease → Run workflow.
 # Optionally provide a `ref` to build from a non-default branch.
 #
 # The published packages are versioned as e.g.
 #   helexa-neuron-blackwell-0.1.16-0.1.20260518gitabcdef0.fc43.x86_64
 # so they sort BELOW the eventual 0.1.16-1 stable release.
 on:
  # Auto-build on every push to main so the unstable channel tracks
  # head without a manual dispatch step.
  push:
    branches: [main]
  # Manual dispatch still available to build from a non-main ref.
  workflow_dispatch:
    inputs:
      ref:
        description: "Git ref to build (branch / tag / commit). Defaults to the workflow's branch."
        required: false
        default: ""
 concurrency:
  # Coalesce on branch+event so successive pushes don't pile up; the
  # latest push wins.
  group: prerelease-build-${{ github.ref }}
  cancel-in-progress: true
 env:
  CARGO_INCREMENTAL: "0"
 jobs:
  prepare:
    name: Resolve version stamps
    runs-on: rust
    outputs:
      version: ${{ steps.info.outputs.version }}
      release: ${{ steps.info.outputs.release }}
      short_sha: ${{ steps.info.outputs.short_sha }}
      commit_date: ${{ steps.info.outputs.commit_date }}
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref }}
          fetch-depth: 0
      - id: info
        run: |
          set -eux
          VERSION=$(awk -F\" '/^version[[:space:]]*=/ { print $2; exit }' Cargo.toml)
          SHORT_SHA=$(git rev-parse --short=7 HEAD)
          COMMIT_DATE=$(git log -1 --format=%cd --date=format:%Y%m%d HEAD)
          # Prerelease release stamp sorts before "1" (the stable release).
          RELEASE="0.1.${COMMIT_DATE}git${SHORT_SHA}"
          echo "version=${VERSION}" >> "$GITHUB_OUTPUT"
          echo "release=${RELEASE}" >> "$GITHUB_OUTPUT"
          echo "short_sha=${SHORT_SHA}" >> "$GITHUB_OUTPUT"
          echo "commit_date=${COMMIT_DATE}" >> "$GITHUB_OUTPUT"
  build-cortex:
    name: Build cortex binary
    needs: prepare
    # runner-rust image already provides rust/cargo/clippy/rustfmt via
    # dnf — no rustup install step needed.
    runs-on: rust
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref }}
      - name: Build cortex (release)
        run: cargo build --release -p cortex-cli
      - name: Stage binary
        run: |
          mkdir --parents artifacts
          cp target/release/cortex artifacts/cortex
          ./artifacts/cortex --version || true
      - uses: actions/upload-artifact@v3
        with:
          name: cortex-fc43
          path: artifacts/cortex
          retention-days: 1
  build-neuron:
    name: Build neuron-${{ matrix.flavour }}
    needs: prepare
    strategy:
      fail-fast: false
      matrix:
        include:
          - flavour: ampere
            compute_cap: "86"
            runner: cuda-13.0
            cuda_home: /usr/local/cuda-13.0
            build_jobs: 8
            nvcc_threads: 4
            cargo_features: "cuda cudnn flash-attn"
          - flavour: ada
            compute_cap: "89"
            runner: cuda-13.0
            cuda_home: /usr/local/cuda-13.0
            build_jobs: 8
            nvcc_threads: 4
            cargo_features: "cuda cudnn flash-attn"
          - flavour: blackwell
            compute_cap: "120"
            runner: cuda-13.0
            cuda_home: /usr/local/cuda-13.0
            build_jobs: 8
            nvcc_threads: 4
            cargo_features: "cuda cudnn flash-attn"
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref }}
      - name: Build neuron with CUDA (${{ matrix.flavour }})
        run: |
          set -eux
          export PATH="${{ matrix.cuda_home }}/bin:${PATH}"
          export LD_LIBRARY_PATH="${{ matrix.cuda_home }}/targets/x86_64-linux/lib:${{ matrix.cuda_home }}/lib64:${LD_LIBRARY_PATH:-}"
          export LIBRARY_PATH="${{ matrix.cuda_home }}/targets/x86_64-linux/lib:${{ matrix.cuda_home }}/lib64:${LIBRARY_PATH:-}"
          cargo build --release -p neuron --features "${{ matrix.cargo_features }}"
        env:
          CUDA_COMPUTE_CAP: ${{ matrix.compute_cap }}
          CARGO_BUILD_JOBS: ${{ matrix.build_jobs }}
          NVCC_THREADS: ${{ matrix.nvcc_threads }}
      - name: Stage binary
        run: |
          mkdir --parents artifacts
          cp target/release/neuron artifacts/neuron-${{ matrix.flavour }}
          file "artifacts/neuron-${{ matrix.flavour }}"
      - uses: actions/upload-artifact@v3
        with:
          name: neuron-${{ matrix.flavour }}-fc43
          path: artifacts/neuron-${{ matrix.flavour }}
          retention-days: 1
  package-cortex:
    name: Package cortex RPM
    needs: [prepare, build-cortex]
    runs-on: rpm
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref }}
      - uses: actions/download-artifact@v3
        with:
          name: cortex-fc43
          path: artifacts/
      - name: Build RPM
        run: |
          set -eux
          rm -f ~/.rpmmacros
          rpmdev-setuptree
          cp artifacts/cortex ~/rpmbuild/SOURCES/
          cp data/cortex.service ~/rpmbuild/SOURCES/
          cp data/cortex-sysusers.conf ~/rpmbuild/SOURCES/
          cp data/cortex-firewalld.xml ~/rpmbuild/SOURCES/
          cp cortex.example.toml ~/rpmbuild/SOURCES/
          cp models.example.toml ~/rpmbuild/SOURCES/
          cp LICENSE ~/rpmbuild/SOURCES/
          rpmbuild -bb rpm/cortex-prerelease.spec \
            --define "cortex_version ${{ needs.prepare.outputs.version }}" \
            --define "cortex_prerelease ${{ needs.prepare.outputs.release }}" \
            --undefine dist \
            --define "dist .fc43"
      - uses: actions/upload-artifact@v3
        with:
          name: rpm-cortex-fc43
          path: ~/rpmbuild/RPMS/x86_64/*.rpm
          retention-days: 7
  package-neuron:
    name: Package helexa-neuron-${{ matrix.flavour }} RPM
    needs: [prepare, build-neuron]
    runs-on: rpm
    strategy:
      fail-fast: false
      matrix:
        include:
          - flavour: ampere
          - flavour: ada
          - flavour: blackwell
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref }}
      - uses: actions/download-artifact@v3
        with:
          name: neuron-${{ matrix.flavour }}-fc43
          path: artifacts/
      - name: Build RPM
        run: |
          set -eux
          rm -f ~/.rpmmacros
          rpmdev-setuptree
          cp artifacts/neuron-${{ matrix.flavour }} ~/rpmbuild/SOURCES/
          cp data/neuron.service ~/rpmbuild/SOURCES/
          cp data/neuron-sysusers.conf ~/rpmbuild/SOURCES/
          cp data/neuron-firewalld.xml ~/rpmbuild/SOURCES/
          cp neuron.example.toml ~/rpmbuild/SOURCES/
          cp LICENSE ~/rpmbuild/SOURCES/
          rpmbuild -bb rpm/helexa-neuron-prerelease.spec \
            --define "neuron_version ${{ needs.prepare.outputs.version }}" \
            --define "neuron_flavour ${{ matrix.flavour }}" \
            --define "neuron_prerelease ${{ needs.prepare.outputs.release }}" \
            --undefine dist \
            --define "dist .fc43"
      - uses: actions/upload-artifact@v3
        with:
          name: rpm-neuron-${{ matrix.flavour }}-fc43
          path: ~/rpmbuild/RPMS/x86_64/*.rpm
          retention-days: 7
  publish:
    name: Publish to rpm.lair.cafe (unstable)
    needs: [package-cortex, package-neuron]
    runs-on: rpm
    concurrency:
      group: rpm-publish
      cancel-in-progress: false
    env:
      RPM_REPO_HOST: oolon.kosherinata.internal
      FEDORA_VERSION: "43"
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref }}
      - name: Download all built RPMs
        uses: actions/download-artifact@v3
        with:
          path: rpms/
          pattern: rpm-*-fc43
      - name: Flatten RPM artifacts
        run: |
          set -eux
          find rpms/ -name '*.rpm' -exec mv --target-directory=rpms/ {} +
          find rpms/ -mindepth 1 -type d -empty -delete
          ls -la rpms/
      - name: Check for sequoia-sq
        run: |
          if ! command -v sq &> /dev/null; then
            echo "ERROR: sequoia-sq is not installed. Install with: sudo dnf install sequoia-sq"
            exit 1
          fi
      - name: Import signing key
        env:
          # Pass secrets via env so values stay out of the rendered shell
          # script (which Gitea includes in step logs). Template
          # expansion of ${{ secrets.X }} inside `run:` writes the literal
          # value into the script and depends on Gitea's log masker to
          # scrub it — fragile for multi-line keys.
          RPM_SIGNING_KEY: ${{ secrets.RPM_SIGNING_KEY }}
          RPM_SIGNING_KEY_ID: ${{ secrets.RPM_SIGNING_KEY_ID }}
        run: |
          echo "$RPM_SIGNING_KEY" | gpg --batch --import
          fpr=$(gpg --batch --with-colons --list-keys "$RPM_SIGNING_KEY_ID" | awk -F: '/^fpr:/ { print $10; exit }')
          echo "${fpr}:6:" | gpg --batch --import-ownertrust
          sed "s/@GPG_NAME@/$RPM_SIGNING_KEY_ID/" rpm/rpmmacros > ~/.rpmmacros
      - name: Sign RPMs
        run: |
          set -eux
          for rpm in rpms/*.rpm; do
            echo "signing ${rpm}..."
            rpm --addsign "${rpm}"
          done
      - name: Set up SSH for rsync
        run: |
          install --directory --mode 700 ~/.ssh
          echo "${RSYNC_SSH_KEY}" | install --mode 600 /dev/stdin ~/.ssh/id_ed25519
        env:
          RSYNC_SSH_KEY: ${{ secrets.RSYNC_SSH_KEY }}
      - name: Test SSH connectivity
        run: |
          ssh -o StrictHostKeyChecking=accept-new "gitea_ci@${RPM_REPO_HOST}" exit
      - name: Ensure unstable repo directory exists
        run: |
          ssh "gitea_ci@${RPM_REPO_HOST}" \
            "mkdir --parents /var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable"
      - name: Sync RPMs to unstable repo
        run: |
          rsync \
            --archive \
            --verbose \
            --chmod D755,F644 \
            rpms/*.rpm \
            "gitea_ci@${RPM_REPO_HOST}:/var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable/"
      - name: Update unstable repo metadata
        run: |
          ssh "gitea_ci@${RPM_REPO_HOST}" \
            "cd /var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable && createrepo_c --update ."
      - name: Generate packages.json manifest
        run: |
          scp script/generate-packages-json.py "gitea_ci@${RPM_REPO_HOST}:/tmp/"
          ssh "gitea_ci@${RPM_REPO_HOST}" \
            "python3 /tmp/generate-packages-json.py \
              --repodata-dir /var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable/repodata \
              --output /var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable/packages.json \
              --base-url https://rpm.lair.cafe/fedora/${FEDORA_VERSION}/x86_64/unstable"
--- a/.gitea/workflows/ci.yml
+++ b/.gitea/workflows/ci.yml
@@ -16,42 +16,53 @@ env:
  SCCACHE_S3_USE_SSL: "false"
  AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
  # fmt, clippy, and test all run in parallel on the same `rust` runner
  # and would otherwise share /root/.cache/act/<hash>/hostexecutor/target/,
  # racing each other's cargo temp files (.tmpXXXXXX) and failing builds
  # mid-compile. Give each job its own target directory so the invocations
  # don't collide. sccache still backs the actual rustc cache, so the
  # rebuild penalty is small.
  CARGO_TARGET_DIR: target-${{ github.job }}
 jobs:
-  fmt:
+  check:
-    name: Format
+    name: Format, lint, build, test
-    runs-on: rust
+    runs-on: fedora
    steps:
      - uses: actions/checkout@v4
      - run: cargo fmt --check --all
-  clippy:
+      - name: Cache cargo registry and target
-    name: Clippy
+        uses: actions/cache@v4
-    runs-on: rust
+        with:
-    steps:
+          path: |
-      - uses: actions/checkout@v4
+            ~/.cargo/bin
-      - run: cargo clippy --workspace -- -D warnings
+            ~/.cargo/registry/index
-      - run: sccache --show-stats
+            ~/.cargo/registry/cache
            ~/.cargo/git/db
            target
          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
          restore-keys: |
            ${{ runner.os }}-cargo-
-  test:
+      - name: Ensure sccache with S3 support
-    name: Test
+        env:
-    runs-on: rust
+          RUSTC_WRAPPER: ""
-    steps:
+        run: |
-      - uses: actions/checkout@v4
+          if sccache --version 2>/dev/null && sccache --show-stats 2>/dev/null; then
-      - run: cargo test --workspace
+            echo "sccache with S3 support already installed"
-      - run: sccache --show-stats
+          else
            cargo install sccache --features s3 --locked
          fi
      - name: Check formatting
        run: cargo fmt --check --all
      - name: Clippy
        run: cargo clippy --workspace -- -D warnings
      - name: Test
        run: cargo test --workspace
      - name: Show sccache stats
        run: sccache --show-stats
  srpm-cortex:
    name: Build cortex SRPM
-    runs-on: rpm
+    runs-on: fedora
-    needs: [fmt, clippy, test]
+    needs: check
    if: startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: actions/checkout@v4
@@ -110,8 +121,8 @@ jobs:
  srpm-neuron:
    name: Build neuron SRPM
-    runs-on: rpm
+    runs-on: fedora
-    needs: [fmt, clippy, test]
+    needs: check
    if: startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: actions/checkout@v4
@@ -128,37 +139,37 @@ jobs:
        run: |
          VERSION="${{ steps.version.outputs.VERSION }}"
          sed -i '/\[workspace\.package\]/,/\[/{ s/^version = ".*"/version = "'"${VERSION}"'"/ }' Cargo.toml
-          sed -i "s/^Version:.*/Version:        ${VERSION}/" helexa-neuron.spec
+          sed -i "s/^Version:.*/Version:        ${VERSION}/" neuron.spec
      - name: Generate changelog entry
        uses: https://git.lair.cafe/actions/rpm-changelog@v1
        with:
-          spec: helexa-neuron.spec
+          spec: neuron.spec
          version: ${{ steps.version.outputs.VERSION }}
      - name: Generate source tarball
        run: |
          set -ex
          VERSION="${{ steps.version.outputs.VERSION }}"
-          tar czf /tmp/helexa-neuron-${VERSION}.tar.gz \
+          tar czf /tmp/neuron-${VERSION}.tar.gz \
-            --transform "s,^\.,helexa-neuron-${VERSION}," \
+            --transform "s,^\.,neuron-${VERSION}," \
            --exclude='./target' \
            --exclude='./.git' \
            --exclude='*.tar.gz' \
            --exclude='*.src.rpm' \
            .
-          mv /tmp/helexa-neuron-${VERSION}.tar.gz .
+          mv /tmp/neuron-${VERSION}.tar.gz .
      - name: Vendor Rust dependencies
        run: |
          VERSION="${{ steps.version.outputs.VERSION }}"
          cargo vendor vendor/
-          tar czf helexa-neuron-${VERSION}-vendor.tar.gz vendor/
+          tar czf neuron-${VERSION}-vendor.tar.gz vendor/
          rm -rf vendor/
      - name: Build SRPM
        run: |
-          rpmbuild -bs helexa-neuron.spec \
+          rpmbuild -bs neuron.spec \
            --define "_sourcedir $(pwd)" \
            --define "_srcrpmdir $(pwd)"
@@ -170,7 +181,7 @@ jobs:
  copr-cortex:
    name: Publish cortex to COPR
-    runs-on: fedora-43
+    runs-on: fedora
    needs: srpm-cortex
    steps:
      - name: Download SRPM
@@ -181,13 +192,13 @@ jobs:
      - name: Publish to COPR
        uses: https://git.lair.cafe/actions/copr-publish@v1
        with:
-          project: helexa/helexa
+          project: helexa/cortex
          srpm: "*.src.rpm"
          copr-config: ${{ secrets.COPR_CONFIG }}
  copr-neuron:
    name: Publish neuron to COPR
-    runs-on: fedora-43
+    runs-on: fedora
    needs: srpm-neuron
    steps:
      - name: Download SRPM
@@ -198,53 +209,31 @@ jobs:
      - name: Publish to COPR
        uses: https://git.lair.cafe/actions/copr-publish@v1
        with:
-          project: helexa/helexa
+          project: helexa/neuron
          srpm: "*.src.rpm"
          copr-config: ${{ secrets.COPR_CONFIG }}
  bump-version:
    name: Bump version in source
-    runs-on: rust
+    runs-on: fedora
    needs: [copr-cortex, copr-neuron]
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
-      - name: Determine version
+      - name: Stamp version and push
        id: version
        run: echo "VERSION=${GITHUB_REF#refs/tags/v}" >> "$GITHUB_OUTPUT"
      - name: Stamp version
        run: |
          VERSION="${{ steps.version.outputs.VERSION }}"
          sed -i '/\[workspace\.package\]/,/\[/{ s/^version = ".*"/version = "'"${VERSION}"'"/ }' Cargo.toml
          sed -i "s/^Version:.*/Version:        ${VERSION}/" cortex.spec
          sed -i "s/^Version:.*/Version:        ${VERSION}/" helexa-neuron.spec
          cargo check --workspace 2>/dev/null || true
      - name: Generate cortex changelog entry
        uses: https://git.lair.cafe/actions/rpm-changelog@v1
        with:
          spec: cortex.spec
          version: ${{ steps.version.outputs.VERSION }}
      - name: Generate helexa-neuron changelog entry
        uses: https://git.lair.cafe/actions/rpm-changelog@v1
        with:
          spec: helexa-neuron.spec
          version: ${{ steps.version.outputs.VERSION }}
      - name: Commit and push
        env:
          GITEA_TOKEN: ${{ secrets.GITEA_TOKEN }}
        run: |
-          VERSION="${{ steps.version.outputs.VERSION }}"
+          VERSION="${GITHUB_REF#refs/tags/v}"
          sed -i '/\[workspace\.package\]/,/\[/{ s/^version = ".*"/version = "'"${VERSION}"'"/ }' Cargo.toml
          sed -i "s/^Version:.*/Version:        ${VERSION}/" cortex.spec
          sed -i "s/^Version:.*/Version:        ${VERSION}/" neuron.spec
          cargo check --workspace 2>/dev/null || true
          git config user.name "Gitea Actions"
          git config user.email "actions@git.lair.cafe"
-          git add Cargo.toml Cargo.lock cortex.spec helexa-neuron.spec
+          git add Cargo.toml Cargo.lock cortex.spec neuron.spec
          if git diff --cached --quiet; then
-            echo "Nothing to commit for ${VERSION}"
+            echo "Version already at ${VERSION}"
          else
            git commit -m "chore: bump version to ${VERSION}"
            git remote set-url origin "https://gitea-actions:${GITEA_TOKEN}@git.lair.cafe/helexa/cortex.git"
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -125,8 +125,7 @@ automatically. Clippy warnings must be resolved, not suppressed with
  - One or more GPU nodes running mistral.rs on port 8080
  - Optionally a metrics-only node (no GPU) for Prometheus/Grafana
 - Each node runs `mistralrs serve` on port 8080
- Gateway listens on port 31313 (API) and 31314 (metrics)
+- Gateway listens on port 8000 (API) and 9100 (metrics)
 - neuron listens on port 13131 on each GPU host
 - TLS terminated at gateway or via nginx; internal traffic is plaintext over WireGuard
 ## Conventions
@@ -381,7 +380,7 @@ processes (one process per loaded model, each on its own port).
 ## neuron API
-neuron exposes an HTTP API on port 13131 that cortex polls and calls.
+neuron exposes an HTTP API on port 9090 that cortex polls and calls.
 ```
 GET  /discovery
@@ -425,8 +424,8 @@ endpoint. cortex.toml shrinks to:
 ```toml
 [gateway]
-listen = "0.0.0.0:31313"
+listen = "0.0.0.0:8000"
-metrics_listen = "0.0.0.0:31314"
+metrics_listen = "0.0.0.0:9100"
 [eviction]
 strategy = "lru"
@@ -434,15 +433,15 @@ defrag_after_cycles = 50
 [[neurons]]
 name = "beast"
-endpoint = "http://beast.hanzalova.internal:13131"
+endpoint = "http://beast.hanzalova.internal:9090"
 [[neurons]]
 name = "benjy"
-endpoint = "http://benjy.hanzalova.internal:13131"
+endpoint = "http://benjy.kosherinata.internal:9090"
 [[neurons]]
 name = "quadbrat"
-endpoint = "http://quadbrat.hanzalova.internal:13131"
+endpoint = "http://quadbrat.hanzalova.internal:9090"
 ```
 On startup and periodically, cortex calls `GET /discovery` and
@@ -522,7 +521,7 @@ cortex/
 │   │       └── metrics.rs      # prometheus exporter (unchanged)
 │   ├── neuron/                 # node plane (replaces cortex-agent)
 │   │   └── src/
-│   │       ├── main.rs         # binary entrypoint, axum server on :13131
+│   │       ├── main.rs         # binary entrypoint, axum server on :9090
 │   │       ├── discovery.rs    # nvidia-smi, device enumeration
 │   │       ├── health.rs       # runtime GPU polling
 │   │       ├── api.rs          # HTTP handlers for /discovery, /models, etc.
@@ -596,65 +595,70 @@ placement matching can be added incrementally.
 Completed. Both packages have RPM specs, systemd units, and example configs.
 CI builds parallel SRPMs on tag push and publishes to separate COPR repos.
- `cortex.spec` — installs the `cortex` binary. Package name keeps the
+- `cortex.spec` → `helexa/cortex` COPR: binary, systemd unit, config files
-  short `cortex` because no Fedora package collides with it.
+- `neuron.spec` → `helexa/neuron` COPR: binary, systemd unit, config
 - `helexa-neuron.spec` — installs the `neuron` binary under package name
  `helexa-neuron`. Renamed from bare `neuron` to avoid collision with
  Fedora's NEURON neural-simulation package
  (https://src.fedoraproject.org/rpms/neuron); binary, systemd unit,
  system user, and config dir all stay named `neuron` since those are
  project-local contexts.
 - `data/cortex.service`, `data/neuron.service` — systemd units
 - `cortex.example.toml`, `neuron.example.toml`, `models.example.toml`
- CI: parallel `srpm-cortex` + `srpm-neuron` jobs, then parallel COPR
+- CI: parallel `srpm-cortex` + `srpm-neuron` jobs, then parallel COPR publish
  publish to a single project `helexa/helexa` hosting both packages.
 Install:
 ```sh
-dnf copr enable helexa/helexa
+dnf copr enable helexa/cortex && dnf install cortex    # gateway host
-dnf install cortex                # gateway host
+dnf copr enable helexa/neuron && dnf install neuron    # GPU nodes
 dnf install helexa-neuron         # GPU nodes
 ```
-## 2026-05-18 addendum: candle-native pivot
+### Phase 11: llama.cpp harness stub
-Phases 11 (llama.cpp harness) and 12 (mistral.rs COPR) below are
+**Goal:** Prove the harness abstraction works with a second engine.
 **superseded**. The project no longer treats mistral.rs or llama.cpp as
 dependencies — both are conceptually out of scope. neuron becomes a
 candle-native inference daemon, with `Harness` retained as an
 internal seam for adding future engines (vision/audio/diffusion) but
 its only implementation being in-process candle.
-The full staged plan for this pivot lives at
+**Steps:**
-`~/.claude/plans/create-a-more-aggressive-calm-naur.md`. Summary:
+1. `crates/neuron/src/harness/llamacpp.rs` — implement the `Harness`
   trait for llama.cpp's `llama-server`.
   - `start()` — launch `llama-server` with the correct model path,
     `--port`, `--n-gpu-layers`, `--tensor-split` args. Track the
     child process.
   - `stop()` — send SIGTERM to the child process.
   - `list_models()` — llama-server serves one model per process, so
     return a single-element list.
   - `load_model()` — start a new llama-server process for this model.
   - `unload_model()` — stop the process.
   - `inference_endpoint()` — return `http://localhost:{assigned_port}`.
 2. Port allocation: neuron assigns ports from a range (e.g. 8100-8199)
   to llama-server instances.
 3. Register in `HarnessRegistry` when configured:
   ```toml
   [[harnesses]]
   name = "llamacpp"
   binary = "/usr/local/bin/llama-server"
   port_range = [8100, 8199]
   ```
 4. Tests: mock llama-server (simple HTTP server returning canned
   responses), test load/unload/endpoint lifecycle.
- **Stage 1 (this commit):** delete `mistralrs.rs` and `llamacpp.rs`,
+**Done when:** A model with `harness = "llamacpp"` in `models.toml` can
-  scaffold inert `CandleHarness`, drop `endpoint`/`systemd_unit` from
+be loaded and served through cortex. Tests pass with mock llama-server.
  `HarnessConfig`, default no-op `start`/`stop` on the `Harness` trait.
 - **Stages 2–4:** wire up candle model load/unload (quantized Qwen3
  first), add OpenAI-compatible inference endpoint in neuron, then SSE
  streaming.
 - **Stages 5–6:** load-on-activation (default models in config) and
  unload-on-deactivation (graceful shutdown).
 - **Stages 7–8:** multi-GPU tensor parallelism and broader model/quant
  coverage.
-Sections of this document that describe mistral.rs HTTP behaviour
+### Phase 12 (lower priority): mistral.rs COPR packaging
 ("mistral.rs API gotchas") are retained as historical context for
 Phases 1–10 — they document what was true while the project depended
 on mistral.rs. They do not describe current behaviour.
---
+**Goal:** Fedora RPMs for mistral.rs built against specific CUDA versions.
-### Phase 11 (superseded): llama.cpp harness stub
+**Steps:**
 1. `mistralrs-cuda.spec` — RPM spec that clones a pinned mistral.rs git
   tag, builds with `--features cuda`, links against the system CUDA
   toolkit. Produces `mistralrs-cuda13-server` (CUDA 13.x / sm_120) and
   `mistralrs-cuda12-server` (CUDA 12.x / sm_89). Install binary to
   `/usr/local/bin/mistralrs`.
 2. COPR build config: enable the NVIDIA CUDA repo as a build dependency.
   Pin the CUDA toolkit version in `BuildRequires`.
 3. Gitea Actions or manual workflow: bump the mistral.rs tag in the spec,
   trigger COPR rebuild.
 4. neuron's mistralrs harness config references which binary/package
   provides the mistral.rs binary. neuron could warn at startup if the
   installed mistral.rs CUDA version doesn't match the discovered driver.
-~~Originally planned as a second engine to prove the harness
+**Done when:** `dnf install mistralrs-cuda13-server` on beast provides a
-abstraction.~~ Replaced by the candle harness work in the 2026-05-18
+working `mistralrs` binary built for Blackwell GPUs. `dnf install
-addendum above. llama.cpp's any-model/any-hardware breadth is no
+mistralrs-cuda12-server` on benjy provides one built for Ada GPUs.
 longer in scope for helexa.
-### Phase 12 (superseded): mistral.rs COPR packaging
+This is a separate repo/spec — not part of the cortex workspace — but
-
+tightly coupled operationally. Track it as a sibling project.
 ~~Originally planned to ship CUDA-versioned mistral.rs RPMs.~~ Replaced
 by the candle harness work in the 2026-05-18 addendum above. With
 mistral.rs out of the dependency tree, there is nothing to package.
--- a/Cargo.lock
+++ b/Cargo.lock
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -8,7 +8,7 @@ members = [
 ]
 [workspace.package]
-version = "0.1.16"
+version = "0.1.8"
 edition = "2024"
 license = "GPL-3.0-or-later"
 repository = "https://git.lair.cafe/helexa/cortex"
@@ -27,7 +27,7 @@ serde = { version = "1", features = ["derive"] }
 serde_json = "1"
 toml = "0.8"
-# http client (for proxying to neuron backends)
+# http client (for proxying to mistralrs backends)
 reqwest = { version = "0.12", features = ["json", "stream"] }
 # observability
--- a/README.md
+++ b/README.md
@@ -1,23 +1,22 @@
 # cortex
-A Rust reverse-proxy and fleet management layer for multi-node GPU inference
+A Rust reverse-proxy and fleet management layer for multi-node
-clusters. Cortex sits in front of one or more `neuron` daemons (each running
+[mistral.rs](https://github.com/EricLBuehler/mistral.rs) inference clusters.
 candle-based inference on a local GPU host) and presents a unified OpenAI +
 Anthropic compatible API surface.
 ## Problem
 Running local LLMs across multiple GPU nodes (different VRAM tiers, different
 model affinities) requires a unified API surface that:
- Presents a **single `/v1/models` catalogue** merging every model that can be
+- Presents a **single `/v1/models` catalogue** merging every model across every
-  served by any neuron in the fleet.
+  node.
- **Routes requests** to the correct node based on where a model is loaded
+- **Routes requests** to the correct node based on where a model is loaded (or
-  (or can be loaded), handling cold-load and eviction transparently.
+  *can* be loaded).
- Manages **model lifecycle** — load on demand, unload cold models, pin
+- Manages **model lifecycle** — unload cold models, reload on demand, pin
-  critical ones — by calling each neuron's `/models/{load,unload}` API.
+  critical ones — using the mistral.rs
  `/v1/models/{unload,reload,status}` HTTP API (PR #1828+).
 - Translates between **OpenAI and Anthropic** request/response envelopes so
-  every client speaks whichever dialect it prefers.
+  every client in the homelab speaks whichever dialect it prefers.
 - Captures **per-request metrics** (tokens, tok/s, TTFT, latency) and exposes
  them as Prometheus counters/histograms.
@@ -31,17 +30,18 @@ model affinities) requires a unified API surface that:
       └────────────────┴──────┬───────┴───────────────┘
                               │
                    ┌──────────▼──────────┐
-                    │      cortex         │
+                    │   cortex     │
-                    │  (cortex-gateway)   │
+                    │   (cortex-gateway)      │
                    │                     │
                    │  Router · Metrics   │
                    │  Evictor · Translate│
                    └──┬──────┬────────┬──┘
                       │      │        │
            ┌──────────▼┐  ┌──▼─────┐  ┌▼──────────┐
-            │  neuron   │  │ neuron │  │  neuron   │
+            │ gpu-large │  │gpu-med │  │ gpu-small │
-            │  :13131   │  │ :13131 │  │  :13131   │
+            │ mistralrs │  │mistral │  │ mistralrs │
-            │  candle   │  │ candle │  │  candle   │
+            │ serve     │  │rs serve│  │ serve     │
            │ :8080     │  │ :8080  │  │  :8080    │
            └───────────┘  └────────┘  └───────────┘
                  private network (.internal)
 ```
@@ -50,48 +50,70 @@ model affinities) requires a unified API surface that:
 | Crate | Purpose |
 |---|---|
-| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
+| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic request/response envelopes |
-| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
+| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, metrics exporter |
-| `neuron` | Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API |
+| `cortex-agent` | Per-node sidecar: polls local mistralrs, reports to gateway, handles restart/defrag |
 | `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) |
 ## Node setup
-Each GPU node runs `neuron` (listening on `:13131`). Neuron uses
+Each GPU node runs `mistralrs serve` with a multi-model config. Models are
-huggingface/candle for in-process inference — there is no external
+declared but start **unloaded** — mistral.rs lazy-loads on first request and
-inference subprocess to manage.
+the gateway can explicitly unload/reload via the HTTP API.
-The neuron RPM (`helexa-neuron`) ships a systemd unit:
+Example node systemd unit:
-```sh
+```ini
-dnf copr enable helexa/helexa
+# /etc/systemd/system/mistralrs.service
-dnf install helexa-neuron
+[Unit]
-systemctl enable --now neuron
+Description=mistral.rs inference server
 After=network-online.target
 Wants=network-online.target
 [Service]
 Type=simple
 ExecStart=/usr/local/bin/mistralrs serve \
    --from-config /etc/mistralrs/config.toml \
    --port 8080
 Restart=on-failure
 RestartSec=5
 Environment=CUDA_VISIBLE_DEVICES=0,1
 [Install]
 WantedBy=multi-user.target
 ```
 ## Gateway config
 ```toml
-# /etc/cortex/cortex.toml
+# cortex.toml
 [gateway]
-listen = "0.0.0.0:31313"
+listen = "0.0.0.0:8000"
-metrics_listen = "0.0.0.0:31314"
+metrics_listen = "0.0.0.0:9100"
 [eviction]
 strategy = "lru"        # lru | priority
 defrag_after_cycles = 50
-[[neurons]]
+[[nodes]]
-name = "beast"
+name = "gpu-large"
-endpoint = "http://beast.internal:13131"
+endpoint = "http://gpu-large.internal:8080"
 vram_mb = 49_152        # e.g. 2x RTX 4090
 pinned = ["your-org/large-model"]
-[[neurons]]
+[[nodes]]
-name = "benjy"
+name = "gpu-medium"
-endpoint = "http://benjy.internal:13131"
+endpoint = "http://gpu-medium.internal:8080"
 vram_mb = 24_576        # e.g. RTX 4090
 pinned = ["your-org/medium-model"]
 [[nodes]]
 name = "gpu-small"
 endpoint = "http://gpu-small.internal:8080"
 vram_mb = 12_288        # e.g. RTX 3060
 pinned = ["your-org/embedding-model"]
 ```
 Model placement profiles live in `models.toml` — see `models.example.toml`.
 ## Building
 ```sh
@@ -109,20 +131,19 @@ cargo clippy --workspace -- -D warnings   # warnings are errors
 cargo test --workspace                     # all tests must pass
 ```
-Tagged releases (`v*`) additionally build SRPMs for both `cortex` and
+Tagged releases (`v*`) additionally build an SRPM and publish to COPR.
 `helexa-neuron` and publish to COPR.
 ## Running
 ```sh
 # start the gateway
-cortex serve --config /etc/cortex/cortex.toml
+cortex serve --config cortex.toml
 # check fleet status
 cortex status
 # list all models across nodes
-curl http://localhost:31313/v1/models
+curl http://localhost:8000/v1/models
 ```
 ## License
--- a/asset/manifest.yml
+++ b/asset/manifest.yml
@@ -1,30 +0,0 @@
 # Helexa fleet manifest.
 #
 # Drives rolling deploys via script/deploy.sh and serves as the source
 # of truth for which hosts run cortex vs neuron, and which CUDA
 # compute-capability flavour each neuron host needs.
 #
 # Flavour ↔ NVIDIA generation ↔ compute cap:
 #   ampere    sm_86   (RTX 30 series — e.g. 3060)
 #   ada       sm_89   (RTX 40 series — e.g. 4090)
 #   blackwell sm_120  (RTX 50 series — e.g. 5090)
 #
 # The flavour determines which RPM is installed on a given neuron host:
 # helexa-neuron-<flavour>. Only one flavour may be installed at a time
 # (the packages Conflict: with each other).
 cortex:
  host: hanzalova.internal
 neurons:
  - host: beast.hanzalova.internal
    flavour: blackwell
    gpu: "2x RTX 5090"
  - host: benjy.hanzalova.internal
    flavour: ada
    gpu: "RTX 4090"
  - host: quadbrat.hanzalova.internal
    flavour: ampere
    gpu: "RTX 3060"
--- a/cortex.example.toml
+++ b/cortex.example.toml
@@ -3,22 +3,22 @@
 # Copy to cortex.toml and adjust for your environment.
 #
 # Environment variable overrides use CORTEX_ prefix with __ separators:
-#   CORTEX_GATEWAY__LISTEN=0.0.0.0:31313
+#   CORTEX_GATEWAY__LISTEN=0.0.0.0:9000
 [gateway]
-listen = "0.0.0.0:31313"
+listen = "0.0.0.0:8000"
-metrics_listen = "0.0.0.0:31314"
+metrics_listen = "0.0.0.0:9100"
 [eviction]
 strategy = "lru"
-# Restart neurons after this many load/unload cycles to defragment VRAM.
+# Restart mistralrs after this many load/unload cycles to defragment VRAM.
 # Set to 0 to disable.
 defrag_after_cycles = 50
 # -- Nodes ---------------------------------------------------------------
-# Each [[nodes]] entry declares a neuron daemon in the fleet.
+# Each [[nodes]] entry declares a mistral.rs instance in the fleet.
-# Models are discovered by polling the neuron's /models endpoint.
+# Models are discovered by polling the node's /v1/models endpoint.
-# Pinned models (see models.toml) are never evicted.
+# Pinned models are never evicted.
 [[nodes]]
 name = "gpu-large"
--- a/cortex.spec
+++ b/cortex.spec
@@ -1,5 +1,5 @@
 Name:           cortex
-Version:        0.1.16
+Version:        0.1.8
 Release:        1%{?dist}
 Summary:        Inference gateway for multi-node GPU clusters
@@ -21,7 +21,6 @@ BuildRequires:  systemd-rpm-macros
 Requires(pre):  shadow-utils
 Requires:       systemd
 Requires:       firewalld-filesystem
 # systemd-rpm-macros ships a unit dep generator that parses User=/Group=
 # from our .service file and emits Requires: user(cortex)/group(cortex).
@@ -57,7 +56,6 @@ cargo build --release -p cortex-cli
 install -Dm755 target/release/cortex %{buildroot}%{_bindir}/cortex
 install -Dm644 data/cortex.service %{buildroot}%{_unitdir}/cortex.service
 install -Dm644 data/cortex-sysusers.conf %{buildroot}%{_sysusersdir}/cortex.conf
 install -Dm644 data/cortex-firewalld.xml %{buildroot}%{_prefix}/lib/firewalld/services/cortex.xml
 install -dm755 %{buildroot}%{_sysconfdir}/cortex
 install -Dm644 cortex.example.toml %{buildroot}%{_sysconfdir}/cortex/cortex.toml
 install -Dm644 models.example.toml %{buildroot}%{_sysconfdir}/cortex/models.toml
@@ -80,21 +78,10 @@ install -Dm644 models.example.toml %{buildroot}%{_sysconfdir}/cortex/models.toml
 %{_bindir}/cortex
 %{_unitdir}/cortex.service
 %{_sysusersdir}/cortex.conf
 %{_prefix}/lib/firewalld/services/cortex.xml
 %dir %{_sysconfdir}/cortex
 %config(noreplace) %{_sysconfdir}/cortex/cortex.toml
 %config(noreplace) %{_sysconfdir}/cortex/models.toml
 %changelog
 * Thu Apr 16 2026 Gitea Actions <actions@git.lair.cafe> - 0.1.16-1
 - chore: ignore local deploy script
 - chore: move default ports out of common-collision ranges
 - ci: drop actions/cache for cargo registry and target
 * Thu Apr 16 2026 Gitea Actions <actions@git.lair.cafe> - 0.1.14-1
 - ci: publish both packages to a single helexa/helexa COPR project
 - fix(rpm): rename neuron package to helexa-neuron
 - ci: commit generated %changelog entries back to main
 * Wed Apr 15 2026 Rob Thijssen <grenade@rob.tn> - 0.1.0-1
 - Initial package
--- a/crates/cortex-cli/src/main.rs
+++ b/crates/cortex-cli/src/main.rs
@@ -5,7 +5,7 @@ use tracing_subscriber::EnvFilter;
 #[derive(Parser)]
 #[command(name = "cortex")]
-#[command(about = "Unified inference gateway for multi-node GPU clusters")]
+#[command(about = "Unified inference gateway for multi-node mistral.rs clusters")]
 #[command(version)]
 struct Cli {
    #[command(subcommand)]
@@ -23,7 +23,7 @@ enum Commands {
    /// Print the fleet status (models, nodes, health).
    Status {
        /// Gateway API endpoint to query.
-        #[arg(short, long, default_value = "http://localhost:31313")]
+        #[arg(short, long, default_value = "http://localhost:8000")]
        endpoint: String,
    },
 }
--- a/crates/cortex-core/src/anthropic.rs
+++ b/crates/cortex-core/src/anthropic.rs
@@ -2,7 +2,7 @@
 //!
 //! These mirror the `/v1/messages` format used by the Anthropic API.
 //! The gateway accepts these, translates to OpenAI format, proxies to
-//! the inference backend (neuron), then translates the response back.
+//! mistral.rs, then translates the response back.
 use serde::{Deserialize, Serialize};
 use serde_json::Value;
--- a/crates/cortex-core/src/config.rs
+++ b/crates/cortex-core/src/config.rs
@@ -22,9 +22,9 @@ fn default_models_path() -> String {
 #[derive(Debug, Clone, Serialize, Deserialize)]
 pub struct GatewaySettings {
-    /// Address to listen on for API requests (e.g. "0.0.0.0:31313")
+    /// Address to listen on for API requests (e.g. "0.0.0.0:8000")
    pub listen: String,
-    /// Address to listen on for Prometheus metrics (e.g. "0.0.0.0:31314")
+    /// Address to listen on for Prometheus metrics (e.g. "0.0.0.0:9100")
    pub metrics_listen: String,
 }
@@ -50,7 +50,7 @@ pub enum EvictionStrategy {
 pub struct NeuronEndpoint {
    /// Human-readable node name (e.g. "beast")
    pub name: String,
-    /// Base URL of the neuron daemon (e.g. "http://beast.internal:13131")
+    /// Base URL of the neuron daemon (e.g. "http://beast.internal:9090")
    pub endpoint: String,
 }
@@ -70,8 +70,8 @@ impl Default for GatewayConfig {
    fn default() -> Self {
        Self {
            gateway: GatewaySettings {
-                listen: "0.0.0.0:31313".into(),
+                listen: "0.0.0.0:8000".into(),
-                metrics_listen: "0.0.0.0:31314".into(),
+                metrics_listen: "0.0.0.0:9100".into(),
            },
            eviction: EvictionSettings {
                strategy: EvictionStrategy::Lru,
--- a/crates/cortex-core/src/harness.rs
+++ b/crates/cortex-core/src/harness.rs
@@ -9,13 +9,13 @@ use async_trait::async_trait;
 use serde::{Deserialize, Serialize};
 /// Configuration for a harness instance on a neuron.
 ///
 /// All current harnesses are in-process (candle); per-harness tuning
 /// (cache paths, device policies, etc.) lives in dedicated config
 /// blocks rather than on this struct.
 #[derive(Debug, Clone, Serialize, Deserialize)]
 pub struct HarnessConfig {
    pub name: String,
    /// Base URL of the harness (e.g. "http://localhost:8080" for mistral.rs).
    pub endpoint: Option<String>,
    /// Systemd unit name, if the harness is managed via systemd.
    pub systemd_unit: Option<String>,
 }
 /// Health status of a harness process.
@@ -47,24 +47,16 @@ pub struct ModelInfo {
 }
 /// What an inference harness must do, from neuron's perspective.
 ///
 /// All current harnesses are in-process — they share neuron's address
 /// space and lifecycle. `start`/`stop` therefore default to no-ops; a
 /// future process-supervising harness would override them.
 #[async_trait]
 pub trait Harness: Send + Sync {
-    /// Human-readable name (e.g. "candle").
+    /// Human-readable name (e.g. "mistralrs", "llamacpp", "comfyui").
    fn name(&self) -> &str;
-    /// Start the harness. Default no-op for in-process harnesses.
+    /// Start the harness process if it is not already running.
-    async fn start(&self, _config: &HarnessConfig) -> Result<()> {
+    async fn start(&self, config: &HarnessConfig) -> Result<()>;
        Ok(())
    }
-    /// Stop the harness. Default no-op for in-process harnesses.
+    /// Stop the harness process gracefully.
-    async fn stop(&self) -> Result<()> {
+    async fn stop(&self) -> Result<()>;
        Ok(())
    }
    /// Health check. Returns the harness process status.
    async fn health(&self) -> HarnessHealth;
--- a/crates/cortex-core/src/node.rs
+++ b/crates/cortex-core/src/node.rs
@@ -6,7 +6,7 @@ use std::collections::HashMap;
 #[derive(Debug, Clone)]
 pub struct NodeState {
    pub name: String,
-    /// Base URL of the neuron daemon (e.g. "http://beast.internal:13131").
+    /// Base URL of the neuron daemon (e.g. "http://beast.internal:9090").
    pub endpoint: String,
    pub healthy: bool,
    pub models: HashMap<String, ModelEntry>,
--- a/crates/cortex-core/src/openai.rs
+++ b/crates/cortex-core/src/openai.rs
@@ -3,7 +3,7 @@
 //! These are a subset sufficient for chat completions (streaming + non-streaming).
 //! Fields not relevant to proxying are captured as `serde_json::Value` via
 //! `#[serde(flatten)]` so we forward them without needing to enumerate every
-//! extension field a backend might support.
+//! extension field mistral.rs supports.
 use serde::{Deserialize, Serialize};
 use serde_json::Value;
@@ -22,7 +22,7 @@ pub struct ChatCompletionRequest {
    pub max_tokens: Option<u64>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub stream: Option<bool>,
-    /// All other fields (tools, response_format, backend extensions, etc.)
+    /// All other fields (tools, response_format, mistral.rs extensions, etc.)
    #[serde(flatten)]
    pub extra: Value,
 }
--- a/crates/cortex-gateway/src/proxy.rs
+++ b/crates/cortex-gateway/src/proxy.rs
@@ -1,4 +1,4 @@
-//! Streaming HTTP reverse proxy to neuron backends.
+//! Streaming HTTP reverse proxy to mistral.rs backends.
 //!
 //! For streaming requests, SSE chunks are forwarded as they arrive.
 //! The proxy captures timing information for metrics but does not
--- a/crates/cortex-gateway/tests/common/mod.rs
+++ b/crates/cortex-gateway/tests/common/mod.rs
@@ -22,7 +22,6 @@ use tokio::net::TcpListener;
 /// - GET /models/:id/endpoint (returns the inference URL)
 /// - POST /models/unload (accepts unload requests)
 /// - GET /v1/chat/completions + POST /v1/chat/completions (inference)
 ///
 /// Returns the neuron base URL.
 pub async fn spawn_mock_neuron() -> String {
    let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
@@ -55,7 +54,7 @@ pub async fn spawn_mock_neuron() -> String {
 async fn mock_neuron_list_models() -> Json<Value> {
    Json(json!([
-        {"id": "test-model", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": 8000}
+        {"id": "test-model", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": 8000}
    ]))
 }
--- a/crates/cortex-gateway/tests/poller.rs
+++ b/crates/cortex-gateway/tests/poller.rs
@@ -12,8 +12,8 @@ use std::sync::Arc;
 async fn test_poller_discovers_models() {
    // Mock neuron reports 2 models via /models endpoint (neuron format).
    let mock_url = common::spawn_mock_neuron_with_models(json!([
-        {"id": "model-a", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": 8000},
+        {"id": "model-a", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": 8000},
-        {"id": "model-b", "harness": "candle", "status": "unloaded", "devices": [], "vram_used_mb": null}
+        {"id": "model-b", "harness": "mistralrs", "status": "unloaded", "devices": [], "vram_used_mb": null}
    ]))
    .await;
@@ -63,8 +63,8 @@ async fn test_poller_discovers_models() {
 #[tokio::test]
 async fn test_poller_updates_gateway_models_endpoint() {
    let mock_url = common::spawn_mock_neuron_with_models(json!([
-        {"id": "model-x", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null},
+        {"id": "model-x", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null},
-        {"id": "model-y", "harness": "candle", "status": "loaded", "devices": [1], "vram_used_mb": null}
+        {"id": "model-y", "harness": "mistralrs", "status": "loaded", "devices": [1], "vram_used_mb": null}
    ]))
    .await;
@@ -152,8 +152,8 @@ async fn test_poller_marks_unreachable_node_unhealthy() {
 #[tokio::test]
 async fn test_poller_removes_stale_models() {
    let mock_url = common::spawn_mock_neuron_with_models(json!([
-        {"id": "keep-me", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null},
+        {"id": "keep-me", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null},
-        {"id": "drop-me", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null}
+        {"id": "drop-me", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null}
    ]))
    .await;
@@ -183,7 +183,7 @@ async fn test_poller_removes_stale_models() {
    // New mock with only one model.
    let new_mock_url = common::spawn_mock_neuron_with_models(json!([
-        {"id": "keep-me", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null}
+        {"id": "keep-me", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null}
    ]))
    .await;
--- a/crates/cortex-gateway/tests/streaming.rs
+++ b/crates/cortex-gateway/tests/streaming.rs
@@ -51,18 +51,18 @@ async fn test_streaming_sse_passthrough() {
    }
    assert!(
-        chunks.len() > chunk_count,
+        chunks.len() >= chunk_count + 1,
-        "expected more than {} chunks (got {}): {:?}",
+        "expected at least {} chunks (got {}): {:?}",
-        chunk_count,
+        chunk_count + 1,
        chunks.len(),
        chunks,
    );
    assert_eq!(chunks.last().unwrap(), "[DONE]");
-    for (i, chunk) in chunks.iter().enumerate().take(chunk_count) {
+    for i in 0..chunk_count {
        let chunk_json: serde_json::Value =
-            serde_json::from_str(chunk).expect("chunk should be valid JSON");
+            serde_json::from_str(&chunks[i]).expect("chunk should be valid JSON");
        assert_eq!(
            chunk_json["choices"][0]["delta"]["content"],
            format!("token{i}")
--- a/crates/neuron/Cargo.toml
+++ b/crates/neuron/Cargo.toml
@@ -12,30 +12,6 @@ path = "src/lib.rs"
 name = "neuron"
 path = "src/main.rs"
 [features]
 default = []
 # Enables CUDA acceleration in candle. Without this feature, candle
 # compiles for CPU only and Device::new_cuda calls fall back to CPU.
 cuda = [
    "candle-core/cuda",
    "candle-nn/cuda",
    "candle-transformers/cuda",
 ]
 # Use cuDNN for convolution / attention kernels. Requires CUDA.
 cudnn = [
    "cuda",
    "candle-core/cudnn",
    "candle-nn/cudnn",
    "candle-transformers/cudnn",
 ]
 # FlashAttention kernels. Requires CUDA.
 flash-attn = [
    "cuda",
    "candle-transformers/flash-attn",
 ]
 # Reserved for GPU-only integration tests in later stages.
 cuda-integration = ["cuda"]
 [dependencies]
 cortex-core.workspace = true
 tokio.workspace = true
@@ -48,21 +24,9 @@ tracing-subscriber.workspace = true
 anyhow.workspace = true
 async-trait.workspace = true
 clap.workspace = true
 thiserror.workspace = true
 futures.workspace = true
 tokio-stream.workspace = true
 figment.workspace = true
 toml.workspace = true
 # candle for in-process inference. CUDA support is gated behind the
 # crate's `cuda` feature (default off) so the workspace builds on
 # non-CUDA hosts and CI runners.
 candle-core = "0.10.2"
 candle-nn = "0.10.2"
 candle-transformers = "0.10.2"
 tokenizers = { version = "0.22", default-features = false, features = ["onig"] }
 hf-hub = { version = "0.4", features = ["tokio"] }
 [dev-dependencies]
 tokio = { workspace = true, features = ["test-util"] }
 reqwest.workspace = true
--- a/crates/neuron/src/api.rs
+++ b/crates/neuron/src/api.rs
@@ -1,33 +1,23 @@
 //! HTTP API handlers for the neuron daemon.
 use crate::harness::HarnessRegistry;
 use crate::harness::candle::{CandleHarness, InferenceError};
 use crate::health::HealthCache;
 use axum::Router;
 use axum::extract::{Path, State};
 use axum::http::StatusCode;
 use axum::response::sse::{Event, KeepAlive, Sse};
 use axum::response::{IntoResponse, Json};
 use axum::routing::{get, post};
 use cortex_core::discovery::{DiscoveryResponse, HealthResponse};
 use cortex_core::harness::ModelSpec;
 use cortex_core::openai::ChatCompletionRequest;
 use futures::stream::{self, StreamExt};
 use serde_json::{Value, json};
 use std::convert::Infallible;
 use std::sync::Arc;
 use tokio::sync::RwLock;
 use tokio_stream::wrappers::ReceiverStream;
 /// Shared state for the neuron HTTP server.
 pub struct NeuronState {
    pub discovery: DiscoveryResponse,
    pub health_cache: Arc<HealthCache>,
    pub registry: RwLock<HarnessRegistry>,
    /// Typed handle to the candle harness for inference routes. Cached at
    /// startup so `/v1/chat/completions` doesn't have to hold the registry
    /// read lock or perform dyn-Trait dispatch per request.
    pub candle: Option<Arc<CandleHarness>>,
 }
 /// Build the neuron API router.
@@ -39,7 +29,6 @@ pub fn neuron_routes() -> Router<Arc<NeuronState>> {
        .route("/models/load", post(load_model))
        .route("/models/unload", post(unload_model))
        .route("/models/{model_id}/endpoint", get(model_endpoint))
        .route("/v1/chat/completions", post(chat_completions))
 }
 async fn discovery_handler(State(state): State<Arc<NeuronState>>) -> Json<DiscoveryResponse> {
@@ -56,7 +45,7 @@ async fn list_models(State(state): State<Arc<NeuronState>>) -> impl IntoResponse
        Ok(models) => Json(json!(models)).into_response(),
        Err(e) => (
            StatusCode::INTERNAL_SERVER_ERROR,
-            Json(json!({"error": format!("{e:#}")})),
+            Json(json!({"error": e.to_string()})),
        )
            .into_response(),
    }
@@ -69,22 +58,11 @@ async fn load_model(
    let registry = state.registry.read().await;
    match registry.load_model(&spec).await {
        Ok(()) => Json(json!({"status": "loaded"})).into_response(),
-        Err(e) => {
+        Err(e) => (
-            // Log the full anyhow chain server-side so journalctl shows
+            StatusCode::BAD_REQUEST,
-            // the underlying failure (hf-hub timeout, permission denied,
+            Json(json!({"error": e.to_string()})),
-            // disk full, etc.) without needing to inspect the HTTP
+        )
-            // response body separately.
+            .into_response(),
            tracing::warn!(
                model = %spec.model_id,
                error = %format!("{e:#}"),
                "load_model failed"
            );
            (
                StatusCode::BAD_REQUEST,
                Json(json!({"error": format!("{e:#}")})),
            )
                .into_response()
        }
    }
 }
@@ -106,11 +84,7 @@ async fn unload_model(
    let registry = state.registry.read().await;
    match registry.unload_model(&model_id).await {
        Ok(()) => Json(json!({"status": "unloaded"})).into_response(),
-        Err(e) => (
+        Err(e) => (StatusCode::NOT_FOUND, Json(json!({"error": e.to_string()}))).into_response(),
            StatusCode::NOT_FOUND,
            Json(json!({"error": format!("{e:#}")})),
        )
            .into_response(),
    }
 }
@@ -128,61 +102,3 @@ async fn model_endpoint(
            .into_response(),
    }
 }
 /// OpenAI-compatible chat completions. Dispatches to streaming SSE when
 /// `stream: true` is set on the request; otherwise returns a single
 /// `ChatCompletionResponse`.
 async fn chat_completions(
    State(state): State<Arc<NeuronState>>,
    Json(req): Json<ChatCompletionRequest>,
 ) -> impl IntoResponse {
    let Some(candle) = state.candle.as_ref().map(Arc::clone) else {
        return (
            StatusCode::SERVICE_UNAVAILABLE,
            Json(json!({"error": "candle harness not enabled on this neuron"})),
        )
            .into_response();
    };
    if req.stream.unwrap_or(false) {
        match candle.chat_completion_stream(req).await {
            Ok(rx) => {
                // Each chunk → one SSE `data: {json}` line. After the
                // channel closes, append the OpenAI [DONE] terminator.
                let body_stream = ReceiverStream::new(rx).map(|chunk| {
                    let body = serde_json::to_string(&chunk).unwrap_or_default();
                    Ok::<_, Infallible>(Event::default().data(body))
                });
                let done_stream =
                    stream::once(async { Ok::<_, Infallible>(Event::default().data("[DONE]")) });
                Sse::new(body_stream.chain(done_stream))
                    .keep_alive(KeepAlive::default())
                    .into_response()
            }
            Err(InferenceError::ModelNotLoaded(id)) => (
                StatusCode::NOT_FOUND,
                Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
            )
                .into_response(),
            Err(InferenceError::Other(e)) => (
                StatusCode::INTERNAL_SERVER_ERROR,
                Json(json!({"error": format!("{e:#}")})),
            )
                .into_response(),
        }
    } else {
        match candle.chat_completion(req).await {
            Ok(resp) => Json(resp).into_response(),
            Err(InferenceError::ModelNotLoaded(id)) => (
                StatusCode::NOT_FOUND,
                Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
            )
                .into_response(),
            Err(InferenceError::Other(e)) => (
                StatusCode::INTERNAL_SERVER_ERROR,
                Json(json!({"error": format!("{e:#}")})),
            )
                .into_response(),
        }
    }
 }
--- a/crates/neuron/src/config.rs
+++ b/crates/neuron/src/config.rs
@@ -1,12 +1,12 @@
 //! Neuron configuration loaded from neuron.toml.
-use cortex_core::harness::{HarnessConfig, ModelSpec};
+use cortex_core::harness::HarnessConfig;
 use figment::{
    Figment,
    providers::{Env, Format, Toml},
 };
 use serde::{Deserialize, Serialize};
-use std::path::{Path, PathBuf};
+use std::path::Path;
 #[derive(Debug, Clone, Serialize, Deserialize)]
 pub struct NeuronConfig {
@@ -14,35 +14,10 @@ pub struct NeuronConfig {
    pub port: u16,
    #[serde(default)]
    pub harnesses: Vec<HarnessConfig>,
    /// Per-harness configuration. Currently only `candle` is recognised.
    #[serde(default)]
    pub harness: HarnessSettings,
    /// Models to auto-load when the neuron service activates. Each entry
    /// is loaded sequentially before the HTTP listener binds. A failure
    /// on any single entry logs a warning and proceeds — broken entries
    /// don't prevent the rest of the fleet from starting.
    #[serde(default)]
    pub default_models: Vec<ModelSpec>,
 }
 /// Settings for individual harness implementations. Each harness owns
 /// its own sub-table so users only configure the harnesses they enable.
 #[derive(Debug, Clone, Default, Serialize, Deserialize)]
 pub struct HarnessSettings {
    #[serde(default)]
    pub candle: CandleHarnessConfig,
 }
 #[derive(Debug, Clone, Default, Serialize, Deserialize)]
 pub struct CandleHarnessConfig {
    /// HuggingFace cache directory for model weights.
    /// When unset, defers to hf-hub's default (~/.cache/huggingface).
    #[serde(default)]
    pub hf_cache: Option<PathBuf>,
 }
 fn default_port() -> u16 {
-    13131
+    9090
 }
 impl NeuronConfig {
@@ -58,10 +33,8 @@ impl NeuronConfig {
 impl Default for NeuronConfig {
    fn default() -> Self {
        Self {
-            port: 13131,
+            port: 9090,
            harnesses: vec![],
            harness: HarnessSettings::default(),
            default_models: vec![],
        }
    }
 }
--- a/crates/neuron/src/harness/candle.rs
+++ b/crates/neuron/src/harness/candle.rs
@@ -1,687 +0,0 @@
 //! Candle harness — in-process inference using huggingface/candle.
 //!
 //! This is the sole `Harness` implementation. Inference runs inside
 //! the neuron process; there is no external subprocess.
 //!
 //! - Stage 2 wired GGUF (Qwen3 only) load/unload via `quantized_qwen3`.
 //! - Stage 3 (this) adds `chat_completion` — a non-streaming OpenAI
 //!   compatible chat completion routed to the loaded model's forward
 //!   pass on a per-model serialised generation loop.
 use anyhow::{Context, Result};
 use async_trait::async_trait;
 use candle_core::quantized::gguf_file;
 use candle_core::{Device, Tensor};
 use candle_transformers::generation::{LogitsProcessor, Sampling};
 use candle_transformers::models::quantized_qwen3::ModelWeights as QuantizedQwen3Weights;
 use cortex_core::harness::{Harness, HarnessHealth, ModelInfo, ModelSpec};
 use cortex_core::openai::{
    ChatCompletionChoice, ChatCompletionChunk, ChatCompletionRequest, ChatCompletionResponse,
    ChatMessage, ChunkChoice, MessageContent, Usage,
 };
 use serde_json::json;
 use std::collections::HashMap;
 use std::path::PathBuf;
 use std::sync::Arc;
 use std::time::{SystemTime, UNIX_EPOCH};
 use tokenizers::Tokenizer;
 use tokio::sync::{Mutex, RwLock, mpsc};
 /// In-process candle harness. Owns the loaded model registry.
 pub struct CandleHarness {
    models: Arc<RwLock<HashMap<String, Arc<LoadedModel>>>>,
    hf_cache: Option<PathBuf>,
    bind_url: String,
 }
 /// A loaded model with its tokenizer, device placement, and architecture-
 /// specific weights. The `arch` is `Arc<Mutex<>>` so the lock guard can be
 /// moved into `spawn_blocking` for synchronous candle forward passes.
 pub struct LoadedModel {
    pub model_id: String,
    pub arch: Arc<Mutex<ModelArch>>,
    pub tokenizer: Tokenizer,
    pub device: Device,
    pub quant: Option<String>,
    pub devices: Vec<u32>,
 }
 /// Architecture-specific weights. Stage 3 still supports only Qwen3
 /// quantized; Stage 8 broadens this to additional families and
 /// non-quantized variants.
 pub enum ModelArch {
    Qwen3Quantized(QuantizedQwen3Weights),
 }
 impl CandleHarness {
    pub fn new(bind_url: String, hf_cache: Option<PathBuf>) -> Self {
        Self {
            models: Arc::new(RwLock::new(HashMap::new())),
            hf_cache,
            bind_url,
        }
    }
    /// Pick a candle `Device` for the requested indices. Without the
    /// `cuda` feature, or if CUDA initialisation fails, falls back to CPU.
    fn pick_device(devices: &[u32]) -> Result<Device> {
        let _idx = devices.first().copied().unwrap_or(0) as usize;
        #[cfg(feature = "cuda")]
        {
            match Device::new_cuda(_idx) {
                Ok(d) => return Ok(d),
                Err(e) => tracing::warn!(
                    device = _idx,
                    error = %e,
                    "CUDA device unavailable, falling back to CPU"
                ),
            }
        }
        Ok(Device::Cpu)
    }
    /// Resolve a model spec to local GGUF and tokenizer file paths via
    /// hf-hub. Downloads on first use; subsequent calls are cached.
    async fn resolve_files(&self, spec: &ModelSpec) -> Result<(PathBuf, PathBuf)> {
        let mut builder = hf_hub::api::tokio::ApiBuilder::new();
        if let Some(cache) = &self.hf_cache {
            builder = builder.with_cache_dir(cache.clone());
        }
        let api = builder.build().context("build hf-hub API")?;
        let repo = api.model(spec.model_id.clone());
        let info = repo
            .info()
            .await
            .with_context(|| format!("fetch HF repo info for {}", spec.model_id))?;
        let quant = spec.quant.as_deref().unwrap_or("");
        let quant_lc = quant.to_lowercase();
        let gguf_filename = info
            .siblings
            .iter()
            .map(|s| s.rfilename.as_str())
            .filter(|name| name.to_lowercase().ends_with(".gguf"))
            .find(|name| quant_lc.is_empty() || name.to_lowercase().contains(&quant_lc))
            .ok_or_else(|| {
                anyhow::anyhow!(
                    "no GGUF file matching quant {:?} in repo {}",
                    spec.quant,
                    spec.model_id
                )
            })?
            .to_string();
        tracing::info!(
            model = %spec.model_id,
            file = %gguf_filename,
            "resolving GGUF (may be cached)"
        );
        let gguf_path = repo
            .get(&gguf_filename)
            .await
            .with_context(|| format!("fetch GGUF {gguf_filename}"))?;
        // GGUF-only HF repos (unsloth/Qwen3-*-GGUF, Qwen/Qwen3-*-GGUF,
        // etc.) ship the .gguf file but not tokenizer.json — the
        // tokenizer.json lives in the base non-GGUF repo. Derive the
        // base repo id by stripping a `-GGUF` / `-gguf` suffix; if
        // there's no such suffix the same repo is used (works for
        // non-GGUF model_ids).
        let tokenizer_repo_id = spec
            .model_id
            .strip_suffix("-GGUF")
            .or_else(|| spec.model_id.strip_suffix("-gguf"))
            .unwrap_or(spec.model_id.as_str())
            .to_string();
        let tokenizer_repo = if tokenizer_repo_id == spec.model_id {
            repo
        } else {
            tracing::debug!(
                from = %spec.model_id,
                to = %tokenizer_repo_id,
                "tokenizer.json sourced from base repo (GGUF suffix stripped)"
            );
            api.model(tokenizer_repo_id.clone())
        };
        let tokenizer_path = tokenizer_repo
            .get("tokenizer.json")
            .await
            .with_context(|| format!("fetch tokenizer.json from {tokenizer_repo_id}"))?;
        Ok((gguf_path, tokenizer_path))
    }
    /// Run a non-streaming chat completion against a loaded model.
    ///
    /// Returns a typed `InferenceError` when the model isn't loaded so the
    /// handler can map to an appropriate HTTP status without string-matching.
    pub async fn chat_completion(
        &self,
        request: ChatCompletionRequest,
    ) -> Result<ChatCompletionResponse, InferenceError> {
        let loaded = {
            let models = self.models.read().await;
            models.get(&request.model).cloned()
        };
        let loaded = loaded.ok_or_else(|| InferenceError::ModelNotLoaded(request.model.clone()))?;
        let prompt = format_qwen3_prompt(&request.messages);
        let encoding = loaded
            .tokenizer
            .encode(prompt.as_str(), true)
            .map_err(|e| InferenceError::Other(anyhow::anyhow!("tokenize: {e}")))?;
        let prompt_tokens: Vec<u32> = encoding.get_ids().to_vec();
        let prompt_len = prompt_tokens.len();
        let temperature = request.temperature.unwrap_or(0.7);
        let top_p = request.top_p;
        let max_new = request.max_tokens.unwrap_or(512) as usize;
        let seed = unix_subsec_nanos();
        let eos_id = loaded
            .tokenizer
            .token_to_id("<|im_end|>")
            .or_else(|| loaded.tokenizer.token_to_id("<|endoftext|>"));
        let arch_arc = Arc::clone(&loaded.arch);
        let device = loaded.device.clone();
        let model_id = request.model.clone();
        let (generated_ids, finish_reason) =
            tokio::task::spawn_blocking(move || -> Result<(Vec<u32>, String)> {
                let mut guard = arch_arc.blocking_lock();
                run_inference(
                    &mut guard,
                    &device,
                    &prompt_tokens,
                    max_new,
                    temperature,
                    top_p,
                    seed,
                    eos_id,
                )
            })
            .await
            .map_err(|e| InferenceError::Other(anyhow::anyhow!("inference task panicked: {e}")))?
            .map_err(InferenceError::Other)?;
        let completion_text = loaded
            .tokenizer
            .decode(&generated_ids, true)
            .map_err(|e| InferenceError::Other(anyhow::anyhow!("detokenize: {e}")))?;
        let usage = Usage {
            prompt_tokens: prompt_len as u64,
            completion_tokens: generated_ids.len() as u64,
            total_tokens: (prompt_len + generated_ids.len()) as u64,
        };
        Ok(ChatCompletionResponse {
            id: format!("chatcmpl-{:x}", unix_subsec_nanos()),
            object: "chat.completion".into(),
            created: unix_now_secs(),
            model: model_id,
            choices: vec![ChatCompletionChoice {
                index: 0,
                message: ChatMessage {
                    role: "assistant".into(),
                    content: MessageContent::Text(completion_text),
                    extra: serde_json::Value::Object(Default::default()),
                },
                finish_reason: Some(finish_reason),
                extra: serde_json::Value::Object(Default::default()),
            }],
            usage: Some(usage),
            extra: serde_json::Value::Object(Default::default()),
        })
    }
    /// Run a streaming chat completion against a loaded model.
    ///
    /// Returns an `mpsc::Receiver` that yields `ChatCompletionChunk`s in
    /// OpenAI SSE format. The first chunk carries the assistant role;
    /// subsequent chunks carry incremental `content` deltas; the final
    /// chunk carries `finish_reason`. The handler is responsible for
    /// wrapping these into an SSE response and appending the `[DONE]`
    /// terminator.
    ///
    /// Token-by-token decoding tracks the cumulative decoded prefix so
    /// BPE byte-fallback boundaries don't split a UTF-8 char across
    /// chunks.
    pub async fn chat_completion_stream(
        &self,
        request: ChatCompletionRequest,
    ) -> Result<mpsc::Receiver<ChatCompletionChunk>, InferenceError> {
        let loaded = {
            let models = self.models.read().await;
            models.get(&request.model).cloned()
        };
        let loaded = loaded.ok_or_else(|| InferenceError::ModelNotLoaded(request.model.clone()))?;
        let prompt = format_qwen3_prompt(&request.messages);
        let encoding = loaded
            .tokenizer
            .encode(prompt.as_str(), true)
            .map_err(|e| InferenceError::Other(anyhow::anyhow!("tokenize: {e}")))?;
        let prompt_tokens: Vec<u32> = encoding.get_ids().to_vec();
        let temperature = request.temperature.unwrap_or(0.7);
        let top_p = request.top_p;
        let max_new = request.max_tokens.unwrap_or(512) as usize;
        let seed = unix_subsec_nanos();
        let eos_id = loaded
            .tokenizer
            .token_to_id("<|im_end|>")
            .or_else(|| loaded.tokenizer.token_to_id("<|endoftext|>"));
        let arch_arc = Arc::clone(&loaded.arch);
        let device = loaded.device.clone();
        let tokenizer = loaded.tokenizer.clone();
        let model_id = request.model.clone();
        let id = format!("chatcmpl-{:x}", unix_subsec_nanos());
        let created = unix_now_secs();
        // Bounded channel so the producer (blocking inference) is back-
        // pressured by the consumer (SSE writer). 32 is generous —
        // tokens arrive one at a time and the SSE writer is async.
        let (tx, rx) = mpsc::channel::<ChatCompletionChunk>(32);
        // Lead chunk: announce the assistant role per OpenAI streaming
        // conventions. Tools that auto-detect a streaming reply expect
        // this before any content delta.
        let role_chunk = ChatCompletionChunk {
            id: id.clone(),
            object: "chat.completion.chunk".into(),
            created,
            model: model_id.clone(),
            choices: vec![ChunkChoice {
                index: 0,
                delta: json!({"role": "assistant"}),
                finish_reason: None,
                extra: serde_json::Value::Object(Default::default()),
            }],
            usage: None,
            extra: serde_json::Value::Object(Default::default()),
        };
        // If sending the role chunk fails the receiver is already gone;
        // bail before kicking off the heavy blocking work.
        tx.send(role_chunk)
            .await
            .map_err(|_| InferenceError::Other(anyhow::anyhow!("client disconnected")))?;
        tokio::task::spawn_blocking(move || {
            let mut guard = arch_arc.blocking_lock();
            if let Err(e) = run_inference_streaming(
                &mut guard,
                &device,
                &tokenizer,
                &prompt_tokens,
                max_new,
                temperature,
                top_p,
                seed,
                eos_id,
                &id,
                created,
                &model_id,
                &tx,
            ) {
                tracing::warn!(model = %model_id, error = %e, "streaming inference failed");
            }
        });
        Ok(rx)
    }
 }
 #[async_trait]
 impl Harness for CandleHarness {
    fn name(&self) -> &str {
        "candle"
    }
    async fn health(&self) -> HarnessHealth {
        HarnessHealth {
            name: "candle".into(),
            running: true,
            uptime_secs: None,
        }
    }
    async fn list_models(&self) -> Result<Vec<ModelInfo>> {
        let models = self.models.read().await;
        Ok(models
            .values()
            .map(|m| ModelInfo {
                id: m.model_id.clone(),
                harness: "candle".into(),
                status: "loaded".into(),
                devices: m.devices.clone(),
                vram_used_mb: None,
            })
            .collect())
    }
    async fn load_model(&self, spec: &ModelSpec) -> Result<()> {
        if spec.harness != "candle" {
            anyhow::bail!("expected harness=candle, got harness={}", spec.harness);
        }
        {
            let models = self.models.read().await;
            if models.contains_key(&spec.model_id) {
                anyhow::bail!("model '{}' already loaded", spec.model_id);
            }
        }
        let devices = spec.devices.clone().unwrap_or_else(|| vec![0]);
        let device = Self::pick_device(&devices)?;
        let (gguf_path, tokenizer_path) = self.resolve_files(spec).await?;
        let tokenizer = Tokenizer::from_file(&tokenizer_path)
            .map_err(|e| anyhow::anyhow!("load tokenizer: {e}"))?;
        // File I/O + GGUF parsing + tensor materialisation are CPU-bound,
        // so run them on a blocking task to avoid stalling the runtime.
        let device_for_load = device.clone();
        let gguf_path_for_load = gguf_path.clone();
        let model_id_for_log = spec.model_id.clone();
        let arch = tokio::task::spawn_blocking(move || -> Result<ModelArch> {
            tracing::info!(model = %model_id_for_log, path = ?gguf_path_for_load, "loading GGUF");
            let mut file = std::fs::File::open(&gguf_path_for_load).context("open GGUF file")?;
            let content = gguf_file::Content::read(&mut file)
                .map_err(|e| anyhow::anyhow!("parse GGUF: {e}"))?;
            let architecture = content
                .metadata
                .get("general.architecture")
                .and_then(|v| v.to_string().ok().cloned())
                .unwrap_or_default();
            tracing::info!(architecture = %architecture, "GGUF architecture");
            match architecture.as_str() {
                "qwen3" => {
                    let weights =
                        QuantizedQwen3Weights::from_gguf(content, &mut file, &device_for_load)
                            .map_err(|e| anyhow::anyhow!("from_gguf qwen3: {e}"))?;
                    Ok(ModelArch::Qwen3Quantized(weights))
                }
                other => anyhow::bail!(
                    "unsupported GGUF architecture '{other}'; Stage 3 only supports qwen3"
                ),
            }
        })
        .await
        .context("blocking load task panicked")??;
        let loaded = Arc::new(LoadedModel {
            model_id: spec.model_id.clone(),
            arch: Arc::new(Mutex::new(arch)),
            tokenizer,
            device,
            quant: spec.quant.clone(),
            devices,
        });
        let mut models = self.models.write().await;
        models.insert(spec.model_id.clone(), loaded);
        tracing::info!(model = %spec.model_id, "model loaded");
        Ok(())
    }
    async fn unload_model(&self, model_id: &str) -> Result<()> {
        let mut models = self.models.write().await;
        if models.remove(model_id).is_none() {
            anyhow::bail!("model '{model_id}' not loaded");
        }
        tracing::info!(model = %model_id, "model unloaded");
        Ok(())
    }
    async fn inference_endpoint(&self, model_id: &str) -> Option<String> {
        let models = self.models.read().await;
        models.contains_key(model_id).then(|| self.bind_url.clone())
    }
 }
 /// Errors returned by `CandleHarness::chat_completion`. The
 /// `ModelNotLoaded` variant lets the HTTP handler map cleanly to 404
 /// without string-matching on anyhow messages.
 #[derive(Debug, thiserror::Error)]
 pub enum InferenceError {
    #[error("model '{0}' not loaded on this neuron")]
    ModelNotLoaded(String),
    #[error(transparent)]
    Other(#[from] anyhow::Error),
 }
 /// Apply the Qwen3 chat template:
 ///
 /// ```text
 /// <|im_start|>{role}\n{content}<|im_end|>\n
 /// ...
 /// <|im_start|>assistant\n
 /// ```
 ///
 /// The trailing `<|im_start|>assistant\n` cues the model to begin a turn.
 /// Non-text content parts (vision blocks) are joined as text only; full
 /// multimodal handling is out of scope for Stage 3.
 fn format_qwen3_prompt(messages: &[ChatMessage]) -> String {
    let mut prompt = String::new();
    for msg in messages {
        let content = match &msg.content {
            MessageContent::Text(s) => s.clone(),
            MessageContent::Parts(parts) => parts
                .iter()
                .filter_map(|p| p.get("text").and_then(|v| v.as_str()))
                .collect::<Vec<_>>()
                .join(""),
        };
        prompt.push_str("<|im_start|>");
        prompt.push_str(&msg.role);
        prompt.push('\n');
        prompt.push_str(&content);
        prompt.push_str("<|im_end|>\n");
    }
    prompt.push_str("<|im_start|>assistant\n");
    prompt
 }
 #[allow(clippy::too_many_arguments)]
 fn run_inference(
    arch: &mut ModelArch,
    device: &Device,
    prompt_tokens: &[u32],
    max_new: usize,
    temperature: f64,
    top_p: Option<f64>,
    seed: u64,
    eos_id: Option<u32>,
 ) -> Result<(Vec<u32>, String)> {
    let mut logits_processor = {
        let sampling = if temperature <= 0.0 {
            Sampling::ArgMax
        } else {
            match top_p {
                Some(p) => Sampling::TopP { p, temperature },
                None => Sampling::All { temperature },
            }
        };
        LogitsProcessor::from_sampling(seed, sampling)
    };
    let mut generated: Vec<u32> = Vec::new();
    let mut next_token = match arch {
        ModelArch::Qwen3Quantized(model) => {
            model.clear_kv_cache();
            let input = Tensor::new(prompt_tokens, device)?.unsqueeze(0)?;
            let logits = model.forward(&input, 0)?;
            let logits = logits.squeeze(0)?;
            logits_processor.sample(&logits)?
        }
    };
    if Some(next_token) == eos_id {
        return Ok((generated, "stop".into()));
    }
    generated.push(next_token);
    for index in 0..max_new.saturating_sub(1) {
        next_token = match arch {
            ModelArch::Qwen3Quantized(model) => {
                let input = Tensor::new(&[next_token], device)?.unsqueeze(0)?;
                let logits = model.forward(&input, prompt_tokens.len() + index)?;
                let logits = logits.squeeze(0)?;
                logits_processor.sample(&logits)?
            }
        };
        if Some(next_token) == eos_id {
            return Ok((generated, "stop".into()));
        }
        generated.push(next_token);
    }
    Ok((generated, "length".into()))
 }
 /// Streaming counterpart to `run_inference`. Emits chunks via `tx` as
 /// tokens are generated and exits on EOS, max_new, or receiver drop.
 ///
 /// Detokenization tracks the cumulative decoded prefix so each chunk's
 /// `content` delta is the substring appended since the last chunk —
 /// safe across BPE byte-fallback boundaries.
 #[allow(clippy::too_many_arguments)]
 fn run_inference_streaming(
    arch: &mut ModelArch,
    device: &Device,
    tokenizer: &Tokenizer,
    prompt_tokens: &[u32],
    max_new: usize,
    temperature: f64,
    top_p: Option<f64>,
    seed: u64,
    eos_id: Option<u32>,
    id: &str,
    created: u64,
    model_id: &str,
    tx: &mpsc::Sender<ChatCompletionChunk>,
 ) -> Result<()> {
    let mut logits_processor = {
        let sampling = if temperature <= 0.0 {
            Sampling::ArgMax
        } else {
            match top_p {
                Some(p) => Sampling::TopP { p, temperature },
                None => Sampling::All { temperature },
            }
        };
        LogitsProcessor::from_sampling(seed, sampling)
    };
    let mut all_tokens: Vec<u32> = Vec::new();
    let mut decoded_prefix = String::new();
    let mut finish_reason = "length".to_string();
    let mut next_token = match arch {
        ModelArch::Qwen3Quantized(model) => {
            model.clear_kv_cache();
            let input = Tensor::new(prompt_tokens, device)?.unsqueeze(0)?;
            let logits = model.forward(&input, 0)?;
            let logits = logits.squeeze(0)?;
            logits_processor.sample(&logits)?
        }
    };
    let emit_token = |all_tokens: &[u32], decoded_prefix: &mut String| -> Result<bool> {
        let full = tokenizer
            .decode(all_tokens, true)
            .map_err(|e| anyhow::anyhow!("decode: {e}"))?;
        if full.len() > decoded_prefix.len() {
            let delta = full[decoded_prefix.len()..].to_string();
            *decoded_prefix = full;
            let chunk = ChatCompletionChunk {
                id: id.into(),
                object: "chat.completion.chunk".into(),
                created,
                model: model_id.into(),
                choices: vec![ChunkChoice {
                    index: 0,
                    delta: json!({ "content": delta }),
                    finish_reason: None,
                    extra: serde_json::Value::Object(Default::default()),
                }],
                usage: None,
                extra: serde_json::Value::Object(Default::default()),
            };
            // blocking_send returns Err if the consumer hung up — signal
            // the caller to stop generating.
            if tx.blocking_send(chunk).is_err() {
                return Ok(false);
            }
        }
        Ok(true)
    };
    if Some(next_token) == eos_id {
        finish_reason = "stop".into();
    } else {
        all_tokens.push(next_token);
        if !emit_token(&all_tokens, &mut decoded_prefix)? {
            return Ok(());
        }
        for index in 0..max_new.saturating_sub(1) {
            next_token = match arch {
                ModelArch::Qwen3Quantized(model) => {
                    let input = Tensor::new(&[next_token], device)?.unsqueeze(0)?;
                    let logits = model.forward(&input, prompt_tokens.len() + index)?;
                    let logits = logits.squeeze(0)?;
                    logits_processor.sample(&logits)?
                }
            };
            if Some(next_token) == eos_id {
                finish_reason = "stop".into();
                break;
            }
            all_tokens.push(next_token);
            if !emit_token(&all_tokens, &mut decoded_prefix)? {
                return Ok(());
            }
        }
    }
    let final_chunk = ChatCompletionChunk {
        id: id.into(),
        object: "chat.completion.chunk".into(),
        created,
        model: model_id.into(),
        choices: vec![ChunkChoice {
            index: 0,
            delta: serde_json::Value::Object(Default::default()),
            finish_reason: Some(finish_reason),
            extra: serde_json::Value::Object(Default::default()),
        }],
        usage: None,
        extra: serde_json::Value::Object(Default::default()),
    };
    let _ = tx.blocking_send(final_chunk);
    Ok(())
 }
 fn unix_now_secs() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
 }
 fn unix_subsec_nanos() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos() as u64)
        .unwrap_or(0)
 }
--- a/crates/neuron/src/harness/llamacpp.rs
+++ b/crates/neuron/src/harness/llamacpp.rs
@@ -0,0 +1 @@
 // llama.cpp harness implementation — Phase 11.
--- a/crates/neuron/src/harness/mistralrs.rs
+++ b/crates/neuron/src/harness/mistralrs.rs
@@ -0,0 +1,163 @@
 //! mistral.rs harness implementation.
 //!
 //! Wraps the mistral.rs HTTP API for model lifecycle management
 //! and optionally manages the process via systemd.
 use anyhow::Result;
 use async_trait::async_trait;
 use cortex_core::harness::{Harness, HarnessConfig, HarnessHealth, ModelInfo, ModelSpec};
 use reqwest::Client;
 use serde::Deserialize;
 pub struct MistralRsHarness {
    endpoint: String,
    systemd_unit: Option<String>,
    client: Client,
 }
 impl MistralRsHarness {
    pub fn new(endpoint: String, systemd_unit: Option<String>) -> Self {
        Self {
            endpoint,
            systemd_unit,
            client: Client::builder()
                .timeout(std::time::Duration::from_secs(30))
                .build()
                .expect("failed to build HTTP client"),
        }
    }
 }
 /// Response from mistral.rs `GET /v1/models`.
 #[derive(Debug, Deserialize)]
 struct ModelsResponse {
    data: Vec<ModelEntry>,
 }
 #[derive(Debug, Deserialize)]
 struct ModelEntry {
    id: String,
    #[serde(default)]
    status: Option<String>,
 }
 #[async_trait]
 impl Harness for MistralRsHarness {
    fn name(&self) -> &str {
        "mistralrs"
    }
    async fn start(&self, _config: &HarnessConfig) -> Result<()> {
        let Some(unit) = &self.systemd_unit else {
            anyhow::bail!("no systemd unit configured for mistralrs harness");
        };
        let output = tokio::process::Command::new("systemctl")
            .args(["start", unit])
            .output()
            .await?;
        if !output.status.success() {
            let stderr = String::from_utf8_lossy(&output.stderr);
            anyhow::bail!("systemctl start {unit} failed: {stderr}");
        }
        // Wait for the health endpoint to respond (up to 30s).
        let url = format!("{}/health", self.endpoint);
        for _ in 0..30 {
            tokio::time::sleep(std::time::Duration::from_secs(1)).await;
            if self.client.get(&url).send().await.is_ok() {
                tracing::info!(unit, "mistralrs started and healthy");
                return Ok(());
            }
        }
        anyhow::bail!("mistralrs started but health endpoint did not respond within 30s");
    }
    async fn stop(&self) -> Result<()> {
        let Some(unit) = &self.systemd_unit else {
            anyhow::bail!("no systemd unit configured for mistralrs harness");
        };
        let output = tokio::process::Command::new("systemctl")
            .args(["stop", unit])
            .output()
            .await?;
        if !output.status.success() {
            let stderr = String::from_utf8_lossy(&output.stderr);
            anyhow::bail!("systemctl stop {unit} failed: {stderr}");
        }
        Ok(())
    }
    async fn health(&self) -> HarnessHealth {
        let url = format!("{}/health", self.endpoint);
        let running = self.client.get(&url).send().await.is_ok();
        HarnessHealth {
            name: "mistralrs".into(),
            running,
            uptime_secs: None,
        }
    }
    async fn list_models(&self) -> Result<Vec<ModelInfo>> {
        let url = format!("{}/v1/models", self.endpoint);
        let resp = self.client.get(&url).send().await?;
        if !resp.status().is_success() {
            anyhow::bail!("GET /v1/models returned {}", resp.status());
        }
        let models_resp: ModelsResponse = resp.json().await?;
        Ok(models_resp
            .data
            .into_iter()
            .map(|m| ModelInfo {
                id: m.id,
                harness: "mistralrs".into(),
                status: m.status.unwrap_or_else(|| "loaded".into()),
                devices: vec![],
                vram_used_mb: None,
            })
            .collect())
    }
    async fn load_model(&self, spec: &ModelSpec) -> Result<()> {
        let url = format!("{}/v1/models/reload", self.endpoint);
        let resp = self
            .client
            .post(&url)
            .json(&serde_json::json!({ "model_id": spec.model_id }))
            .send()
            .await?;
        if !resp.status().is_success() {
            let body = resp.text().await.unwrap_or_default();
            anyhow::bail!("POST /v1/models/reload failed: {body}");
        }
        Ok(())
    }
    async fn unload_model(&self, model_id: &str) -> Result<()> {
        let url = format!("{}/v1/models/unload", self.endpoint);
        let resp = self
            .client
            .post(&url)
            .json(&serde_json::json!({ "model_id": model_id }))
            .send()
            .await?;
        if !resp.status().is_success() {
            let body = resp.text().await.unwrap_or_default();
            anyhow::bail!("POST /v1/models/unload failed: {body}");
        }
        Ok(())
    }
    async fn inference_endpoint(&self, _model_id: &str) -> Option<String> {
        // mistral.rs routes internally by model name in the request body,
        // so the inference endpoint is always the base URL.
        Some(self.endpoint.clone())
    }
 }
--- a/crates/neuron/src/harness/mod.rs
+++ b/crates/neuron/src/harness/mod.rs
@@ -1,22 +1,15 @@
 //! Harness registry — maps harness names to trait implementations.
-pub mod candle;
+pub mod llamacpp;
 pub mod mistralrs;
 use anyhow::Result;
 use cortex_core::harness::{Harness, HarnessConfig, ModelInfo, ModelSpec};
 use std::collections::HashMap;
 use std::sync::Arc;
 /// Registry of available harness implementations.
 ///
 /// Holds an `Arc<dyn Harness>` per harness for generic lifecycle dispatch
 /// (load/unload/list_models). When a candle harness is registered, a typed
 /// `Arc<CandleHarness>` is also cached so inference routes can bypass the
 /// dyn-Trait dispatch and reach harness-specific methods (chat completion,
 /// streaming, etc.).
 pub struct HarnessRegistry {
-    harnesses: HashMap<String, Arc<dyn Harness>>,
+    harnesses: HashMap<String, Box<dyn Harness>>,
    candle: Option<Arc<candle::CandleHarness>>,
 }
 impl Default for HarnessRegistry {
@@ -29,11 +22,10 @@ impl HarnessRegistry {
    pub fn new() -> Self {
        Self {
            harnesses: HashMap::new(),
            candle: None,
        }
    }
-    pub fn register(&mut self, harness: Arc<dyn Harness>) {
+    pub fn register(&mut self, harness: Box<dyn Harness>) {
        self.harnesses.insert(harness.name().to_string(), harness);
    }
@@ -42,12 +34,6 @@ impl HarnessRegistry {
        self.harnesses.keys().cloned().collect()
    }
    /// Typed handle to the candle harness, if registered. Used by inference
    /// routes that need methods beyond the `Harness` trait surface.
    pub fn candle(&self) -> Option<Arc<candle::CandleHarness>> {
        self.candle.clone()
    }
    /// List models from all registered harnesses.
    pub async fn list_all_models(&self) -> Result<Vec<ModelInfo>> {
        let mut all = Vec::new();
@@ -95,25 +81,19 @@ impl HarnessRegistry {
    }
    /// Build a registry from harness configs.
-    ///
+    pub fn from_configs(configs: &[HarnessConfig]) -> Self {
    /// `bind_url` is the URL where this neuron serves inference (its own
    /// listen address). In-process harnesses (currently the only kind)
    /// return this URL from `inference_endpoint`.
    pub fn from_configs(
        configs: &[HarnessConfig],
        bind_url: &str,
        settings: &crate::config::HarnessSettings,
    ) -> Self {
        let mut registry = Self::new();
        for config in configs {
            match config.name.as_str() {
-                "candle" => {
+                "mistralrs" => {
-                    let harness = Arc::new(candle::CandleHarness::new(
+                    if let Some(endpoint) = &config.endpoint {
-                        bind_url.to_string(),
+                        registry.register(Box::new(mistralrs::MistralRsHarness::new(
-                        settings.candle.hf_cache.clone(),
+                            endpoint.clone(),
-                    ));
+                            config.systemd_unit.clone(),
-                    registry.candle = Some(Arc::clone(&harness));
+                        )));
-                    registry.harnesses.insert("candle".into(), harness);
+                    } else {
                        tracing::warn!("mistralrs harness missing endpoint, skipping");
                    }
                }
                other => {
                    tracing::warn!(harness = other, "unknown harness type, skipping");
--- a/crates/neuron/src/lib.rs
+++ b/crates/neuron/src/lib.rs
@@ -3,4 +3,3 @@ pub mod config;
 pub mod discovery;
 pub mod harness;
 pub mod health;
 pub mod startup;
--- a/crates/neuron/src/main.rs
+++ b/crates/neuron/src/main.rs
@@ -1,6 +1,6 @@
 use anyhow::Result;
 use clap::Parser;
-use neuron::{api, config::NeuronConfig, discovery, harness::HarnessRegistry, health, startup};
+use neuron::{api, config::NeuronConfig, discovery, harness::HarnessRegistry, health};
 use std::sync::Arc;
 use std::time::Instant;
 use tokio::sync::RwLock;
@@ -37,7 +37,6 @@ async fn main() -> Result<()> {
    });
    let port = args.port.unwrap_or(cfg.port);
    let bind_url = format!("http://localhost:{port}");
    let start_time = Instant::now();
    tracing::info!("running hardware discovery");
@@ -48,18 +47,9 @@ async fn main() -> Result<()> {
        "discovery complete"
    );
-    // Build harness registry from config. In-process harnesses (candle)
+    // Build harness registry from config.
-    // need to know neuron's own bind URL so they can return it from
+    let registry = HarnessRegistry::from_configs(&cfg.harnesses);
    // inference_endpoint.
    let registry = HarnessRegistry::from_configs(&cfg.harnesses, &bind_url, &cfg.harness);
    discovery_result.harnesses = registry.names();
    let candle = registry.candle();
    // Activation: load default models before binding the listener.
    // Each load may take tens of seconds to several minutes depending
    // on model size and HF cache state — keep TimeoutStartSec in the
    // systemd unit generous enough to cover the slowest entry.
    startup::load_default_models(&registry, &cfg.default_models).await;
    let health_cache = Arc::new(health::HealthCache::new());
    health_cache
@@ -75,24 +65,13 @@ async fn main() -> Result<()> {
        discovery: discovery_result,
        health_cache,
        registry: RwLock::new(registry),
        candle,
    });
-    let app = api::neuron_routes().with_state(Arc::clone(&state));
+    let app = api::neuron_routes().with_state(state);
    let addr: std::net::SocketAddr = format!("0.0.0.0:{port}").parse()?;
    tracing::info!("neuron listening on {addr}");
    let listener = tokio::net::TcpListener::bind(addr).await?;
-    axum::serve(listener, app)
+    axum::serve(listener, app).await?;
        .with_graceful_shutdown(startup::shutdown_signal())
        .await?;
    // Deactivation: serve has returned (graceful shutdown signal
    // received and connections drained). Release CUDA contexts / VRAM
    // by unloading every model before exiting; systemd's TimeoutStopSec
    // bounds how long this phase may take.
    let registry = state.registry.read().await;
    startup::unload_all_models(&registry).await;
    tracing::info!("shutdown complete");
    Ok(())
 }
--- a/crates/neuron/src/startup.rs
+++ b/crates/neuron/src/startup.rs
@@ -1,97 +0,0 @@
 //! Activation- and deactivation-time orchestration.
 //!
 //! Wired from `main.rs` around the HTTP listener — activation runs
 //! before bind, deactivation runs after axum returns from its
 //! graceful-shutdown future. Kept in its own module so the logic is
 //! unit-testable without spinning up a full neuron process.
 use crate::harness::HarnessRegistry;
 use cortex_core::harness::ModelSpec;
 use std::time::Instant;
 use tokio::signal;
 /// Load each spec sequentially against the registry, treating
 /// individual failures as warnings rather than fatal errors.
 ///
 /// VRAM contention makes parallel loads risky; the sequential path is
 /// boring but correct. The function logs elapsed time per load so an
 /// operator can see which model is hogging activation.
 pub async fn load_default_models(registry: &HarnessRegistry, specs: &[ModelSpec]) {
    if specs.is_empty() {
        return;
    }
    tracing::info!(count = specs.len(), "loading default models");
    for spec in specs {
        let start = Instant::now();
        match registry.load_model(spec).await {
            Ok(()) => tracing::info!(
                model = %spec.model_id,
                elapsed_ms = start.elapsed().as_millis() as u64,
                "loaded default model"
            ),
            Err(e) => tracing::warn!(
                model = %spec.model_id,
                error = %e,
                elapsed_ms = start.elapsed().as_millis() as u64,
                "failed to load default model, continuing"
            ),
        }
    }
 }
 /// Future that resolves on SIGINT (Ctrl-C) or SIGTERM (systemd stop).
 ///
 /// Wired into `axum::serve(...).with_graceful_shutdown(shutdown_signal())`
 /// so the HTTP listener stops accepting new connections, lets in-flight
 /// requests drain, and then yields control back to main for cleanup.
 pub async fn shutdown_signal() {
    let ctrl_c = async {
        signal::ctrl_c().await.ok();
    };
    let terminate = async {
        signal::unix::signal(signal::unix::SignalKind::terminate())
            .expect("install SIGTERM handler")
            .recv()
            .await;
    };
    tokio::select! {
        _ = ctrl_c => tracing::info!("received SIGINT, shutting down"),
        _ = terminate => tracing::info!("received SIGTERM, shutting down"),
    }
 }
 /// Unload every model currently registered. Called from `main.rs` after
 /// axum's graceful shutdown future resolves, so CUDA contexts and VRAM
 /// are released before the process exits rather than left to the OS to
 /// reclaim. Per-model failures are logged and skipped — keep cleanup
 /// going even when one harness is unhealthy.
 pub async fn unload_all_models(registry: &HarnessRegistry) {
    let listed = match registry.list_all_models().await {
        Ok(m) => m,
        Err(e) => {
            tracing::warn!(error = %e, "failed to list models during shutdown");
            return;
        }
    };
    if listed.is_empty() {
        return;
    }
    tracing::info!(count = listed.len(), "unloading models for shutdown");
    for model in listed {
        let start = Instant::now();
        match registry.unload_model(&model.id).await {
            Ok(()) => tracing::info!(
                model = %model.id,
                elapsed_ms = start.elapsed().as_millis() as u64,
                "unloaded"
            ),
            Err(e) => tracing::warn!(
                model = %model.id,
                error = %e,
                "unload failed during shutdown"
            ),
        }
    }
 }
--- a/crates/neuron/tests/activation.rs
+++ b/crates/neuron/tests/activation.rs
@@ -1,56 +0,0 @@
 //! Activation-time behaviour: load_default_models continues past
 //! individual failures so a single broken catalogue entry doesn't
 //! prevent the rest of the fleet from starting.
 use cortex_core::harness::{HarnessConfig, ModelSpec};
 use neuron::config::HarnessSettings;
 use neuron::harness::HarnessRegistry;
 use neuron::startup;
 #[tokio::test]
 async fn test_load_default_models_skips_unknown_harness() {
    let registry = HarnessRegistry::from_configs(
        &[HarnessConfig {
            name: "candle".into(),
        }],
        "http://localhost:0",
        &HarnessSettings::default(),
    );
    // Both entries fail synchronously inside the registry — no network
    // call escapes (the harness lookup mismatches before hf-hub is
    // touched). The function should still return cleanly.
    let specs = vec![
        ModelSpec {
            model_id: "model-a".into(),
            harness: "no-such-harness".into(),
            quant: None,
            tensor_parallel: None,
            devices: None,
        },
        ModelSpec {
            model_id: "model-b".into(),
            harness: "no-such-harness".into(),
            quant: None,
            tensor_parallel: None,
            devices: None,
        },
    ];
    startup::load_default_models(&registry, &specs).await;
    let listed = registry
        .list_all_models()
        .await
        .expect("list_all_models should succeed");
    assert!(
        listed.is_empty(),
        "no models should be loaded after failed entries"
    );
 }
 #[tokio::test]
 async fn test_load_default_models_empty_is_noop() {
    let registry = HarnessRegistry::new();
    startup::load_default_models(&registry, &[]).await;
 }
--- a/crates/neuron/tests/api.rs
+++ b/crates/neuron/tests/api.rs
@@ -14,7 +14,6 @@ async fn spawn_neuron(discovery: DiscoveryResponse) -> String {
        discovery,
        health_cache,
        registry: RwLock::new(registry),
        candle: None,
    });
    let app = api::neuron_routes().with_state(state);
@@ -136,30 +135,56 @@ async fn test_models_empty_registry() {
    assert!(body.as_array().unwrap().is_empty());
 }
-/// Verify the candle harness registers, list is empty by default, and a
+/// Spawn a mock mistral.rs backend and a neuron with the mistralrs harness
-/// load attempt for an obviously-bogus model id returns a 4xx error
+/// pointing at it, then test the full model lifecycle through neuron's API.
 /// without crashing the daemon. Real load/unload exercising actual GGUF
 /// download is covered by `tests/candle_lifecycle.rs` (cuda-integration).
 #[tokio::test]
-async fn test_candle_harness_registers_and_rejects_bogus_model() {
+async fn test_models_via_mistralrs_harness() {
    use axum::routing::{get, post};
    use axum::{Json, Router};
    use cortex_core::harness::HarnessConfig;
-    use neuron::config::HarnessSettings;
+    use serde_json::Value;
-    let registry = HarnessRegistry::from_configs(
+    // Mock mistral.rs backend.
-        &[HarnessConfig {
+    let mock_app = Router::new()
-            name: "candle".into(),
+        .route(
-        }],
+            "/v1/models",
-        "http://localhost:13131",
+            get(|| async {
-        &HarnessSettings::default(),
+                Json(json!({
-    );
+                    "data": [
                        {"id": "test-model", "status": "loaded"},
                        {"id": "other-model", "status": "unloaded"}
                    ]
                }))
            }),
        )
        .route(
            "/v1/models/unload",
            post(|Json(_body): Json<Value>| async { Json(json!({"status": "ok"})) }),
        )
        .route(
            "/v1/models/reload",
            post(|Json(_body): Json<Value>| async { Json(json!({"status": "ok"})) }),
        );
    let mock_listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
    let mock_addr = mock_listener.local_addr().unwrap();
    tokio::spawn(async move {
        axum::serve(mock_listener, mock_app).await.unwrap();
    });
    let mock_url = format!("http://{mock_addr}");
    // Build neuron with mistralrs harness pointing at mock.
    let registry = HarnessRegistry::from_configs(&[HarnessConfig {
        name: "mistralrs".into(),
        endpoint: Some(mock_url.clone()),
        systemd_unit: None,
    }]);
    let candle = registry.candle();
    let health_cache = Arc::new(HealthCache::new());
    let state = Arc::new(NeuronState {
        discovery: fake_discovery(),
        health_cache,
        registry: RwLock::new(registry),
        candle,
    });
    let app = api::neuron_routes().with_state(state);
@@ -172,6 +197,7 @@ async fn test_candle_harness_registers_and_rejects_bogus_model() {
    let client = reqwest::Client::new();
    // GET /models — should return models from mock mistralrs.
    let resp = client
        .get(format!("{neuron_url}/models"))
        .send()
@@ -179,140 +205,45 @@ async fn test_candle_harness_registers_and_rejects_bogus_model() {
        .unwrap();
    assert_eq!(resp.status(), 200);
    let models: Vec<serde_json::Value> = resp.json().await.unwrap();
-    assert!(models.is_empty());
+    assert_eq!(models.len(), 2);
    assert_eq!(models[0]["id"], "test-model");
    assert_eq!(models[0]["harness"], "mistralrs");
    assert_eq!(models[0]["status"], "loaded");
    assert_eq!(models[1]["id"], "other-model");
    assert_eq!(models[1]["status"], "unloaded");
-    // Sending a wrong-harness spec should be rejected synchronously
+    // GET /models/test-model/endpoint — should return mock URL.
-    // without touching the network or the model registry.
+    let resp = client
        .get(format!("{neuron_url}/models/test-model/endpoint"))
        .send()
        .await
        .unwrap();
    assert_eq!(resp.status(), 200);
    let body: serde_json::Value = resp.json().await.unwrap();
    assert_eq!(body["url"], mock_url);
    // POST /models/unload — should succeed.
    let resp = client
        .post(format!("{neuron_url}/models/unload"))
        .json(&json!({"model_id": "test-model"}))
        .send()
        .await
        .unwrap();
    assert_eq!(resp.status(), 200);
    let body: serde_json::Value = resp.json().await.unwrap();
    assert_eq!(body["status"], "unloaded");
    // POST /models/load — should succeed.
    let resp = client
        .post(format!("{neuron_url}/models/load"))
        .json(&json!({"model_id": "definitely/not-real", "harness": "not-candle"}))
        .send()
        .await
        .unwrap();
    assert_eq!(resp.status(), 400);
    // Registry still empty.
    let resp = client
        .get(format!("{neuron_url}/models"))
        .send()
        .await
        .unwrap();
    let models: Vec<serde_json::Value> = resp.json().await.unwrap();
    assert!(models.is_empty());
 }
 /// `/v1/chat/completions` returns 503 when no candle harness is registered.
 #[tokio::test]
 async fn test_chat_completions_no_candle_harness() {
    let registry = HarnessRegistry::new();
    let health_cache = Arc::new(HealthCache::new());
    let state = Arc::new(NeuronState {
        discovery: fake_discovery(),
        health_cache,
        registry: RwLock::new(registry),
        candle: None,
    });
    let app = api::neuron_routes().with_state(state);
    let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
    let addr = listener.local_addr().unwrap();
    tokio::spawn(async move {
        axum::serve(listener, app).await.unwrap();
    });
    let url = format!("http://{addr}");
    let resp = reqwest::Client::new()
        .post(format!("{url}/v1/chat/completions"))
        .json(&json!({
-            "model": "anything",
+            "model_id": "test-model",
-            "messages": [{"role": "user", "content": "hi"}]
+            "harness": "mistralrs"
        }))
        .send()
        .await
        .unwrap();
-    assert_eq!(resp.status(), 503);
+    assert_eq!(resp.status(), 200);
-}
+    let body: serde_json::Value = resp.json().await.unwrap();
-
+    assert_eq!(body["status"], "loaded");
 /// `/v1/chat/completions` returns 404 when the requested model isn't loaded.
 #[tokio::test]
 async fn test_chat_completions_model_not_loaded() {
    use cortex_core::harness::HarnessConfig;
    use neuron::config::HarnessSettings;
    let registry = HarnessRegistry::from_configs(
        &[HarnessConfig {
            name: "candle".into(),
        }],
        "http://localhost:0",
        &HarnessSettings::default(),
    );
    let candle = registry.candle();
    let health_cache = Arc::new(HealthCache::new());
    let state = Arc::new(NeuronState {
        discovery: fake_discovery(),
        health_cache,
        registry: RwLock::new(registry),
        candle,
    });
    let app = api::neuron_routes().with_state(state);
    let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
    let addr = listener.local_addr().unwrap();
    tokio::spawn(async move {
        axum::serve(listener, app).await.unwrap();
    });
    let url = format!("http://{addr}");
    let resp = reqwest::Client::new()
        .post(format!("{url}/v1/chat/completions"))
        .json(&json!({
            "model": "definitely/not-loaded",
            "messages": [{"role": "user", "content": "hi"}]
        }))
        .send()
        .await
        .unwrap();
    assert_eq!(resp.status(), 404);
 }
 /// `/v1/chat/completions` with `stream: true` returns 404 when the
 /// model isn't loaded — same surface as the non-streaming path. The
 /// streaming code only kicks in once the model lookup succeeds.
 #[tokio::test]
 async fn test_chat_completions_streaming_model_not_loaded() {
    use cortex_core::harness::HarnessConfig;
    use neuron::config::HarnessSettings;
    let registry = HarnessRegistry::from_configs(
        &[HarnessConfig {
            name: "candle".into(),
        }],
        "http://localhost:0",
        &HarnessSettings::default(),
    );
    let candle = registry.candle();
    let health_cache = Arc::new(HealthCache::new());
    let state = Arc::new(NeuronState {
        discovery: fake_discovery(),
        health_cache,
        registry: RwLock::new(registry),
        candle,
    });
    let app = api::neuron_routes().with_state(state);
    let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
    let addr = listener.local_addr().unwrap();
    tokio::spawn(async move {
        axum::serve(listener, app).await.unwrap();
    });
    let url = format!("http://{addr}");
    let resp = reqwest::Client::new()
        .post(format!("{url}/v1/chat/completions"))
        .json(&json!({
            "model": "definitely/not-loaded",
            "messages": [{"role": "user", "content": "hi"}],
            "stream": true
        }))
        .send()
        .await
        .unwrap();
    assert_eq!(resp.status(), 404);
 }
--- a/crates/neuron/tests/candle_lifecycle.rs
+++ b/crates/neuron/tests/candle_lifecycle.rs
@@ -1,87 +0,0 @@
 //! Real model load/unload lifecycle through the candle harness.
 //!
 //! Gated behind the `cuda-integration` feature because it downloads a
 //! real (small) GGUF from HuggingFace and materialises tensors on the
 //! configured device. Run on a host with network access and either a
 //! CUDA GPU (when built with `--features cuda`) or enough CPU RAM to
 //! hold the model.
 //!
 //! Usage:
 //!   cargo test -p neuron --features cuda-integration --test candle_lifecycle
 //!
 //! Optional environment variables:
 //!   NEURON_TEST_MODEL_ID — HuggingFace repo to load (default: a small
 //!     public Qwen3 GGUF repo).
 //!   NEURON_TEST_QUANT    — quant substring matched against GGUF
 //!     filenames (default: "Q4_K_M").
 //!   HF_HOME              — HuggingFace cache directory.
 #![cfg(feature = "cuda-integration")]
 use cortex_core::harness::{HarnessConfig, ModelSpec};
 use neuron::config::HarnessSettings;
 use neuron::harness::HarnessRegistry;
 use std::path::PathBuf;
 #[tokio::test]
 async fn test_candle_qwen3_load_unload_lifecycle() {
    let _ = tracing_subscriber::fmt()
        .with_test_writer()
        .with_env_filter("info,neuron=debug")
        .try_init();
    let model_id = std::env::var("NEURON_TEST_MODEL_ID")
        .unwrap_or_else(|_| "Qwen/Qwen3-0.6B-GGUF".to_string());
    let quant = std::env::var("NEURON_TEST_QUANT").unwrap_or_else(|_| "Q4_K_M".to_string());
    let mut settings = HarnessSettings::default();
    if let Ok(home) = std::env::var("HF_HOME") {
        settings.candle.hf_cache = Some(PathBuf::from(home));
    }
    let registry = HarnessRegistry::from_configs(
        &[HarnessConfig {
            name: "candle".into(),
        }],
        "http://localhost:13131",
        &settings,
    );
    let spec = ModelSpec {
        model_id: model_id.clone(),
        harness: "candle".into(),
        quant: Some(quant),
        tensor_parallel: None,
        devices: Some(vec![0]),
    };
    registry
        .load_model(&spec)
        .await
        .expect("load_model should succeed");
    let models = registry.list_all_models().await.expect("list_all_models");
    assert_eq!(models.len(), 1, "expected exactly one loaded model");
    assert_eq!(models[0].id, model_id);
    assert_eq!(models[0].harness, "candle");
    assert_eq!(models[0].status, "loaded");
    let url = registry.inference_endpoint(&model_id).await;
    assert_eq!(url, Some("http://localhost:13131".into()));
    // Re-loading the same model should be rejected.
    let again = registry.load_model(&spec).await;
    assert!(again.is_err(), "second load should error");
    registry
        .unload_model(&model_id)
        .await
        .expect("unload_model should succeed");
    let models = registry.list_all_models().await.expect("list_all_models");
    assert!(models.is_empty(), "registry should be empty after unload");
    // Unloading a model that isn't loaded should error.
    let err = registry.unload_model(&model_id).await;
    assert!(err.is_err(), "unload of missing model should error");
 }
--- a/crates/neuron/tests/shutdown.rs
+++ b/crates/neuron/tests/shutdown.rs
@@ -1,32 +0,0 @@
 //! Deactivation behaviour: unload_all_models tolerates an empty
 //! registry and continues past per-model unload failures.
 use cortex_core::harness::HarnessConfig;
 use neuron::config::HarnessSettings;
 use neuron::harness::HarnessRegistry;
 use neuron::startup;
 #[tokio::test]
 async fn test_unload_all_models_empty_registry_is_noop() {
    let registry = HarnessRegistry::new();
    startup::unload_all_models(&registry).await;
 }
 #[tokio::test]
 async fn test_unload_all_models_with_no_loaded_models() {
    let registry = HarnessRegistry::from_configs(
        &[HarnessConfig {
            name: "candle".into(),
        }],
        "http://localhost:0",
        &HarnessSettings::default(),
    );
    startup::unload_all_models(&registry).await;
    let listed = registry
        .list_all_models()
        .await
        .expect("list_all_models should still succeed after shutdown cleanup");
    assert!(listed.is_empty());
 }
--- a/data/cortex-firewalld.xml
+++ b/data/cortex-firewalld.xml
@@ -1,7 +0,0 @@
 <?xml version="1.0" encoding="utf-8"?>
 <service>
  <short>cortex</short>
  <description>Cortex — inference gateway for multi-node GPU clusters</description>
  <port protocol="tcp" port="31313"/>
  <port protocol="tcp" port="31314"/>
 </service>
--- a/data/neuron-firewalld.xml
+++ b/data/neuron-firewalld.xml
@@ -1,6 +0,0 @@
 <?xml version="1.0" encoding="utf-8"?>
 <service>
  <short>helexa-neuron</short>
  <description>Neuron — per-node GPU discovery and harness daemon for cortex</description>
  <port protocol="tcp" port="13131"/>
 </service>
--- a/data/neuron.service
+++ b/data/neuron.service
@@ -10,22 +10,6 @@ Restart=on-failure
 RestartSec=5
 User=neuron
 Group=neuron
 # /var/lib/neuron is the neuron user's $HOME — hf-hub writes its
 # default cache there (~/.cache/huggingface/hub). Without this directive
 # systemd doesn't create the directory and hf-hub downloads fail with
 # "fetch GGUF <file>: failed to create cache dir".
 StateDirectory=neuron
 StateDirectoryMode=0755
 # Loading default_models from neuron.toml happens before the HTTP
 # listener binds; large models can take many minutes to download and
 # materialise on first activation. systemd's default TimeoutStartSec
 # (90s) is far too short; allow 30 minutes.
 TimeoutStartSec=1800s
 # On stop, neuron drains in-flight requests then unloads every model
 # to release CUDA contexts cleanly. Allow generous time for big-model
 # unloads; systemd will SIGKILL after this bound.
 TimeoutStopSec=120s
 KillSignal=SIGTERM
 [Install]
 WantedBy=multi-user.target
--- a/models.example.toml
+++ b/models.example.toml
@@ -6,7 +6,7 @@
 [[models]]
 id = "your-org/large-model"
-harness = "candle"
+harness = "mistralrs"
 quant = "Q4_K_M"
 vram_mb = 19000
 min_devices = 2
@@ -15,7 +15,7 @@ pinned_on = ["gpu-large"]
 [[models]]
 id = "your-org/medium-model"
-harness = "candle"
+harness = "mistralrs"
 quant = "Q6_K"
 vram_mb = 12000
 min_devices = 1
@@ -23,7 +23,7 @@ pinned_on = ["gpu-medium"]
 [[models]]
 id = "your-org/embedding-model"
-harness = "candle"
+harness = "mistralrs"
 quant = "Q8_0"
 vram_mb = 8000
 min_devices = 1
--- a/neuron.example.toml
+++ b/neuron.example.toml
@@ -3,38 +3,14 @@
 # Copy to /etc/neuron/neuron.toml and adjust for your environment.
 #
 # Environment variable overrides use NEURON_ prefix with __ separators:
-#   NEURON_PORT=13131
+#   NEURON_PORT=9090
-port = 13131
+port = 9090
 # -- Harnesses ---------------------------------------------------------------
-# Each [[harnesses]] entry enables an inference engine. Currently only
+# Each [[harnesses]] entry declares an inference engine managed by neuron.
 # "candle" is supported — it runs in-process and uses huggingface/candle
 # for inference on local CUDA devices (or CPU when CUDA is unavailable).
 [[harnesses]]
-name = "candle"
+name = "mistralrs"
-
+endpoint = "http://localhost:8080"
-# -- Candle harness settings -------------------------------------------------
+systemd_unit = "mistralrs.service"
 # Optional tuning for the candle harness.
 [harness.candle]
 # HuggingFace cache directory for model weights. When unset, hf-hub's
 # default (~/.cache/huggingface) is used.
 # hf_cache = "/var/lib/neuron/hf-cache"
 # -- Default models ----------------------------------------------------------
 # Models listed here are loaded automatically when the neuron service
 # activates. Loading is sequential — a slow or failing entry doesn't
 # block the rest of the fleet, but it does push out the time before
 # neuron starts serving HTTP, so keep the list short. Operators can
 # load additional models on demand via POST /models/load.
 #
 # Make sure data/neuron.service's TimeoutStartSec is generous enough to
 # cover the slowest entry's first-time download + materialisation.
 # [[default_models]]
 # model_id = "Qwen/Qwen3-0.6B-GGUF"
 # harness = "candle"
 # quant = "Q4_K_M"
 # devices = [0]
--- a/helexa-neuron.spec
+++ b/helexa-neuron.spec
@@ -1,10 +1,7 @@
-Name:           helexa-neuron
+Name:           neuron
-Version:        0.1.16
+Version:        0.1.8
 Release:        1%{?dist}
 Summary:        Per-node GPU discovery and harness management daemon for cortex
 # Package name disambiguates from Fedora's existing "neuron" package
 # (NEURON neural simulation environment from Yale). Binary, systemd
 # unit, and system user are still called "neuron" for brevity.
 License:        GPL-3.0-or-later
 URL:            https://git.lair.cafe/helexa/cortex
@@ -24,7 +21,6 @@ BuildRequires:  systemd-rpm-macros
 Requires(pre):  shadow-utils
 Requires:       systemd
 Requires:       firewalld-filesystem
 # systemd-rpm-macros ships a unit dep generator that parses User=/Group=
 # from our .service file and emits Requires: user(neuron)/group(neuron).
@@ -37,9 +33,8 @@ Provides:       user(neuron)
 %description
 Neuron is a per-node daemon for cortex inference clusters. It discovers
-local GPU hardware via nvidia-smi, runs in-process inference via
+local GPU hardware via nvidia-smi, manages inference harnesses (mistral.rs,
-huggingface/candle, and exposes an HTTP API for model lifecycle
+llama.cpp), and exposes an HTTP API for model lifecycle management.
 management (load, unload, list, inference endpoint).
 %prep
 %autosetup
@@ -60,7 +55,6 @@ cargo build --release -p neuron
 install -Dm755 target/release/neuron %{buildroot}%{_bindir}/neuron
 install -Dm644 data/neuron.service %{buildroot}%{_unitdir}/neuron.service
 install -Dm644 data/neuron-sysusers.conf %{buildroot}%{_sysusersdir}/neuron.conf
 install -Dm644 data/neuron-firewalld.xml %{buildroot}%{_prefix}/lib/firewalld/services/helexa-neuron.xml
 install -dm755 %{buildroot}%{_sysconfdir}/neuron
 install -Dm644 neuron.example.toml %{buildroot}%{_sysconfdir}/neuron/neuron.toml
@@ -82,20 +76,9 @@ install -Dm644 neuron.example.toml %{buildroot}%{_sysconfdir}/neuron/neuron.toml
 %{_bindir}/neuron
 %{_unitdir}/neuron.service
 %{_sysusersdir}/neuron.conf
 %{_prefix}/lib/firewalld/services/helexa-neuron.xml
 %dir %{_sysconfdir}/neuron
 %config(noreplace) %{_sysconfdir}/neuron/neuron.toml
 %changelog
 * Thu Apr 16 2026 Gitea Actions <actions@git.lair.cafe> - 0.1.16-1
 - chore: ignore local deploy script
 - chore: move default ports out of common-collision ranges
 - ci: drop actions/cache for cargo registry and target
 * Thu Apr 16 2026 Gitea Actions <actions@git.lair.cafe> - 0.1.14-1
 - ci: publish both packages to a single helexa/helexa COPR project
 - fix(rpm): rename neuron package to helexa-neuron
 - ci: commit generated %changelog entries back to main
 * Wed Apr 15 2026 Rob Thijssen <grenade@rob.tn> - 0.1.0-1
 - Initial package
--- a/rpm/cortex-prerelease.spec
+++ b/rpm/cortex-prerelease.spec
@@ -1,102 +0,0 @@
 # Prebuilt-binary spec for cortex.
 #
 # Unlike cortex.spec (which builds from source via cargo), this spec
 # wraps a pre-built `cortex` binary produced by an upstream CI job and
 # packages it for rpm.lair.cafe. The %build phase is a no-op.
 #
 # Required defines at rpmbuild time:
 #   cortex_version    e.g. "0.1.16"
 #   cortex_prerelease e.g. "0.1.20260518gitabcdef0"   (used as Release)
 %global _build_id_links none
 %global debug_package %{nil}
 %global __strip /usr/bin/true
 %{!?cortex_version: %global cortex_version 0.0.0}
 %if 0%{?cortex_prerelease:1}
 %global cortex_release %{cortex_prerelease}
 %else
 %global cortex_release 1
 %endif
 Name:           cortex
 Version:        %{cortex_version}
 Release:        %{cortex_release}%{?dist}
 Summary:        Inference gateway for multi-node GPU clusters (prebuilt)
 License:        GPL-3.0-or-later
 URL:            https://git.lair.cafe/helexa/cortex
 Source0:        cortex
 Source1:        cortex.service
 Source2:        cortex-sysusers.conf
 Source3:        cortex-firewalld.xml
 Source4:        cortex.example.toml
 Source5:        models.example.toml
 Source6:        LICENSE
 ExclusiveArch:  x86_64
 Requires(pre):  shadow-utils
 Requires:       systemd
 Requires:       firewalld-filesystem
 Provides:       user(cortex)
 %description
 Cortex is a Rust reverse-proxy that sits in front of multiple neuron
 inference daemons and presents a unified OpenAI and Anthropic
 compatible API surface.
 This package wraps a binary built upstream in CI; the source-build
 spec (cortex.spec) remains available for stable releases.
 %prep
 cp %{SOURCE0} ./cortex
 cp %{SOURCE1} .
 cp %{SOURCE2} .
 cp %{SOURCE3} .
 cp %{SOURCE4} .
 cp %{SOURCE5} .
 cp %{SOURCE6} .
 %build
 # Already built in the upstream CI build job.
 %install
 install -Dm755 cortex %{buildroot}%{_bindir}/cortex
 install -Dm644 cortex.service %{buildroot}%{_unitdir}/cortex.service
 install -Dm644 cortex-sysusers.conf %{buildroot}%{_sysusersdir}/cortex.conf
 install -Dm644 cortex-firewalld.xml %{buildroot}%{_prefix}/lib/firewalld/services/cortex.xml
 install -dm755 %{buildroot}%{_sysconfdir}/cortex
 install -Dm644 cortex.example.toml %{buildroot}%{_sysconfdir}/cortex/cortex.toml
 install -Dm644 models.example.toml %{buildroot}%{_sysconfdir}/cortex/models.toml
 %pre
 getent group cortex >/dev/null || groupadd -r cortex
 getent passwd cortex >/dev/null || \
    useradd -r -g cortex -d /var/lib/cortex -s /sbin/nologin \
        -c "Cortex inference gateway" cortex
 %post
 %systemd_post cortex.service
 %preun
 %systemd_preun cortex.service
 %postun
 %systemd_postun_with_restart cortex.service
 %files
 %license LICENSE
 %{_bindir}/cortex
 %{_unitdir}/cortex.service
 %{_sysusersdir}/cortex.conf
 %{_prefix}/lib/firewalld/services/cortex.xml
 %dir %{_sysconfdir}/cortex
 %config(noreplace) %{_sysconfdir}/cortex/cortex.toml
 %config(noreplace) %{_sysconfdir}/cortex/models.toml
 %changelog
 * Mon May 18 2026 Gitea Actions <actions@git.lair.cafe> - %{cortex_version}-%{cortex_release}
 - Prerelease build from upstream CI binary.
--- a/rpm/helexa-neuron-prerelease.spec
+++ b/rpm/helexa-neuron-prerelease.spec
@@ -1,122 +0,0 @@
 # Prebuilt-binary spec for helexa-neuron flavoured by CUDA compute capability.
 #
 # Unlike helexa-neuron.spec (which builds from source via cargo), this
 # spec wraps a pre-built `neuron-{flavour}` binary produced by an
 # upstream CI job and packages it for rpm.lair.cafe. The %build phase
 # is a no-op.
 #
 # Required defines at rpmbuild time:
 #   neuron_version    e.g. "0.1.16"
 #   neuron_flavour    e.g. "ada", "blackwell"  — matches the CI build
 #                     matrix's compute_cap label.
 #   neuron_prerelease e.g. "0.1.20260518gitabcdef0"   (used as Release)
 #
 # One flavour can be installed at a time on a given host; flavour
 # packages Conflict with each other.
 %global _build_id_links none
 %global debug_package %{nil}
 %global __strip /usr/bin/true
 %{!?neuron_version: %global neuron_version 0.0.0}
 %{!?neuron_flavour: %global neuron_flavour blackwell}
 %if 0%{?neuron_prerelease:1}
 %global neuron_release %{neuron_prerelease}
 %else
 %global neuron_release 1
 %endif
 Name:           helexa-neuron-%{neuron_flavour}
 Version:        %{neuron_version}
 Release:        %{neuron_release}%{?dist}
 Summary:        Per-node GPU inference daemon (candle, %{neuron_flavour} flavour)
 License:        GPL-3.0-or-later
 URL:            https://git.lair.cafe/helexa/cortex
 Source0:        neuron-%{neuron_flavour}
 Source1:        neuron.service
 Source2:        neuron-sysusers.conf
 Source3:        neuron-firewalld.xml
 Source4:        neuron.example.toml
 Source5:        LICENSE
 ExclusiveArch:  x86_64
 # Binary links against the CUDA runtime, cuDNN, NCCL, etc. Suppress
 # auto-detected exact soname deps — users may have CUDA from various
 # sources (rpmfusion, nvidia-direct) at different compatible versions;
 # a runtime dlopen failure surfaces a clearer error than rpm dep
 # resolution would.
 %global __requires_exclude ^lib(cuda|cudart|cudnn|cublas|cublasLt|curand|nvrtc|nccl)
 Requires(pre):  shadow-utils
 Requires:       systemd
 Requires:       firewalld-filesystem
 Provides:       helexa-neuron = %{neuron_version}-%{neuron_release}
 Provides:       user(neuron)
 # Mutual exclusion across flavours and the source-build variant.
 Conflicts:      helexa-neuron
 Conflicts:      helexa-neuron-ada
 Conflicts:      helexa-neuron-ampere
 Conflicts:      helexa-neuron-blackwell
 # (The Conflicts: with self is filtered by rpm at install time.)
 %description
 Neuron is the per-node daemon for cortex inference clusters. It
 discovers local GPU hardware via nvidia-smi, runs in-process
 inference via huggingface/candle, and exposes an HTTP API for model
 lifecycle management (load, unload, list, inference endpoint).
 This is the %{neuron_flavour} flavour, built for that CUDA compute
 capability. Install the flavour matching the GPUs on this host.
 %prep
 cp %{SOURCE0} ./neuron
 cp %{SOURCE1} .
 cp %{SOURCE2} .
 cp %{SOURCE3} .
 cp %{SOURCE4} .
 cp %{SOURCE5} .
 %build
 # Already built in the upstream CI build job (with --features cuda).
 %install
 install -Dm755 neuron %{buildroot}%{_bindir}/neuron
 install -Dm644 neuron.service %{buildroot}%{_unitdir}/neuron.service
 install -Dm644 neuron-sysusers.conf %{buildroot}%{_sysusersdir}/neuron.conf
 install -Dm644 neuron-firewalld.xml %{buildroot}%{_prefix}/lib/firewalld/services/helexa-neuron.xml
 install -dm755 %{buildroot}%{_sysconfdir}/neuron
 install -Dm644 neuron.example.toml %{buildroot}%{_sysconfdir}/neuron/neuron.toml
 %pre
 getent group neuron >/dev/null || groupadd -r neuron
 getent passwd neuron >/dev/null || \
    useradd -r -g neuron -d /var/lib/neuron -s /sbin/nologin \
        -G video,render \
        -c "Neuron GPU node daemon" neuron
 %post
 %systemd_post neuron.service
 %preun
 %systemd_preun neuron.service
 %postun
 %systemd_postun_with_restart neuron.service
 %files
 %license LICENSE
 %{_bindir}/neuron
 %{_unitdir}/neuron.service
 %{_sysusersdir}/neuron.conf
 %{_prefix}/lib/firewalld/services/helexa-neuron.xml
 %dir %{_sysconfdir}/neuron
 %config(noreplace) %{_sysconfdir}/neuron/neuron.toml
 %changelog
 * Mon May 18 2026 Gitea Actions <actions@git.lair.cafe> - %{neuron_version}-%{neuron_release}
 - Prerelease build from upstream CI binary (%{neuron_flavour} flavour).
--- a/rpm/rpmmacros
+++ b/rpm/rpmmacros
@@ -1 +0,0 @@
 %_openpgp_sign_id @GPG_NAME@
--- a/script/deploy.sh
+++ b/script/deploy.sh
@@ -1,195 +0,0 @@
 #!/bin/env bash
 #
 # Rolling deploy across the helexa fleet, driven by asset/manifest.yml.
 # Installs / upgrades cortex on the gateway host and the appropriate
 # helexa-neuron-<flavour> package on each neuron host, then restarts
 # their services.
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 REPO_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
 MANIFEST="${REPO_DIR}/asset/manifest.yml"
 if [[ ! -f "${MANIFEST}" ]]; then
    echo "fatal: manifest not found at ${MANIFEST}" >&2
    exit 1
 fi
 # Parse the manifest with yq. NOTE: this expects the pip-installed yq
 # (a jq wrapper using jq syntax) — `pip install yq`. The Fedora rpm
 # `yq` is mikefarah/yq and uses different (yaml-native) syntax; if a
 # host has that one instead these queries will fail.
 cortex_host=$(yq -r '.cortex.host' "${MANIFEST}")
 # Emit one TAB-separated 'host\tflavour' line per neuron.
 mapfile -t neuron_entries < <(
    yq -r '.neurons[] | .host + "\t" + .flavour' "${MANIFEST}"
 )
 # Return the installed package's "version-release" string, or
 # "(not installed)" when rpm reports the package as absent. Capture
 # rpm's output into a variable so its "package X is not installed"
 # stdout message (rpm writes that to stdout, not stderr, when -q fails)
 # doesn't leak into the result.
 installed_nvr() {
    local host="$1" pkg="$2"
    local nvr
    if nvr=$(ssh "${host}" "rpm -q --qf '%{version}-%{release}' ${pkg} 2>/dev/null"); then
        echo "${nvr}"
    else
        echo "(not installed)"
    fi
 }
 # Ensure the rpm.lair.cafe unstable repo is configured AND enabled on
 # the remote host.
 #
 # The upstream .repo file at https://rpm.lair.cafe/lair-cafe-unstable.repo
 # ships with `enabled=0` so a host that just fetched it won't start
 # pulling unstable packages by accident. We have to explicitly flip
 # enabled=1 via `dnf config-manager setopt`. Both addrepo and setopt
 # are idempotent.
 #
 # Non-fatal — if either step fails the subsequent `dnf install` will
 # surface a clearer diagnostic on its own.
 ensure_lair_repo() {
    local host="$1"
    if ! ssh "${host}" "test -f /etc/yum.repos.d/lair-cafe-unstable.repo" 2>/dev/null; then
        echo "[${host}] adding rpm.lair.cafe unstable repo"
        if ! ssh "${host}" sudo dnf config-manager addrepo \
            --from-repofile=https://rpm.lair.cafe/lair-cafe-unstable.repo \
            >/dev/null 2>&1; then
            echo "[${host}] WARNING: failed to add lair.cafe repo file (proceeding anyway)"
            return 0
        fi
    fi
    # The .repo file ships enabled=0; flip it on. Cheap, idempotent.
    if ! ssh "${host}" sudo dnf config-manager setopt \
        lair-cafe-unstable.enabled=1 >/dev/null 2>&1; then
        echo "[${host}] WARNING: failed to enable lair-cafe-unstable (proceeding anyway)"
    fi
 }
 # True when the named package needs to be installed or upgraded on the
 # remote host — either it's not present, or a newer version exists in
 # the repo. False only when the installed version is current.
 #
 # `dnf check-update <pkg>` returns 0 when the package isn't installed
 # at all (there's nothing to update), so we have to probe with rpm -q
 # first to distinguish "absent" from "current". Other dnf failures
 # collapse into "needs update" so the subsequent install step surfaces
 # the real diagnostic rather than this check swallowing it.
 needs_update() {
    local host="$1" pkg="$2"
    # Not installed → needs work.
    if ! ssh "${host}" "rpm -q ${pkg}" >/dev/null 2>&1; then
        return 0
    fi
    # Installed; ask dnf whether the repo has something newer.
    if ssh "${host}" sudo dnf check-update --refresh -q "${pkg}" >/dev/null 2>&1; then
        return 1
    else
        return 0
    fi
 }
 # ---------------------------------------------------------------------------
 # cortex (gateway)
 # ---------------------------------------------------------------------------
 ensure_lair_repo "${cortex_host}"
 cortex_nvr=$(installed_nvr "${cortex_host}" cortex)
 if needs_update "${cortex_host}" cortex; then
    echo "[${cortex_host}] cortex update available (current: ${cortex_nvr})"
    # Stop the service only if the unit file exists — fresh installs
    # don't have it, and `systemctl stop` on a missing unit returns
    # non-zero, which would otherwise short-circuit the install branch
    # under set -e.
    if ssh "${cortex_host}" "[ ! -f /usr/lib/systemd/system/cortex.service ] || sudo systemctl stop cortex.service"; then
        echo "[${cortex_host}] stopped cortex service"
        if dnf_output=$(ssh "${cortex_host}" sudo dnf install --refresh --allowerasing -y cortex 2>&1); then
            cortex_nvr=$(installed_nvr "${cortex_host}" cortex)
            echo "[${cortex_host}] installed/upgraded cortex to ${cortex_nvr}"
        else
            echo "[${cortex_host}] failed to install/upgrade cortex:"
            echo "${dnf_output}" | sed "s/^/[${cortex_host}]   /"
        fi
    else
        echo "[${cortex_host}] failed to stop cortex service"
    fi
 else
    echo "[${cortex_host}] cortex is up to date (${cortex_nvr})"
    ssh "${cortex_host}" sudo systemctl stop cortex.service || true
 fi
 # Sync cortex.toml whether the package was upgraded or not — the config
 # can change without a package bump.
 if rsync \
    --archive \
    --compress \
    --rsync-path 'sudo rsync' \
    --chown root:root \
    --chmod 644 \
    "${REPO_DIR}/cortex.toml" \
    "${cortex_host}:/etc/cortex/cortex.toml"; then
    echo "[${cortex_host}] sync'd cortex.toml"
 else
    echo "[${cortex_host}] failed to sync cortex.toml"
 fi
 ssh "${cortex_host}" sudo systemctl daemon-reload
 if ssh "${cortex_host}" systemctl is-active --quiet cortex.service; then
    echo "[${cortex_host}] cortex service is active"
 elif ssh "${cortex_host}" sudo systemctl start cortex.service; then
    echo "[${cortex_host}] started cortex service"
 else
    echo "[${cortex_host}] failed to start cortex service"
 fi
 # ---------------------------------------------------------------------------
 # neuron (per-host, flavour from manifest)
 # ---------------------------------------------------------------------------
 for entry in "${neuron_entries[@]}"; do
    IFS=$'\t' read -r neuron_host neuron_flavour <<< "${entry}"
    package="helexa-neuron-${neuron_flavour}"
    ensure_lair_repo "${neuron_host}"
    neuron_nvr=$(installed_nvr "${neuron_host}" "${package}")
    if needs_update "${neuron_host}" "${package}"; then
        echo "[${neuron_host}] ${package} update available (current: ${neuron_nvr})"
        if ssh "${neuron_host}" "[ ! -f /usr/lib/systemd/system/neuron.service ] || sudo systemctl stop neuron.service"; then
            echo "[${neuron_host}] stopped neuron service"
            # --allowerasing lets dnf swap out a previously-installed
            # bare helexa-neuron or a different flavour without manual
            # intervention. The Conflicts: clauses in the spec ensure
            # only one flavour is ever resident.
            if dnf_output=$(ssh "${neuron_host}" sudo dnf install --refresh --allowerasing -y "${package}" 2>&1); then
                neuron_nvr=$(installed_nvr "${neuron_host}" "${package}")
                echo "[${neuron_host}] installed/upgraded ${package} to ${neuron_nvr}"
                # Ensure firewalld allows neuron port
                ssh "${neuron_host}" "sudo firewall-cmd --query-service=helexa-neuron --quiet 2>/dev/null || sudo firewall-cmd --add-service=helexa-neuron --permanent && sudo firewall-cmd --reload" 2>/dev/null || true
                if ssh "${neuron_host}" "sudo systemctl daemon-reload && sudo systemctl start neuron.service"; then
                    echo "[${neuron_host}] started neuron service"
                else
                    echo "[${neuron_host}] failed to start neuron service"
                fi
            else
                echo "[${neuron_host}] failed to install ${package}:"
                echo "${dnf_output}" | sed "s/^/[${neuron_host}]   /"
            fi
        else
            echo "[${neuron_host}] failed to stop neuron service"
        fi
    else
        echo "[${neuron_host}] ${package} is up to date (${neuron_nvr})"
        if ssh "${neuron_host}" systemctl is-active --quiet neuron.service; then
            echo "[${neuron_host}] neuron service is active"
        elif ssh "${neuron_host}" sudo systemctl start neuron.service; then
            echo "[${neuron_host}] started neuron service"
        else
            echo "[${neuron_host}] failed to start neuron service"
        fi
    fi
 done
--- a/script/generate-packages-json.py
+++ b/script/generate-packages-json.py
@@ -1,154 +0,0 @@
 #!/usr/bin/env python3
 """Parse RPM repodata and emit a packages.json manifest for the UI."""
 import argparse
 import gzip
 import json
 import os
 import subprocess
 import sys
 import xml.etree.ElementTree as ET
 from datetime import datetime, timezone
 RPM_NS = "http://linux.duke.edu/metadata/common"
 OTHER_NS = "http://linux.duke.edu/metadata/other"
 REPO_NS = "http://linux.duke.edu/metadata/repo"
 def find_repodata_file(repodata_dir, data_type):
    """Read repomd.xml and return the path to a specific data type's file."""
    repomd_path = os.path.join(repodata_dir, "repomd.xml")
    tree = ET.parse(repomd_path)
    root = tree.getroot()
    for data in root.findall(f"{{{REPO_NS}}}data"):
        if data.get("type") == data_type:
            location = data.find(f"{{{REPO_NS}}}location")
            if location is not None:
                href = location.get("href", "")
                return os.path.join(os.path.dirname(repodata_dir), href)
    return None
 def open_compressed(path):
    """Open a gzip or zstd compressed file for reading."""
    if path.endswith(".zst"):
        result = subprocess.run(
            ["zstdcat", path], capture_output=True, check=True
        )
        import io
        return io.BytesIO(result.stdout)
    else:
        return gzip.open(path, "rb")
 def parse_primary(repodata_dir):
    """Parse primary.xml.{gz,zst} and return package metadata."""
    path = find_repodata_file(repodata_dir, "primary")
    if not path:
        print("error: primary metadata not found in repomd.xml", file=sys.stderr)
        sys.exit(1)
    packages = {}
    with open_compressed(path) as f:
        tree = ET.parse(f)
    for pkg in tree.getroot().findall(f"{{{RPM_NS}}}package"):
        if pkg.get("type") != "rpm":
            continue
        name = pkg.findtext(f"{{{RPM_NS}}}name", "")
        version_el = pkg.find(f"{{{RPM_NS}}}version")
        ver = version_el.get("ver", "") if version_el is not None else ""
        rel = version_el.get("rel", "") if version_el is not None else ""
        arch = pkg.findtext(f"{{{RPM_NS}}}arch", "")
        size_el = pkg.find(f"{{{RPM_NS}}}size")
        size = int(size_el.get("package", "0")) if size_el is not None else 0
        time_el = pkg.find(f"{{{RPM_NS}}}time")
        build_time = int(time_el.get("build", "0")) if time_el is not None else 0
        location_el = pkg.find(f"{{{RPM_NS}}}location")
        filename = os.path.basename(location_el.get("href", "")) if location_el is not None else ""
        key = f"{name}-{ver}-{rel}"
        packages[key] = {
            "name": name,
            "version": ver,
            "release": rel,
            "arch": arch,
            "summary": pkg.findtext(f"{{{RPM_NS}}}summary", ""),
            "size": size,
            "buildTime": build_time,
            "rpmFilename": filename,
            "changelog": [],
        }
    return packages
 def parse_other(repodata_dir, packages):
    """Parse other.xml.gz and attach changelog entries to packages."""
    path = find_repodata_file(repodata_dir, "other")
    if not path:
        return
    with open_compressed(path) as f:
        tree = ET.parse(f)
    for pkg in tree.getroot().findall(f"{{{OTHER_NS}}}package"):
        name = pkg.get("name", "")
        version_el = pkg.find(f"{{{OTHER_NS}}}version")
        ver = version_el.get("ver", "") if version_el is not None else ""
        rel = version_el.get("rel", "") if version_el is not None else ""
        key = f"{name}-{ver}-{rel}"
        if key not in packages:
            continue
        for entry in pkg.findall(f"{{{OTHER_NS}}}changelog"):
            packages[key]["changelog"].append({
                "author": entry.get("author", ""),
                "date": int(entry.get("date", "0")),
                "text": (entry.text or "").strip(),
            })
 def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--repodata-dir",
        required=True,
        help="path to the repodata/ directory",
    )
    parser.add_argument(
        "--output",
        required=True,
        help="path to write packages.json",
    )
    parser.add_argument(
        "--base-url",
        required=True,
        help="public base URL for the repo (e.g. https://rpm.lair.cafe/fedora/43/x86_64)",
    )
    args = parser.parse_args()
    packages = parse_primary(args.repodata_dir)
    parse_other(args.repodata_dir, packages)
    manifest = {
        "generated": datetime.now(timezone.utc).isoformat(),
        "baseUrl": args.base_url,
        "packages": list(packages.values()),
    }
    with open(args.output, "w") as f:
        json.dump(manifest, f, indent=2)
    print(f"wrote {len(packages)} packages to {args.output}")
 if __name__ == "__main__":
    main()
--- a/script/validate-neuron.sh
+++ b/script/validate-neuron.sh
@@ -1,141 +0,0 @@
 #!/bin/env bash
 #
 # End-to-end smoke test for a deployed neuron.
 #
 # Confirms the daemon is reachable, loads a small public Qwen3 GGUF,
 # fires a reasoning probe at /v1/chat/completions, and prints the
 # answer. Used to validate the candle harness on a real GPU host
 # before trusting it for production traffic, and as a regression test
 # after pushing new neuron builds.
 #
 # Usage:
 #   script/validate-neuron.sh [host] [model_id] [quant]
 #
 # Defaults:
 #   host     = beast.hanzalova.internal
 #   model_id = unsloth/Qwen3-0.6B-GGUF  (official Qwen3-*-GGUF repos
 #              ship Q8_0 only; unsloth's mirror ships the full Q-spectrum
 #              including Q4_K_M)
 #   quant    = Q4_K_M
 set -euo pipefail
 HOST="${1:-beast.hanzalova.internal}"
 MODEL_ID="${2:-unsloth/Qwen3-0.6B-GGUF}"
 QUANT="${3:-Q4_K_M}"
 PORT="${NEURON_PORT:-13131}"
 BASE="http://${HOST}:${PORT}"
 # Reasoning probe — concrete, low-temperature answer that small models
 # can still get right. "Paris" is a strong signal of basic competence
 # beyond gibberish.
 PROBE_PROMPT='What is the capital of France? Respond with the city name only, no punctuation.'
 EXPECT_SUBSTR='Paris'
 MAX_TOKENS=32
 # /models/load is synchronous — neuron blocks the response until the
 # hf-hub download + GGUF parse + tensor materialisation is done. A
 # fresh 0.6B-Q4_K_M is ~400 MB; on a slow link or cold cache that's
 # easily a minute. Pick a generous ceiling.
 LOAD_TIMEOUT=600
 INFER_TIMEOUT=120
 say() { printf '[%s] %s\n' "${HOST}" "$*"; }
 die() { say "FAIL: $*"; exit 1; }
 probe_health() {
    curl --silent --fail --max-time 5 "${BASE}/health" >/dev/null \
        || die "neuron not reachable at ${BASE}/health"
 }
 list_loaded_ids() {
    curl --silent --fail "${BASE}/models" | yq -r '.[].id'
 }
 is_loaded() {
    list_loaded_ids 2>/dev/null | grep -Fxq "${MODEL_ID}"
 }
 trigger_load() {
    say "POST /models/load ${MODEL_ID} (quant=${QUANT}, device=[0])"
    say "  (synchronous; may take a minute on first run while HF downloads)"
    local payload
    payload=$(cat <<EOF
 {
    "model_id": "${MODEL_ID}",
    "harness": "candle",
    "quant": "${QUANT}",
    "devices": [0]
 }
 EOF
    )
    # --write-out captures the response code on a separate line so we
    # can surface a real diagnostic instead of relying on --fail.
    local resp http_code body
    resp=$(curl --silent --show-error --max-time "${LOAD_TIMEOUT}" \
        --write-out '\n__HTTP__%{http_code}' \
        -X POST "${BASE}/models/load" \
        -H 'content-type: application/json' \
        --data "${payload}") || die "curl /models/load failed: $?"
    http_code=$(echo "${resp}" | grep -oP '(?<=__HTTP__)\d+$' | tail -1)
    body=$(echo "${resp}" | sed '$ s/__HTTP__.*$//')
    if [[ "${http_code}" != "200" ]]; then
        die "load returned HTTP ${http_code}: ${body}"
    fi
    say "load returned ${http_code}: ${body}"
 }
 run_probe() {
    say "POST /v1/chat/completions (probe: ${PROBE_PROMPT})"
    local payload
    payload=$(yq -n -c \
        --arg model "${MODEL_ID}" \
        --arg content "${PROBE_PROMPT}" \
        --argjson tokens "${MAX_TOKENS}" \
        '{
            model: $model,
            messages: [{role: "user", content: $content}],
            temperature: 0.1,
            max_tokens: $tokens
        }')
    local resp http_code body
    resp=$(curl --silent --show-error --max-time "${INFER_TIMEOUT}" \
        --write-out '\n__HTTP__%{http_code}' \
        -X POST "${BASE}/v1/chat/completions" \
        -H 'content-type: application/json' \
        --data "${payload}") || die "curl /v1/chat/completions failed: $?"
    http_code=$(echo "${resp}" | grep -oP '(?<=__HTTP__)\d+$' | tail -1)
    body=$(echo "${resp}" | sed '$ s/__HTTP__.*$//')
    if [[ "${http_code}" != "200" ]]; then
        die "inference returned HTTP ${http_code}: ${body}"
    fi
    echo "${body}"
 }
 say "validating neuron at ${BASE}"
 probe_health
 say "/health OK"
 if is_loaded; then
    say "${MODEL_ID} already loaded"
 else
    trigger_load
 fi
 raw=$(run_probe)
 echo "---"
 echo "${raw}" | yq -r '.'
 echo "---"
 content=$(echo "${raw}" | yq -r '.choices[0].message.content // empty')
 if [[ -z "${content}" ]]; then
    die "no content in chat completion response"
 fi
 say "assistant said: ${content}"
 if echo "${content}" | grep -qiF "${EXPECT_SUBSTR}"; then
    say "PASS — response contains expected substring '${EXPECT_SUBSTR}'"
    exit 0
 else
    die "response did not contain '${EXPECT_SUBSTR}'"
 fi
Author	SHA1	Message	Date
rob thijssen	e874c3483d	fix(rpm): explicitly Provides user(name) to satisfy systemd unit Requires Some checks failed CI / Build cortex SRPM (push) Has been cancelled Details CI / Build neuron SRPM (push) Has been cancelled Details CI / Publish cortex to COPR (push) Has been cancelled Details CI / Publish neuron to COPR (push) Has been cancelled Details CI / Bump version in source (push) Has been cancelled Details CI / Format, lint, build, test (push) Has been cancelled Details Diagnosing the persistent "Nothing to do" on v0.1.10 surfaced that removing %attr(,,name) from %files wasn't enough. systemd-rpm-macros ships its own rpm dep generator (/usr/lib/rpm/systemd.req) that parses User=/Group= directives from every .service file the package ships and emits Requires: user(NAME)/group(NAME) accordingly. Rpmbuild log from v0.1.10 shows these Requires are still emitted even after the %attr removal. Meanwhile the sysusers provides-generator emits group(NAME) in both unversioned and versioned forms, but only a versioned user(NAME) = <base64> when the u-line has GECOS/home/shell fields. The asymmetry leaves Requires: user(NAME) unresolvable. Add explicit Provides: user(NAME) back to both specs, with a comment documenting the actual cause (systemd unit parsing, not file attrs) so the next person touching these specs doesn't repeat the mistake. Why monsoon didn't hit this: it creates its user in %pre via groupadd/useradd (not sysusers.d), so no Provides are generated at all — matching the Requires: user(monsoon) by luck of the rpm solver treating unknown symbols as soft-fails for that path. Ours went through the sysusers Provides code path and hit the asymmetry instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 15:30:55 +03:00
rob thijssen	2caaae018a	ci: migrate rpm changelog generation to reusable action Replace the local .gitea/scripts/generate-rpm-changelog.sh with the shared composite action at https://git.lair.cafe/actions/rpm-changelog@v1. Behaviour is identical — collect commits since the previous v* tag, filter bump-version and merge noise, prepend a dated entry to the spec — but the logic now lives in one place that other projects can consume. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 15:23:45 +03:00
rob thijssen	18d00001cf	ci: auto-generate rpm changelog entry per release On every tag push, build a %changelog entry from the git log since the previous v* tag and prepend it to each spec. Stops the initial entry from drifting further and catches bogus-date / stale-version warnings automatically since the generated date always matches the day the CI runs. The generator drops "chore: bump version" commits (bot-authored, noisy in user-facing changelogs) and merge commits. Author defaults to the gitea-actions identity but can be overridden via CHANGELOG_AUTHOR env var if a human release is desired. Requires fetch-depth: 0 on checkout so git describe can see prior tags and git log can reach them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 15:04:36 +03:00
rob thijssen	ad1442c096	fix(rpm): correct weekday in changelog entry April 15 2026 was a Wednesday, not Tuesday. rpmbuild validates the day-of-week against the date and warns on mismatch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 14:58:40 +03:00
		`@@ -0,0 +1 @@`
							`// llama.cpp harness implementation — Phase 11.`