First slice of the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. Adds the
infrastructure for a dedicated OS thread per CUDA device that owns the
device's `CudaContext` for the daemon's lifetime, and routes the 8
async-context `device_vram_mb()` call sites in candle.rs through it.

What this phase changes:

- New module `harness/device_worker/` (mod.rs, jobs.rs, dispatch.rs).
  `DeviceWorkerHandle::spawn(idx)` creates a named OS thread
  (`cuda-dev-N`), binds `CudaContext::new(idx)` once at startup, and
  enters a dispatch loop reading `Job`s off a `std::sync::mpsc` channel.
  Replies cross back via `tokio::sync::oneshot::Sender` so async callers
  await without parking a tokio worker.
- Two Job variants: `QueryVram` and `Shutdown`. Phases 2–4 add Forward,
  ClearKv, NCCL init/sanity, and load variants.
- `LoadedModel` and `TpLoadedModel` gain a `worker` field populated at
  load time by a new `CandleHarness::ensure_device_worker(idx)` method
  that lazily spawns + caches one worker per device index.
- Per-model `query_vram()` convenience method on both struct types so
  the 8 call sites in chat_completion / chat_completion_stream /
  chat_completion_tp_inner / chat_completion_tp_stream become
  `loaded.query_vram().await` (or `tp.query_vram().await`): same field
  values logged, just sourced from the owner thread instead of the
  caller thread.

What this phase doesn't touch (yet):

- Forward, kv-cache clear, model load, NCCL: still on `spawn_blocking`.
  Phase 2 moves the single-GPU forward + clear; Phase 3 moves the TP
  forward + NCCL bring-up; Phase 4 moves the loads and deletes the
  now-unused `device_vram_mb` / `cuda_mem_mb` helpers.
- Public API: unchanged. `Harness::load_model`, `chat_completion`, HTTP
  routes all keep identical shapes.

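For orientation, a minimal std-only sketch of the owner-thread pattern
described above. It is not the code in `harness/device_worker/`. The
assumptions: the reply travels over a second `std::sync::mpsc` channel
and the caller blocks on it (the real worker replies via
`tokio::sync::oneshot` so callers can await), no `CudaContext` is bound,
the VRAM figures are placeholders, and the `(u64, u64)` reply shape, the
unit-struct form of `Gone`, and `query_vram` living directly on the
handle are all hypothetical.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Jobs the worker understands; mirrors the two Phase 1 variants.
enum Job {
    /// Ask for VRAM figures; the reply goes back on the embedded sender.
    QueryVram(Sender<(u64, u64)>),
    /// Ask the dispatch loop to exit.
    Shutdown,
}

/// Assumed shape of the error returned once the worker thread has exited.
#[derive(Debug, PartialEq)]
struct Gone;

struct DeviceWorkerHandle {
    tx: Sender<Job>,
    join: Option<JoinHandle<()>>,
}

impl DeviceWorkerHandle {
    /// Spawn a named OS thread that owns the (here imaginary) device state.
    fn spawn(idx: usize) -> std::io::Result<Self> {
        let (tx, rx) = mpsc::channel::<Job>();
        let join = thread::Builder::new()
            .name(format!("cuda-dev-{idx}"))
            .spawn(move || {
                // The real worker would bind the device context here, once.
                for job in rx {
                    match job {
                        Job::QueryVram(reply) => {
                            // Placeholder numbers; ignore a caller that gave up.
                            let _ = reply.send((0, 0));
                        }
                        Job::Shutdown => break,
                    }
                }
            })?;
        Ok(Self { tx, join: Some(join) })
    }

    /// Submit a query and block for the answer (the real API is async).
    fn query_vram(&self) -> Result<(u64, u64), Gone> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.tx.send(Job::QueryVram(reply_tx)).map_err(|_| Gone)?;
        reply_rx.recv().map_err(|_| Gone)
    }

    /// Ask the loop to exit and wait for the thread to finish.
    fn shutdown(&mut self) {
        let _ = self.tx.send(Job::Shutdown);
        if let Some(join) = self.join.take() {
            let _ = join.join();
        }
    }
}
```

In this sketch a submit after `shutdown()` fails because the receiver
was dropped when the thread exited, which is one way the post-shutdown
`Gone` case can arise.
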
Tests:

- 5 new unit tests in `device_worker/mod.rs::tests` cover spawn → query
  → shutdown round-trip, thread naming, post-shutdown submit returns
  `Gone`, poisoned flag fast-rejects, and concurrent jobs drain across a
  Shutdown. CPU build (the only one CI runs) is enough to exercise
  channel mechanics.
- All 37 lib tests + all integration tests pass; fmt + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>