All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m25s
CI / Test (push) Successful in 4m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 5m7s
build-prerelease / Build neuron-ada (push) Successful in 5m19s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Final structural slice of the per-device CUDA context-ownership refactor. The four remaining spawn_blocking sites that did CUDA work on the leader are gone: - Single-GPU GGUF load (`load_arch_gguf` spawn_blocking) → `Job::LoadGguf` dispatched on the worker. - Single-GPU dense load (`load_arch_dense` spawn_blocking) → `Job::LoadDense` on the worker. - TP shard load (`WorkerPool::load_dense_shard` spawn_blocking) → `Job::TpLoadShard`. The dispatch handler reads `state.nccl.comm()` directly — no cross-thread `Arc<Comm>` transfer, no `SendComm` wrapper for this path. The Phase 2 / Phase 3 bridges that moved freshly-built models across the channel boundary (`Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm`) are removed. Models are now constructed on the worker thread directly; the slab gets populated by `insert_arch` / the inline `tp_models.insert` in dispatch handlers. What this phase preserves: - CPU loads still use `tokio::task::spawn_blocking` against `Arc<Mutex<ModelArch>>`. There's no CUDA context to own on CPU and channel overhead would only add latency. Four `spawn_blocking` references remain in `candle.rs` (load_arch_gguf, load_arch_dense, chat_completion, chat_completion_stream) and all are deliberate CPU-only fallback. - Public API unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical signatures. What this phase removes: - `SendComm` wrapper is no longer used in the load path (the Phase 3 bridge that justified it). It remains in `nccl_state.rs` for the Phase 1–3 era and any future cross-thread Comm move; consider deleting in a follow-up. - `Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm` and their handle convenience methods deleted. - The leader_device parameter on `load_dense_shard` is now `_` — unused since the worker has its own bound device. Removing the arg outright is a public-API change; keeping the underscore prefix preserves the signature and signals deadness without churn. Helper relocation: - `LlamaDense::from_parts` is a new pub(crate) constructor so the worker-thread loader can build a `LlamaDense` without going through the original `load_arch_dense` async function. - `check_dense_config_supported` is bumped to `pub(crate)` for the same reason. Sweep verified: `grep -rn spawn_blocking crates/neuron/src/harness/` returns only CPU-fallback hits in `candle.rs` + doc-comment references to the old design. All four leader-side CUDA `spawn_blocking` sites are gone. fmt + clippy clean; 37 lib tests + all integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>