refactor(neuron): phase 4 — model loads move onto the device worker
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m25s
CI / Test (push) Successful in 4m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 5m7s
build-prerelease / Build neuron-ada (push) Successful in 5m19s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m25s
CI / Test (push) Successful in 4m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 5m7s
build-prerelease / Build neuron-ada (push) Successful in 5m19s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Final structural slice of the per-device CUDA context-ownership refactor. The four remaining spawn_blocking sites that did CUDA work on the leader are gone: - Single-GPU GGUF load (`load_arch_gguf` spawn_blocking) → `Job::LoadGguf` dispatched on the worker. - Single-GPU dense load (`load_arch_dense` spawn_blocking) → `Job::LoadDense` on the worker. - TP shard load (`WorkerPool::load_dense_shard` spawn_blocking) → `Job::TpLoadShard`. The dispatch handler reads `state.nccl.comm()` directly — no cross-thread `Arc<Comm>` transfer, no `SendComm` wrapper for this path. The Phase 2 / Phase 3 bridges that moved freshly-built models across the channel boundary (`Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm`) are removed. Models are now constructed on the worker thread directly; the slab gets populated by `insert_arch` / the inline `tp_models.insert` in dispatch handlers. What this phase preserves: - CPU loads still use `tokio::task::spawn_blocking` against `Arc<Mutex<ModelArch>>`. There's no CUDA context to own on CPU and channel overhead would only add latency. Four `spawn_blocking` references remain in `candle.rs` (load_arch_gguf, load_arch_dense, chat_completion, chat_completion_stream) and all are deliberate CPU-only fallback. - Public API unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical signatures. What this phase removes: - `SendComm` wrapper is no longer used in the load path (the Phase 3 bridge that justified it). It remains in `nccl_state.rs` for the Phase 1–3 era and any future cross-thread Comm move; consider deleting in a follow-up. - `Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm` and their handle convenience methods deleted. - The leader_device parameter on `load_dense_shard` is now `_` — unused since the worker has its own bound device. Removing the arg outright is a public-API change; keeping the underscore prefix preserves the signature and signals deadness without churn. Helper relocation: - `LlamaDense::from_parts` is a new pub(crate) constructor so the worker-thread loader can build a `LlamaDense` without going through the original `load_arch_dense` async function. - `check_dense_config_supported` is bumped to `pub(crate)` for the same reason. Sweep verified: `grep -rn spawn_blocking crates/neuron/src/harness/` returns only CPU-fallback hits in `candle.rs` + doc-comment references to the old design. All four leader-side CUDA `spawn_blocking` sites are gone. fmt + clippy clean; 37 lib tests + all integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -307,6 +307,26 @@ pub struct LlamaDense {
|
||||
}
|
||||
|
||||
impl LlamaDense {
|
||||
/// Constructor used by the dispatch-side loader. Keeps the field
|
||||
/// names private while letting the worker thread build a
|
||||
/// `LlamaDense` from already-loaded weights without going through
|
||||
/// async candle code.
|
||||
pub(crate) fn from_parts(
|
||||
model: llama_dense::Llama,
|
||||
cache: llama_dense::Cache,
|
||||
config: llama_dense::Config,
|
||||
dtype: DType,
|
||||
device: Device,
|
||||
) -> Self {
|
||||
Self {
|
||||
model,
|
||||
cache,
|
||||
config,
|
||||
dtype,
|
||||
device,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn forward(&mut self, input: &Tensor, offset: usize) -> Result<Tensor> {
|
||||
Ok(self.model.forward(input, offset, &mut self.cache)?)
|
||||
}
|
||||
@@ -348,7 +368,7 @@ const DENSE_SUPPORTED_MODEL_TYPES: &[&str] = &["llama", "qwen3", "qwen3_5", "qwe
|
||||
/// The result message names the model_type we saw, the supported set,
|
||||
/// and points at the files an operator (or future contributor) needs
|
||||
/// to touch to grow the supported set.
|
||||
fn check_dense_config_supported(config_json: &str, model_id: &str) -> Result<()> {
|
||||
pub(crate) fn check_dense_config_supported(config_json: &str, model_id: &str) -> Result<()> {
|
||||
let v: serde_json::Value = serde_json::from_str(config_json)
|
||||
.with_context(|| format!("parse config.json for '{model_id}' as JSON"))?;
|
||||
let model_type = v.get("model_type").and_then(|x| x.as_str()).unwrap_or("");
|
||||
@@ -1547,42 +1567,47 @@ impl Harness for CandleHarness {
|
||||
let devices = spec.devices.clone().unwrap_or_else(|| vec![0]);
|
||||
let device = Self::pick_device(&devices)?;
|
||||
|
||||
// Dispatch by source format: GGUF (pre-quantized, single-GPU
|
||||
// only path) vs safetensors dense (bf16/fp16; the path that
|
||||
// grows TP support). `spec.quant` is the signal — Some means
|
||||
// the operator picked a quantized GGUF; None means dense.
|
||||
let (tokenizer_path, arch) = if spec.quant.is_some() {
|
||||
self.load_arch_gguf(spec, &device).await?
|
||||
} else {
|
||||
self.load_arch_dense(spec, &device).await?
|
||||
};
|
||||
|
||||
let tokenizer = Tokenizer::from_file(&tokenizer_path)
|
||||
.map_err(|e| anyhow::anyhow!("load tokenizer: {e}"))?;
|
||||
|
||||
// Worker thread for the chosen device. CPU loads (CUDA
|
||||
// unavailable / not requested) skip the worker — there's no
|
||||
// context to own. For CUDA loads, the arch is transferred
|
||||
// into the worker's slab now so the inference path can
|
||||
// reference it via the returned `ArchHandle`. The explicit
|
||||
// type annotation lets the no-cuda build resolve `None` to
|
||||
// the right `Option<Arc<DeviceWorkerHandle>>` type.
|
||||
// Phase 4: load directly on the worker thread for CUDA;
|
||||
// legacy spawn_blocking + Arc<Mutex<>> only for CPU. Resolve
|
||||
// hf-hub paths up front (always async), then either dispatch
|
||||
// a load Job (CUDA) or call the legacy local loader (CPU).
|
||||
let worker: Option<Arc<super::device_worker::DeviceWorkerHandle>> = match &device {
|
||||
#[cfg(feature = "cuda")]
|
||||
Device::Cuda(_) => Some(self.ensure_device_worker(devices[0]).await?),
|
||||
_ => None,
|
||||
};
|
||||
let (arch_local, arch_handle) = match &worker {
|
||||
Some(w) => {
|
||||
|
||||
let (tokenizer_path, arch_local, arch_handle) = if let Some(w) = &worker {
|
||||
// CUDA path: resolve, then load in the worker.
|
||||
if spec.quant.is_some() {
|
||||
let (gguf_path, tokenizer_path) = self.resolve_files(spec).await?;
|
||||
let handle = w
|
||||
.transfer_in(Box::new(arch))
|
||||
.load_gguf(gguf_path, spec.model_id.clone())
|
||||
.await
|
||||
.map_err(|e| anyhow::anyhow!("transfer arch into device worker: {e}"))?;
|
||||
(None, Some(handle))
|
||||
.map_err(|e| anyhow::anyhow!("worker load_gguf: {e}"))?;
|
||||
(tokenizer_path, None, Some(handle))
|
||||
} else {
|
||||
let (config_path, tokenizer_path, safetensors_paths) =
|
||||
self.resolve_dense_files(spec).await?;
|
||||
let handle = w
|
||||
.load_dense(config_path, safetensors_paths, spec.model_id.clone())
|
||||
.await
|
||||
.map_err(|e| anyhow::anyhow!("worker load_dense: {e}"))?;
|
||||
(tokenizer_path, None, Some(handle))
|
||||
}
|
||||
None => (Some(Arc::new(Mutex::new(arch))), None),
|
||||
} else {
|
||||
// CPU path: legacy spawn_blocking + Arc<Mutex<ModelArch>>.
|
||||
let (tokenizer_path, arch) = if spec.quant.is_some() {
|
||||
self.load_arch_gguf(spec, &device).await?
|
||||
} else {
|
||||
self.load_arch_dense(spec, &device).await?
|
||||
};
|
||||
(tokenizer_path, Some(Arc::new(Mutex::new(arch))), None)
|
||||
};
|
||||
|
||||
let tokenizer = Tokenizer::from_file(&tokenizer_path)
|
||||
.map_err(|e| anyhow::anyhow!("load tokenizer: {e}"))?;
|
||||
|
||||
let loaded = Arc::new(LoadedModel {
|
||||
model_id: spec.model_id.clone(),
|
||||
arch: arch_local,
|
||||
|
||||
Reference in New Issue
Block a user