Wires cudarc::nccl into the TP worker lifecycle introduced in 7a-i.

With --features cuda the leader and its workers now establish a live
NCCL communicator end-to-end; without the feature the same code paths
return Error{kind="cuda_feature_not_enabled"}, so a misconfigured
build fails visibly instead of silently doing nothing.
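
The feature gate described above can be sketched roughly as follows. This is a minimal illustration, not the crate's code: the `WorkerError` struct and the `init_nccl` signature are hypothetical, and only the `cuda` feature name and the `cuda_feature_not_enabled` kind string come from the commit text.

```rust
/// Hypothetical error shape mirroring the `Error{kind=...}` reply
/// named in the commit text.
#[derive(Debug, Clone, PartialEq, Eq)]
struct WorkerError {
    kind: &'static str,
    message: String,
}

/// Compiled only with `--features cuda`. In the real crate this is
/// where the NCCL rendezvous would run; it is elided in this sketch.
#[cfg(feature = "cuda")]
fn init_nccl(rank: usize, world_size: usize, comm_id_hex: &str) -> Result<(), WorkerError> {
    let _ = (rank, world_size, comm_id_hex);
    Ok(())
}

/// Compiled when the `cuda` feature is off: same signature, but the
/// call always fails with a recognisable kind instead of pretending
/// to succeed.
#[cfg(not(feature = "cuda"))]
fn init_nccl(rank: usize, world_size: usize, comm_id_hex: &str) -> Result<(), WorkerError> {
    let _ = (rank, world_size, comm_id_hex);
    Err(WorkerError {
        kind: "cuda_feature_not_enabled",
        message: "rebuild with --features cuda to enable NCCL".to_string(),
    })
}
```

Because both variants share one signature, callers need no `cfg` of their own; a test can match on `kind` to tell a CPU-only build from a genuine NCCL failure.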

The NCCL state machine (harness/tp/nccl_state.rs) is shared between the
worker process and the leader's pool:
- generate_comm_id_hex() mints an Id::new() on the leader.
- NcclState::init parses 256 hex chars → [c_char; 128] → Id::uninit,
  opens a CudaContext on the configured device, and calls
  Comm::from_rank with the supplied (rank, world_size, id). NCCL blocks
  until every rank has joined.
- NcclState::sanity_check runs one all_reduce(1u32, Sum); the leader
  asserts every rank reports observed_sum == world_size.
- NCCL handles are serialised under a Mutex; unsafe impl Send/Sync lets
  the Comm cross spawn_blocking boundaries (NCCL is move-safe; only
  concurrent op issuance is unsafe).
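
The wire encoding (128 id bytes carried as 256 hex characters) might look something like the sketch below. The function names `encode_comm_id_hex`, `decode_comm_id_hex`, and `hex_val` are made up for illustration and the real helpers in nccl_state.rs may differ; the sketch deliberately stops at the `[c_char; 128]` array and does not touch cudarc.

```rust
use std::os::raw::c_char;

/// Size of the NCCL unique id in bytes, per the commit text.
const ID_BYTES: usize = 128;

/// Encode the raw id bytes as 256 lowercase hex characters.
fn encode_comm_id_hex(raw: &[c_char; ID_BYTES]) -> String {
    let mut out = String::with_capacity(ID_BYTES * 2);
    for &b in raw.iter() {
        // c_char is i8 or u8 depending on the target; normalise to u8.
        out.push_str(&format!("{:02x}", b as u8));
    }
    out
}

/// Map one ASCII hex digit to its 4-bit value.
fn hex_val(c: u8) -> Result<u8, String> {
    match c {
        b'0'..=b'9' => Ok(c - b'0'),
        b'a'..=b'f' => Ok(c - b'a' + 10),
        b'A'..=b'F' => Ok(c - b'A' + 10),
        _ => Err(format!("invalid hex character: {:?}", c as char)),
    }
}

/// Decode exactly 256 hex characters back into the raw id bytes.
fn decode_comm_id_hex(hex: &str) -> Result<[c_char; ID_BYTES], String> {
    if hex.len() != ID_BYTES * 2 {
        return Err(format!(
            "expected {} hex chars, got {}",
            ID_BYTES * 2,
            hex.len()
        ));
    }
    let mut out = [0 as c_char; ID_BYTES];
    for (i, pair) in hex.as_bytes().chunks_exact(2).enumerate() {
        let hi = hex_val(pair[0])?;
        let lo = hex_val(pair[1])?;
        out[i] = ((hi << 4) | lo) as c_char;
    }
    Ok(out)
}
```

Rejecting wrong lengths and non-hex bytes up front keeps a corrupted Init message from reaching NCCL, where a bad id would more likely show up as a hang than as a clear error.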

WorkerPool::init_nccl orchestrates the rendezvous:
1. Write Init { comm_id } to every worker's stdin (no await yet).
2. Leader rank 0 calls its own Comm::from_rank in spawn_blocking,
   concurrently with workers.
3. NCCL handshake completes for all ranks simultaneously.
4. Leader collects InitOk responses.

WorkerPool::nccl_sanity_check follows the same pattern over
all_reduce, validating world_size == observed_sum on every rank.
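
As a rough illustration of what the leader checks, the sketch below replaces the real all_reduce with a plain sum and then validates every rank's report. `SanityReport`, `simulate_all_reduce_sum`, and `validate_sanity` are hypothetical names, not the crate's API.

```rust
/// One rank's reply to the sanity request (hypothetical shape).
#[derive(Debug, Clone, Copy)]
struct SanityReport {
    rank: usize,
    observed_sum: u32,
}

/// Stand-in for all_reduce(1u32, Sum): every rank contributes 1 and
/// every rank observes the same total.
fn simulate_all_reduce_sum(world_size: usize) -> Vec<SanityReport> {
    let total: u32 = (0..world_size).map(|_| 1u32).sum();
    (0..world_size)
        .map(|rank| SanityReport { rank, observed_sum: total })
        .collect()
}

/// Leader-side check: one report per rank, each equal to world_size.
fn validate_sanity(world_size: usize, reports: &[SanityReport]) -> Result<(), String> {
    if reports.len() != world_size {
        return Err(format!(
            "expected {} reports, got {}",
            world_size,
            reports.len()
        ));
    }
    for r in reports {
        if r.observed_sum as usize != world_size {
            return Err(format!(
                "rank {} observed sum {}, expected {}",
                r.rank, r.observed_sum, world_size
            ));
        }
    }
    Ok(())
}
```

A sum that differs from world_size means some rank joined a different communicator or did not contribute, which is why a single u32 is enough to catch a broken rendezvous.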

Worker.send_only and Worker.recv_only are split out of the previous
monolithic Worker.request so the leader can interleave its own NCCL
work with the worker calls. This is required because NCCL blocks
during init.
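
The ordering constraint can be modelled without CUDA. In the sketch below, threads stand in for worker processes, channels stand in for stdin/stdout, and a `std::sync::Barrier` stands in for the blocking `Comm::from_rank` call; all type and function names are illustrative assumptions rather than the crate's actual definitions.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Barrier};
use std::thread;

enum Request {
    Init { comm_id: String },
}

#[derive(Debug, PartialEq)]
enum Response {
    InitOk { rank: usize },
}

/// Leader-side handle to one worker (channels replace stdin/stdout).
struct Worker {
    tx: Sender<Request>,
    rx: Receiver<Response>,
}

impl Worker {
    /// Write a request without waiting for the reply.
    fn send_only(&self, req: Request) {
        self.tx.send(req).expect("worker hung up");
    }

    /// Block until the worker's reply arrives.
    fn recv_only(&self) -> Response {
        self.rx.recv().expect("worker hung up")
    }
}

fn spawn_worker(rank: usize, rendezvous: Arc<Barrier>) -> Worker {
    let (req_tx, req_rx) = channel::<Request>();
    let (resp_tx, resp_rx) = channel::<Response>();
    thread::spawn(move || {
        if let Ok(Request::Init { comm_id }) = req_rx.recv() {
            let _ = comm_id; // a real worker would decode this into an Id
            // Stand-in for Comm::from_rank: blocks until every rank,
            // including the leader, has arrived.
            rendezvous.wait();
            let _ = resp_tx.send(Response::InitOk { rank });
        }
    });
    Worker { tx: req_tx, rx: resp_rx }
}

fn init_all(world_size: usize) -> Vec<Response> {
    let rendezvous = Arc::new(Barrier::new(world_size));
    let workers: Vec<Worker> = (1..world_size)
        .map(|rank| spawn_worker(rank, Arc::clone(&rendezvous)))
        .collect();

    // Step 1: send Init to every worker, waiting on none of them.
    for w in &workers {
        w.send_only(Request::Init { comm_id: "00".repeat(128) });
    }
    // Steps 2 and 3: the leader (rank 0) joins the rendezvous itself.
    rendezvous.wait();
    // Step 4: only now collect the replies.
    workers.iter().map(|w| w.recv_only()).collect()
}
```

In this model, a combined send-then-receive per worker would stall on the first worker: it cannot reply until the leader reaches the barrier, and the leader would never get there while waiting for that reply.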

Tests:
- Four hex round-trip unit tests for the wire encoding.
- The 7a-i "not implemented" expectation now reads
  "cuda_feature_not_enabled" on the local dev box (no CUDA), or
  accepts InitOk on a cuda-built test binary.
- New cuda-integration test in tp_worker_lifecycle_cuda.rs covers
  the real init + sanity round-trip; gated on the cuda-integration
  feature so default CI doesn't attempt to run NCCL.

Verifiable on beast (2× RTX 5090):
    cargo test -p neuron --features cuda-integration \
        --test tp_worker_lifecycle_cuda

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
77 lines
2.2 KiB
TOML
[package]
name = "neuron"
version.workspace = true
edition.workspace = true
license.workspace = true

[lib]
name = "neuron"
path = "src/lib.rs"

[[bin]]
name = "neuron"
path = "src/main.rs"

[features]
default = []
# Enables CUDA acceleration in candle and the cudarc/nccl bindings the
# TP worker pool uses. Without this feature, candle compiles for CPU
# only, Device::new_cuda calls fall back to CPU, and TP Init/sanity
# requests return Error{kind="cuda_feature_not_enabled"}.
cuda = [
    "candle-core/cuda",
    "candle-core/nccl",
    "candle-nn/cuda",
    "candle-transformers/cuda",
    "dep:cudarc",
]
# Use cuDNN for convolution / attention kernels. Requires CUDA.
cudnn = [
    "cuda",
    "candle-core/cudnn",
    "candle-nn/cudnn",
    "candle-transformers/cudnn",
]
# FlashAttention kernels. Requires CUDA.
flash-attn = [
    "cuda",
    "candle-transformers/flash-attn",
]
# Reserved for GPU-only integration tests in later stages.
cuda-integration = ["cuda"]

[dependencies]
cortex-core.workspace = true
tokio.workspace = true
axum.workspace = true
serde.workspace = true
serde_json.workspace = true
reqwest.workspace = true
tracing.workspace = true
tracing-subscriber.workspace = true
anyhow.workspace = true
async-trait.workspace = true
clap.workspace = true
thiserror.workspace = true
futures.workspace = true
tokio-stream.workspace = true
figment.workspace = true
toml.workspace = true

# candle for in-process inference. CUDA support is gated behind the
# crate's `cuda` feature (default off) so the workspace builds on
# non-CUDA hosts and CI runners.
candle-core = "0.10.2"
candle-nn = "0.10.2"
candle-transformers = "0.10.2"
# Direct dep on cudarc (matching candle's transitive version) so the
# TP worker pool can call cudarc::nccl::{Comm, Id} directly. Gated on
# the `cuda` feature; same toolchain requirement as candle's CUDA path.
cudarc = { version = "0.19", optional = true, default-features = false, features = ["nccl", "cuda-version-from-build-system"] }
tokenizers = { version = "0.22", default-features = false, features = ["onig"] }
hf-hub = { version = "0.4", features = ["tokio"] }

[dev-dependencies]
tokio = { workspace = true, features = ["test-util"] }
reqwest.workspace = true