Stage 7a-i: TP worker lifecycle scaffolding
All checks were successful
CI / Format (push) Successful in 36s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m12s
CI / Test (push) Successful in 4m25s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m49s
build-prerelease / Build cortex binary (push) Successful in 4m22s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 5m9s
build-prerelease / Build neuron-ada (push) Successful in 4m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m38s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s
Leader → worker process plumbing for tensor parallelism. The neuron
binary picks up two modes: default (the existing daemon, axum + HTTP)
and `--worker` (a bare RPC loop driven over stdin/stdout). The leader
spawns one worker per non-zero NCCL rank via tokio::process::Command
on the same binary path (production: /proc/self/exe; tests:
env!("CARGO_BIN_EXE_neuron")) and talks to each over newline-
delimited JSON.
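As a rough illustration of the newline-delimited framing described above, the sketch below writes one JSON request line and reads back one JSON response line. It is std-only and generic over `Write`/`BufRead` so it can run without a child process. The name `round_trip` is hypothetical (not taken from this commit), and the real leader uses tokio's async I/O rather than blocking std I/O.

```rust
use std::io::{self, BufRead, Write};

/// Send one request line and read one response line.
/// Hypothetical helper; the actual WorkerPool internals are not shown here.
fn round_trip<W: Write, R: BufRead>(
    to_worker: &mut W,
    from_worker: &mut R,
    request_json: &str,
) -> io::Result<String> {
    // One message per line: an embedded newline would corrupt the framing.
    if request_json.contains('\n') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "request must be a single line",
        ));
    }
    to_worker.write_all(request_json.as_bytes())?;
    to_worker.write_all(b"\n")?;
    // Flush so the worker sees the request before we block on its reply.
    to_worker.flush()?;

    let mut line = String::new();
    if from_worker.read_line(&mut line)? == 0 {
        // Zero bytes means the worker closed its stdout (likely exited).
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "worker closed stdout",
        ));
    }
    Ok(line
        .trim_end_matches(|c| c == '\n' || c == '\r')
        .to_string())
}
```

Treating EOF as an error rather than an empty reply lets the caller tell a dead worker apart from a malformed response.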
Protocol (harness/tp/rpc.rs) is serde-tagged from the start —
WorkerRequest::{Ping, Init, NcclSanityCheck, Shutdown} and
WorkerResponse::{Pong, InitOk, NcclSanityResult, Bye, Error}, both
`#[serde(tag = "op", rename_all = "snake_case")]`. Adding ops in 7b/7c
is purely additive; unknown ops on the wire fail to parse (verified
in unit tests).
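To make the wire shape concrete, here is a std-only sketch of what an internally tagged, snake_case request would look like on the wire. It hand-rolls the JSON instead of using serde so it stays dependency-free; the enum is a simplified mirror of `WorkerRequest`, and modelling `NcclSanityCheck` as a unit variant is an assumption, since its fields are not shown in this commit. `to_wire` and `known_op` are hypothetical names.

```rust
/// Simplified mirror of the request enum named in the commit message.
/// Only `comm_id` on Init is confirmed by the test file; everything
/// else about the payloads is an assumption.
#[derive(Debug, PartialEq)]
enum WorkerRequest {
    Ping,
    Init { comm_id: String },
    NcclSanityCheck,
    Shutdown,
}

impl WorkerRequest {
    /// The tag that `rename_all = "snake_case"` derives from the variant name.
    fn op(&self) -> &'static str {
        match self {
            WorkerRequest::Ping => "ping",
            WorkerRequest::Init { .. } => "init",
            WorkerRequest::NcclSanityCheck => "nccl_sanity_check",
            WorkerRequest::Shutdown => "shutdown",
        }
    }

    /// Hand-rolled stand-in for serde_json::to_string: tag first, then fields.
    /// Assumes comm_id is a hex string that needs no JSON escaping.
    fn to_wire(&self) -> String {
        match self {
            WorkerRequest::Init { comm_id } => {
                format!("{{\"op\":\"{}\",\"comm_id\":\"{}\"}}", self.op(), comm_id)
            }
            _ => format!("{{\"op\":\"{}\"}}", self.op()),
        }
    }
}

/// Unknown tags are rejected rather than silently ignored.
fn known_op(tag: &str) -> bool {
    matches!(tag, "ping" | "init" | "nccl_sanity_check" | "shutdown")
}
```

Because the tag lives in a single `op` field, adding a variant later only adds a new accepted tag; existing messages keep their shape.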

7a-i scope:
- WorkerPool::spawn(binary, world_size, devices) forks ranks 1..N as
  subprocesses, captures stdin/stdout, kills on drop.
- ping_all() round-trips a Ping to every worker and validates the
  returned rank.
- shutdown() sends Shutdown to each worker, awaits Bye, reaps.
- Worker mode: parse Ping/Shutdown, return Pong/Bye; Init and
  NcclSanityCheck return Error{kind="not_implemented_7a_i"} so a 7a-ii
  binary speaking the same wire is a drop-in replacement (the kind
  field signals "real NCCL lands in the next commit").
- CandleHarness::load_model refuses tensor_parallel > 1 with a clear
  message until 7b is in.
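The worker-mode behaviour in the list above can be sketched as a small blocking loop. This is a std-only approximation, not the code in this commit: `extract_op` is a naive stand-in for serde_json parsing that only handles compact JSON, the `bad_request` kind for unparseable input is invented for illustration, and any extra fields the real `Error` response carries are omitted because they are not shown here.

```rust
use std::io::{self, BufRead, Write};

/// Naive tag extraction standing in for real JSON parsing; it only
/// handles compact input such as {"op":"ping"}.
fn extract_op(line: &str) -> Option<&str> {
    let key = "\"op\":\"";
    let start = line.find(key)? + key.len();
    let end = line[start..].find('"')? + start;
    Some(&line[start..end])
}

/// Serve requests until Shutdown or EOF; returns how many were handled.
fn run_worker<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    rank: u32,
    world_size: u32,
    cuda_device: u32,
) -> io::Result<usize> {
    let mut handled = 0;
    for line in input.lines() {
        let line = line?;
        handled += 1;
        let (reply, done) = match extract_op(&line) {
            Some("ping") => (
                format!(
                    "{{\"op\":\"pong\",\"rank\":{},\"world_size\":{},\"cuda_device\":{}}}",
                    rank, world_size, cuda_device
                ),
                false,
            ),
            Some("shutdown") => ("{\"op\":\"bye\"}".to_string(), true),
            // Stubbed in 7a-i: report an explicit error instead of a silent no-op.
            Some("init") | Some("nccl_sanity_check") => (
                "{\"op\":\"error\",\"kind\":\"not_implemented_7a_i\"}".to_string(),
                false,
            ),
            // Hypothetical kind; the commit does not say how bad input is reported.
            _ => ("{\"op\":\"error\",\"kind\":\"bad_request\"}".to_string(), false),
        };
        writeln!(output, "{}", reply)?;
        // Flush per reply so the leader is never left waiting on a buffer.
        output.flush()?;
        if done {
            break;
        }
    }
    Ok(handled)
}
```

Returning an explicit error for the stubbed ops keeps "not implemented yet" distinguishable from "succeeded vacuously", which is the contract the Init test checks.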

Three integration tests in tests/tp_worker_lifecycle.rs: spawn/ping/
shutdown at world sizes 2 and 3 (one and two worker subprocesses,
since rank 0 stays in-process), plus the not_implemented_7a_i contract
test for Init. Seven rpc serde unit tests assert the wire shape (op
tags, field names, unknown-op rejection). All pass on the dev host;
no CUDA required.

Stage 7a-ii (next): the real NCCL Comm::from_rank wiring behind the
existing Init/NcclSanityCheck op surface, CUDA-gated. Verifiable on
beast's 2×5090.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
130
crates/neuron/tests/tp_worker_lifecycle.rs
Normal file
@@ -0,0 +1,130 @@
//! Stage 7a-i: confirm the TP worker subprocess lifecycle round-trips.
//!
//! Spawns two worker subprocesses via the leader→worker stdio RPC,
//! pings each, and cleanly shuts them down. No CUDA required —
//! `Init` and `NcclSanityCheck` are stubbed in 7a-i, so this test
//! runs on any host the workspace builds on.

use neuron::harness::tp::{WorkerPool, rpc::WorkerResponse};

/// Path to the neuron binary built by cargo for this test process.
/// cargo populates `CARGO_BIN_EXE_neuron` at compile time for sibling-
/// binary tests; production paths in main.rs use `/proc/self/exe`.
const NEURON_BIN: &str = env!("CARGO_BIN_EXE_neuron");

/// Two workers (so we spawn one subprocess: rank 0 is in-process,
/// rank 1 is the child). Verify the spawned worker responds to Ping
/// with its own identity, then shut it down cleanly.
#[tokio::test]
async fn test_spawn_ping_shutdown() {
    // cuda_devices: rank 0 → device 0 (leader, unused here),
    // rank 1 → device 1 (worker; not actually opened in 7a-i).
    let mut pool = WorkerPool::spawn(NEURON_BIN.as_ref(), 2, &[0, 1])
        .await
        .expect("spawn worker pool");

    let pongs = pool.ping_all().await.expect("ping all workers");
    assert_eq!(pongs.len(), 1, "expected one Pong (rank 1 only)");
    match &pongs[0] {
        WorkerResponse::Pong {
            rank,
            world_size,
            cuda_device,
        } => {
            assert_eq!(*rank, 1);
            assert_eq!(*world_size, 2);
            assert_eq!(*cuda_device, 1);
        }
        other => panic!("expected Pong, got {other:?}"),
    }

    pool.shutdown().await.expect("clean shutdown");
}

/// Three workers — exercise the loop in `ping_all` / `shutdown`.
#[tokio::test]
async fn test_spawn_three_workers() {
    let mut pool = WorkerPool::spawn(NEURON_BIN.as_ref(), 3, &[0, 1, 2])
        .await
        .expect("spawn worker pool");

    let pongs = pool.ping_all().await.expect("ping all workers");
    assert_eq!(pongs.len(), 2, "expected two Pongs (ranks 1 and 2)");
    for (i, resp) in pongs.iter().enumerate() {
        match resp {
            WorkerResponse::Pong {
                rank,
                world_size,
                cuda_device,
            } => {
                let expected_rank = (i + 1) as u32;
                assert_eq!(*rank, expected_rank);
                assert_eq!(*world_size, 3);
                assert_eq!(*cuda_device, expected_rank);
            }
            other => panic!("expected Pong, got {other:?}"),
        }
    }

    pool.shutdown().await.expect("clean shutdown");
}

/// 7a-i's Init/NcclSanityCheck handlers return an error rather than
/// silently no-op, so the leader can tell the difference between
/// "haven't implemented yet" and "succeeded vacuously". Confirm the
/// shape so 7a-ii's replacement is a drop-in (same wire op names).
#[tokio::test]
async fn test_init_returns_not_implemented_in_7a_i() {
    use neuron::harness::tp::rpc::WorkerRequest;
    use std::process::Stdio;
    use tokio::io::{AsyncBufReadExt, AsyncWriteExt, BufReader};
    use tokio::process::Command;

    // Spawn a single worker by hand to send Init directly (the pool's
    // public API doesn't expose Init yet — that lands in 7a-ii).
    let mut child = Command::new(NEURON_BIN)
        .arg("--worker")
        .arg("--rank")
        .arg("1")
        .arg("--tp-size")
        .arg("2")
        .arg("--cuda-device")
        .arg("1")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .stderr(Stdio::null())
        .kill_on_drop(true)
        .spawn()
        .expect("spawn worker");

    let mut stdin = child.stdin.take().expect("stdin");
    let stdout = child.stdout.take().expect("stdout");
    let mut lines = BufReader::new(stdout).lines();

    let req = WorkerRequest::Init {
        comm_id: "ff".repeat(128),
    };
    let mut payload = serde_json::to_string(&req).unwrap();
    payload.push('\n');
    stdin.write_all(payload.as_bytes()).await.unwrap();
    stdin.flush().await.unwrap();

    let reply = lines
        .next_line()
        .await
        .expect("read line")
        .expect("got line");
    let resp: WorkerResponse = serde_json::from_str(&reply).expect("parse reply");
    match resp {
        WorkerResponse::Error { kind, .. } => {
            assert_eq!(kind, "not_implemented_7a_i");
        }
        other => panic!("expected Error{{kind=not_implemented_7a_i}}, got {other:?}"),
    }

    // Clean shutdown.
    stdin.write_all(b"{\"op\":\"shutdown\"}\n").await.unwrap();
    stdin.flush().await.unwrap();
    let _ = lines.next_line().await; // Bye
    let _ = child.wait().await;
}