feat(tp): --tp-smoke CLI subcommand + remote validation script

Adds a one-shot diagnostic that exercises the lower half of the TP
stack — WorkerPool::spawn, init_nccl, nccl_sanity_check — in isolation
from model load and inference. Runs N-1 worker subprocesses (rank 0
stays in this process), joins them in an NCCL communicator on the
specified CUDA devices, all_reduces a sentinel 1u32 per rank, verifies
the observed_sum equals world_size on every rank, then shuts down.

Output is `status=ok` on stdout (plus key=value lines for tp_size and
cuda_devices) when every check passes, non-zero exit + tracing on
stderr otherwise. The smoke command is diagnostic-only and not exposed
through the daemon HTTP API.
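
As a rough illustration of the output contract described above, a caller could check the status line and read a key like this. This is only a sketch: the helper names `tp_smoke_ok` and `tp_smoke_field` are hypothetical, and the sample values are made up. Only `status=ok`, `tp_size`, and `cuda_devices` come from the description above.

```shell
# Hypothetical helpers; not part of the neuron CLI or of script/tp-smoke.sh.

# Succeeds only if the captured output contains an exact `status=ok` line.
tp_smoke_ok() {
  printf '%s\n' "$1" | grep -q '^status=ok$'
}

# Prints the value of the first `key=value` line whose key is $2.
# Prints nothing if the key is absent.
tp_smoke_field() {
  printf '%s\n' "$1" | while IFS= read -r line; do
    case "${line}" in
      "$2="*) printf '%s\n' "${line#*=}"; break ;;
    esac
  done
}

# Sample output with made-up values, shaped like the description above.
sample='tp_size=2
cuda_devices=0,1
status=ok'
```

Matching the whole line (`^status=ok$`) rather than a substring avoids a false pass if the same text ever shows up inside a longer log line.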

script/tp-smoke.sh wraps it with an ssh invocation against a fleet
host (default beast — the only host with 2 GPUs) and asserts the
status line, mirroring the validate-neuron.sh ergonomics.

This is step 1 of the TP test plan. A failure here means TP cannot
work on the host at all; step 2 (Stage 7b-iv) wires real model load
and inference through the same WorkerPool primitives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 19:40:25 +03:00
parent 96d8755245
commit 9b8bd146f6
2 changed files with 138 additions and 5 deletions

script/tp-smoke.sh Executable file

@@ -0,0 +1,60 @@
#!/usr/bin/env bash
#
# TP smoke test against a deployed neuron host.
#
# SSHes into the target host and runs `neuron --tp-smoke --tp-size N
# --cuda-devices ...` directly — no HTTP API involved. The smoke
# subcommand spawns N-1 worker subprocesses, joins them in an NCCL
# communicator, runs one AllReduce(Sum) of `1u32` across every rank, and
# verifies the observed sum equals world_size on every rank.
#
# This validates the lower half of the TP stack (NCCL + IPC topology +
# subprocess lifecycle) without touching model load, inference, or HTTP.
# A failure here means the host cannot run any TP model and there is no
# point debugging the higher layers.
#
# Usage:
# script/tp-smoke.sh [host] [tp_size] [cuda_devices]
#
# Defaults:
# host = beast.hanzalova.internal (only fleet host with 2 GPUs)
# tp_size = 2
# cuda_devices = 0,1
set -euo pipefail
HOST="${1:-beast.hanzalova.internal}"
TP_SIZE="${2:-2}"
CUDA_DEVICES="${3:-0,1}"
say() { printf '[%s] %s\n' "${HOST}" "$*" >&2; }
die() { say "FAIL: $*"; exit 1; }
say "running neuron --tp-smoke --tp-size ${TP_SIZE} --cuda-devices ${CUDA_DEVICES}"
# Run as root via sudo because:
#   - CUDA contexts under a user account require either the nvidia
#     uvm/peer devices to be world-readable or the user to be in a
#     privileged group (neither is true on stock fc43);
#   - the installed binary lives at /usr/bin/neuron with no setuid.
# Running through root is the simplest path that matches how
# systemd-managed neuron sees the GPUs in production.
#
# The smoke command is read-only — it allocates a transient NCCL comm
# and a 1u32 buffer per rank, then tears it all down.
if ! ssh -o BatchMode=yes "${HOST}" \
    sudo /usr/bin/neuron \
    --tp-smoke \
    --tp-size "${TP_SIZE}" \
    --cuda-devices "${CUDA_DEVICES}" 2>&1 | tee /tmp/tp-smoke-"${HOST}".log
then
die "tp-smoke exited non-zero (see /tmp/tp-smoke-${HOST}.log)"
fi
# Final stdout line is `status=ok` on success.
if grep -q '^status=ok$' /tmp/tp-smoke-"${HOST}".log; then
  say "PASS — NCCL handshake + AllReduce sanity check OK across ${TP_SIZE} ranks"
  exit 0
else
  die "no status=ok line in output"
fi
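
A wrapper like this could also reject mismatched arguments before opening the ssh session. The sketch below assumes that `--cuda-devices` takes a comma-separated list of numeric device indices with exactly one entry per rank; the defaults (`tp_size = 2`, `cuda_devices = 0,1`) suggest this, but the commit does not state it. The helper name `args_consistent` is hypothetical and is not part of the script.

```shell
# Hypothetical pre-flight helper; not part of script/tp-smoke.sh.
# Assumption: one numeric CUDA device index per rank, comma-separated.

# Succeeds if $2 is a comma-separated list of exactly $1 numeric entries.
args_consistent() {
  tp_size="$1"
  devices="$2"
  # Total number of comma-separated entries (empty entries included).
  total=$(printf '%s\n' "${devices}" | tr ',' '\n' | grep -c '' || true)
  # Number of entries that are plain non-negative integers.
  numeric=$(printf '%s\n' "${devices}" | tr ',' '\n' | grep -c '^[0-9][0-9]*$' || true)
  [ "${total}" -eq "${numeric}" ] && [ "${numeric}" -eq "${tp_size}" ]
}
```

Possible usage near the top of the script, before the ssh call: `args_consistent "${TP_SIZE}" "${CUDA_DEVICES}" || die "tp_size does not match cuda_devices"`.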