feat(tp): Stage 7b-iv — RPC + orchestration for TP load/inference

Wires the in-flight TP machinery (Stage 7a workers, 7b-iii sharded
Qwen3) end to end so a non-streaming chat completion can run across
multiple GPUs via NCCL.

RPC additions (tp/rpc.rs):
- LoadDenseShard{model_id, config_json, safetensors_paths}
- GenerateStep{model_id, tokens, offset}
- ClearKvCache{model_id}
- UnloadModel{model_id}
- Responses: LoadDenseShardOk / GenerateStepOk / KvCacheCleared /
  Unloaded

Worker side (tp/worker.rs):
- WorkerState gains a `models: HashMap<String, TpQwen3ForCausalLM>`
  keyed by model_id. LoadDenseShard mmaps the safetensors files via
  ShardedVarBuilder (only this rank's slice materialises), then builds
  the TP model with the rank's NCCL Comm cloned from NcclState.
- GenerateStep runs the rank-local forward; the resulting logits are
  dropped (only the leader's are used for sampling). The forward still
  has to run because it issues the NCCL collectives inside the
  row-parallel layers, which is what lets the leader's rank-0 forward
  make progress.

Pool side (tp/mod.rs):
- WorkerPool::load_dense_shard fans LoadDenseShard out to every worker,
  builds rank 0's shard on the leader via spawn_blocking with a fresh
  SendComm wrapper at the move boundary (Comm is !Send at the type
  level), collects per-rank LoadDenseShardOk. Returns the leader's
  Arc<Mutex<TpQwen3ForCausalLM>>.
- WorkerPool::generate_step fans GenerateStep out, runs the leader's
  rank-0 forward in spawn_blocking (the AllReduce CustomOps inside
  row-parallel layers block until every worker issues the matching
  collective), returns the leader's last-position logits Tensor.
- WorkerPool::clear_kv_cache + unload_model follow the same pattern.

NcclState refactor (tp/nccl_state.rs):
- comm field becomes Option<Arc<Comm>> (was Option<Comm>) so callers
  can share a clone with TpQwen3ForCausalLM::load.
- new `comm()` accessor + `SendComm` wrapper for spawn_blocking moves.
- single allow(clippy::arc_with_non_send_sync) at the canonical
  construction site (Comm is !Send by type but the runtime invariant
  is enforced by SendComm + the pool's Mutex).

Harness side (candle.rs):
- LoadedHandle enum (Single | Tp) replaces the bare Arc<LoadedModel>
  in the harness's registry. list_models / unload_model /
  inference_endpoint walk the enum uniformly.
- TpLoadedModel holds the pool + leader_model + tokenizer + devices.
- load_model dispatches on `spec.tensor_parallel > 1` to a new
  cuda-gated load_tp path: resolve dense files via hf-hub, spawn the
  pool, init_nccl, load_dense_shard.
- chat_completion branches on the handle variant. The TP path mirrors
  run_inference: clear_kv_cache, prefill, sample, decode loop,
  detokenize. Acquires the pool Mutex for the whole request.
- Streaming through TP is deferred to Stage 7c (for now the TP path
  returns Other(err) on streaming requests).

Script (script/validate-neuron.sh):
- 4th positional arg `tp_size` (default 1). When >1, the dense path is
  required: TP and GGUF are mutually exclusive, so the script bails if
  `quant` is non-empty. A TP load adds `tensor_parallel` + `devices` to
  the payload. NEURON_DEVICES env overrides the default 0..N-1 device
  list.
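
The default device list described above can be sketched without jq. This is a minimal illustration, not the script's own code: `devices_json` is a hypothetical helper that takes the TP size and an optional comma-separated override, and it assumes the override is already well formed (no validation is done).

```shell
#!/bin/sh
# Hypothetical helper: print the device list as a compact JSON array.
#   $1 = tp_size (number of ranks)
#   $2 = optional comma-separated override, e.g. "3,5"
# The body runs in a subshell ( ... ) so `i` and `out` do not leak.
devices_json() (
    if [ -n "${2:-}" ]; then
        # Override given: wrap it verbatim (assumed to look like "a,b,c").
        printf '[%s]\n' "$2"
    else
        i=0
        out=""
        while [ "$i" -lt "$1" ]; do
            out="${out:+$out,}$i"   # join with commas, no leading comma
            i=$((i + 1))
        done
        printf '[%s]\n' "$out"
    fi
)

devices_json 2          # prints [0,1]
devices_json 2 "3,5"    # prints [3,5]
```

The second call mirrors setting NEURON_DEVICES: when an override is present, the TP size no longer determines which indices are used.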

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 06:38:33 +03:00
parent 9b8bd146f6
commit d46d8d4f6c
6 changed files with 960 additions and 40 deletions

@@ -9,14 +9,15 @@
 # after pushing new neuron builds.
 #
 # Usage:
-# script/validate-neuron.sh [host] [model_id] [quant]
+# script/validate-neuron.sh [host] [model_id] [quant] [tp_size]
 #
 # Defaults:
 # host = beast.hanzalova.internal
 # model_id = unsloth/Qwen3-0.6B-GGUF (official Qwen3-*-GGUF repos
 # ship Q8_0 only; unsloth's mirror ships the full Q-spectrum
 # including Q4_K_M)
-# quant = Q4_K_M
+# quant = Q4_K_M (empty = dense safetensors path)
+# tp_size = unset (= 1 = single-GPU; pass 2 to drive the TP path)
 set -euo pipefail
@@ -25,6 +26,11 @@ MODEL_ID="${2:-unsloth/Qwen3-0.6B-GGUF}"
 # `${3-Q4_K_M}` (no colon) only uses the default when the arg is
 # UNSET — passing an explicit empty string drives the dense path.
 QUANT="${3-Q4_K_M}"
+# tp_size > 1 forces the dense path (TP requires safetensors) and adds
+# `tensor_parallel: N` to the load payload. The harness picks device
+# indices 0..N-1 by default; override by passing NEURON_DEVICES="0,1,..."
+# in the environment.
+TP_SIZE="${4-1}"
 PORT="${NEURON_PORT:-13131}"
 BASE="http://${HOST}:${PORT}"
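
The comment in this hunk relies on the difference between the `${N-default}` and `${N:-default}` expansions. A minimal sketch of that difference follows; the two function names are illustrative only and do not appear in the script.

```shell
#!/bin/sh
# Illustrative helpers, not part of validate-neuron.sh.
#   ${1-X}  : use X only when $1 is unset.
#   ${1:-X} : use X when $1 is unset OR empty.
quant_no_colon() { printf '%s\n' "${1-Q4_K_M}"; }
quant_colon()    { printf '%s\n' "${1:-Q4_K_M}"; }

quant_no_colon          # argument unset: prints Q4_K_M
quant_no_colon ""       # argument empty: prints an empty line (dense path)
quant_colon ""          # argument empty: prints Q4_K_M
```

The same rule applies to `TP_SIZE="${4-1}"`: an explicitly empty fourth argument leaves `TP_SIZE` empty rather than 1, so omitting the argument is the way to get the single-GPU default.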
@@ -69,21 +75,43 @@ is_loaded() {
 }
 trigger_load() {
-    say "POST /models/load ${MODEL_ID} (quant=${QUANT:-<dense>}, device=[0])"
+    # Build the per-rank CUDA device list as a JSON array. Either
+    # honour NEURON_DEVICES (`0,1,2`) verbatim or default to
+    # `[0, 1, ..., tp_size - 1]`.
+    local devices_json
+    if [[ -n "${NEURON_DEVICES:-}" ]]; then
+        devices_json=$(jq -n -c --arg s "${NEURON_DEVICES}" \
+            '$s | split(",") | map(tonumber)')
+    else
+        devices_json=$(jq -n -c --argjson n "${TP_SIZE}" '[range(0; $n)]')
+    fi
+    say "POST /models/load ${MODEL_ID} (quant=${QUANT:-<dense>}, tp=${TP_SIZE}, devices=${devices_json})"
     say " (synchronous; may take a minute on first run while HF downloads)"
-    # Build the payload via jq so the optional `quant` field is
-    # omitted entirely when empty — that's the signal to the harness
-    # to take the dense safetensors load path rather than GGUF.
+    if (( TP_SIZE > 1 )) && [[ -n "${QUANT}" ]]; then
+        die "tp_size>1 requires dense safetensors — pass quant='' as the 3rd argument"
+    fi
+    # Build the payload via jq so the optional `quant` and
+    # `tensor_parallel` fields are omitted entirely when not in use —
+    # that's how the harness tells dense from quantized and single-GPU
+    # from TP.
     local payload
-    if [[ -z "${QUANT}" ]]; then
+    if [[ -z "${QUANT}" ]] && (( TP_SIZE > 1 )); then
         payload=$(jq -n -c \
             --arg id "${MODEL_ID}" \
-            '{model_id: $id, harness: "candle", devices: [0]}')
+            --argjson tp "${TP_SIZE}" \
+            --argjson devices "${devices_json}" \
+            '{model_id: $id, harness: "candle", tensor_parallel: $tp, devices: $devices}')
+    elif [[ -z "${QUANT}" ]]; then
+        payload=$(jq -n -c \
+            --arg id "${MODEL_ID}" \
+            --argjson devices "${devices_json}" \
+            '{model_id: $id, harness: "candle", devices: $devices}')
     else
         payload=$(jq -n -c \
             --arg id "${MODEL_ID}" \
             --arg q "${QUANT}" \
-            '{model_id: $id, harness: "candle", quant: $q, devices: [0]}')
+            --argjson devices "${devices_json}" \
+            '{model_id: $id, harness: "candle", quant: $q, devices: $devices}')
     fi
     # --write-out captures the response code on a separate line so we
     # can surface a real diagnostic instead of relying on --fail.
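
The three-way payload dispatch in `trigger_load` can be summarised as a small decision function. This is a sketch for illustration only: `payload_kind` and the labels it prints are hypothetical names, and it classifies the inputs rather than building the JSON payload.

```shell
#!/bin/sh
# Hypothetical helper: classify a (quant, tp_size) pair the way the
# guard and the if/elif/else in trigger_load do.
#   $1 = quant (empty string means dense safetensors)
#   $2 = tp_size
payload_kind() {
    # TP with a quant set is rejected up front, like the `die` guard.
    if [ "$2" -gt 1 ] && [ -n "$1" ]; then
        echo "error: tp_size>1 requires dense safetensors" >&2
        return 1
    fi
    if [ -z "$1" ] && [ "$2" -gt 1 ]; then
        echo "dense-tp"        # payload carries tensor_parallel + devices
    elif [ -z "$1" ]; then
        echo "dense-single"    # payload carries devices only
    else
        echo "quantized"       # payload carries quant + devices
    fi
}

payload_kind "" 2        # prints dense-tp
payload_kind "" 1        # prints dense-single
payload_kind Q4_K_M 1    # prints quantized
```

Because the guard runs first, the quantized branch is only ever reached with a TP size of 1, which is why it needs no `tensor_parallel` field.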