feat(tp): Stage 7b-iv — RPC + orchestration for TP load/inference

Wires the in-flight TP machinery (Stage 7a workers, 7b-iii sharded Qwen3)
end to end so a non-streaming chat completion can run across multiple
GPUs via NCCL.

RPC additions (tp/rpc.rs):
- LoadDenseShard{model_id, config_json, safetensors_paths}
- GenerateStep{model_id, tokens, offset}
- ClearKvCache{model_id}
- UnloadModel{model_id}
- LoadDenseShardOk / GenerateStepOk / KvCacheCleared / Unloaded

Worker side (tp/worker.rs):
- WorkerState gains a `models: HashMap<String, TpQwen3ForCausalLM>`
  keyed by model_id. LoadDenseShard mmaps safetensors via
  ShardedVarBuilder (only this rank's slice materialises), builds the TP
  model with the rank's NCCL Comm cloned from NcclState.
- GenerateStep runs the rank-local forward; the resulting logits are
  dropped (only the leader's are used for sampling). The forward's value
  here is the NCCL collectives inside the row-parallel layers letting
  the leader's rank-0 forward make progress.

Pool side (tp/mod.rs):
- WorkerPool::load_dense_shard fans LoadDenseShard out to every worker,
  builds rank 0's shard on the leader via spawn_blocking with a fresh
  SendComm wrapper at the move boundary (Comm is !Send at the type
  level), collects per-rank LoadDenseShardOk. Returns the leader's
  Arc<Mutex<TpQwen3ForCausalLM>>.
- WorkerPool::generate_step fans GenerateStep out, runs the leader's
  rank-0 forward in spawn_blocking (the AllReduce CustomOps inside
  row-parallel layers block until every worker issues the matching
  collective), returns the leader's last-position logits Tensor.
- WorkerPool::clear_kv_cache + unload_model follow the same pattern.

NcclState refactor (tp/nccl_state.rs):
- comm field becomes Option<Arc<Comm>> (was Option<Comm>) so callers can
  share a clone with TpQwen3ForCausalLM::load.
- new `comm()` accessor + `SendComm` wrapper for spawn_blocking moves.
- single allow(clippy::arc_with_non_send_sync) at the canonical
  construction site (Comm is !Send by type but the runtime invariant is
  enforced by SendComm + the pool's Mutex).

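
The SendComm idea above (wrap a !Send handle so it can cross a thread
boundary while a separate invariant guarantees single-threaded use) can
be sketched roughly as follows. This is not the code in
tp/nccl_state.rs: `FakeComm`, `into_inner` and
`rank_on_blocking_thread` are hypothetical stand-ins, and
`std::thread::spawn` stands in for `spawn_blocking` so the sketch needs
only the standard library.

```rust
use std::marker::PhantomData;
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for an NCCL communicator. The raw-pointer marker
/// makes it !Send and !Sync, as a type wrapping a C handle would be.
struct FakeComm {
    rank: usize,
    _not_send: PhantomData<*const ()>,
}

/// Wrapper that asserts "moving this handle to another thread is fine".
struct SendComm(Arc<FakeComm>);

// SAFETY (assumption of this sketch): the wrapped handle is only ever used
// by one thread at a time. The commit message attributes that invariant to
// SendComm plus the pool's Mutex; here the Arc simply has a single owner.
unsafe impl Send for SendComm {}

impl SendComm {
    /// Unwrap on the destination thread.
    fn into_inner(self) -> Arc<FakeComm> {
        self.0
    }
}

/// Builds a communicator, moves it to another thread, and reads its rank there.
fn rank_on_blocking_thread(rank: usize) -> usize {
    // Clippy flags an Arc around a !Send + !Sync value; allow it at the one
    // construction site, as the commit message describes.
    #[allow(clippy::arc_with_non_send_sync)]
    let comm = Arc::new(FakeComm { rank, _not_send: PhantomData });
    let wrapped = SendComm(comm);

    let handle = thread::spawn(move || {
        // Call a method on the wrapper so the closure captures the whole
        // SendComm. Writing `wrapped.0` here would, under Rust 2021 disjoint
        // captures, capture only the !Send field and fail to compile.
        let comm = wrapped.into_inner();
        comm.rank
    });
    handle.join().expect("worker thread panicked")
}
```

Wrapping at the move boundary keeps the `unsafe impl Send` confined to
one small type instead of spreading it over the model or the pool.
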
Harness side (candle.rs):
- LoadedHandle enum (Single | Tp) replaces the bare Arc<LoadedModel> in
  the harness's registry. list_models / unload_model /
  inference_endpoint walk the enum uniformly.
- TpLoadedModel holds the pool + leader_model + tokenizer + devices.
- load_model dispatches on `spec.tensor_parallel > 1` to a new
  cuda-gated load_tp path: resolve dense files via hf-hub, spawn the
  pool, init_nccl, load_dense_shard.
- chat_completion branches on the handle variant. The TP path mirrors
  run_inference: clear_kv_cache, prefill, sample, decode loop,
  detokenize. Acquires the pool Mutex for the whole request.
- Streaming through TP is deferred to Stage 7c (returns Other(err)).

Script (script/validate-neuron.sh):
- 4th positional arg `tp_size` (default 1). When >1, switches to the
  dense path (TP and GGUF are mutually exclusive, so the script bails if
  a quant is also given) and adds `tensor_parallel` + `devices` to the
  load payload. NEURON_DEVICES env overrides the default 0..N-1 device
  list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
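
For reference, the script rules described above (NEURON_DEVICES
overrides the default 0..N-1 device list; tp_size > 1 combined with a
quant is rejected) can be expressed as a small Rust sketch. The helper
names `resolve_devices` and `check_tp_quant` are hypothetical and do not
appear in this commit; the script itself implements the same logic with
jq and bash.

```rust
/// Hypothetical helper: pick the per-rank CUDA device indices.
/// A non-empty override such as "0,1,2" is used as given; otherwise the
/// list defaults to 0..tp_size.
fn resolve_devices(env_override: Option<&str>, tp_size: usize) -> Result<Vec<usize>, String> {
    match env_override {
        Some(list) if !list.is_empty() => list
            .split(',')
            // Trimming whitespace is a convenience of this sketch.
            .map(|s| {
                s.trim()
                    .parse::<usize>()
                    .map_err(|e| format!("bad device index {s:?}: {e}"))
            })
            .collect(),
        _ => Ok((0..tp_size).collect()),
    }
}

/// Hypothetical helper: TP needs dense safetensors, so a non-empty quant
/// together with tp_size > 1 is an error.
fn check_tp_quant(tp_size: usize, quant: Option<&str>) -> Result<(), String> {
    if tp_size > 1 && quant.map_or(false, |q| !q.is_empty()) {
        return Err("tp_size>1 requires dense safetensors".to_string());
    }
    Ok(())
}
```

With these, `resolve_devices(None, 2)` gives devices 0 and 1, while
`resolve_devices(Some("2,3"), 2)` honours the override.
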
2026-05-20 06:38:33 +03:00
parent 9b8bd146f6
commit d46d8d4f6c
6 changed files with 960 additions and 40 deletions
--- a/script/validate-neuron.sh
+++ b/script/validate-neuron.sh
@@ -9,14 +9,15 @@
 # after pushing new neuron builds.
 #
 # Usage:
-#   script/validate-neuron.sh [host] [model_id] [quant]
+#   script/validate-neuron.sh [host] [model_id] [quant] [tp_size]
 #
 # Defaults:
 #   host     = beast.hanzalova.internal
 #   model_id = unsloth/Qwen3-0.6B-GGUF  (official Qwen3-*-GGUF repos
 #              ship Q8_0 only; unsloth's mirror ships the full Q-spectrum
 #              including Q4_K_M)
-#   quant    = Q4_K_M
+#   quant    = Q4_K_M  (empty = dense safetensors path)
+#   tp_size  = unset   (= 1 = single-GPU; pass 2 to drive the TP path)

 set -euo pipefail

@@ -25,6 +26,11 @@ MODEL_ID="${2:-unsloth/Qwen3-0.6B-GGUF}"
 # `${3-Q4_K_M}` (no colon) only uses the default when the arg is
 # UNSET — passing an explicit empty string drives the dense path.
 QUANT="${3-Q4_K_M}"
+# tp_size > 1 forces the dense path (TP requires safetensors) and adds
+# `tensor_parallel: N` to the load payload. The harness picks device
+# indices 0..N-1 by default; override by passing NEURON_DEVICES="0,1,..."
+# in the environment.
+TP_SIZE="${4-1}"
 PORT="${NEURON_PORT:-13131}"
 BASE="http://${HOST}:${PORT}"

@@ -69,21 +75,43 @@ is_loaded() {
 }

 trigger_load() {
-    say "POST /models/load ${MODEL_ID} (quant=${QUANT:-<dense>}, device=[0])"
+    # Build the per-rank CUDA device list as a JSON array. Either
+    # honour NEURON_DEVICES (`0,1,2`) verbatim or default to
+    # `[0, 1, ..., tp_size - 1]`.
+    local devices_json
+    if [[ -n "${NEURON_DEVICES:-}" ]]; then
+        devices_json=$(jq -n -c --arg s "${NEURON_DEVICES}" \
+            '$s | split(",") | map(tonumber)')
+    else
+        devices_json=$(jq -n -c --argjson n "${TP_SIZE}" '[range(0; $n)]')
+    fi
+    say "POST /models/load ${MODEL_ID} (quant=${QUANT:-<dense>}, tp=${TP_SIZE}, devices=${devices_json})"
    say "  (synchronous; may take a minute on first run while HF downloads)"
-    # Build the payload via jq so the optional `quant` field is
-    # omitted entirely when empty — that's the signal to the harness
-    # to take the dense safetensors load path rather than GGUF.
+    if (( TP_SIZE > 1 )) && [[ -n "${QUANT}" ]]; then
+        die "tp_size>1 requires dense safetensors — pass quant='' as the 3rd argument"
+    fi
+    # Build the payload via jq so the optional `quant` and
+    # `tensor_parallel` fields are omitted entirely when not in use —
+    # that's how the harness tells dense from quantized and single-GPU
+    # from TP.
    local payload
-    if [[ -z "${QUANT}" ]]; then
+    if [[ -z "${QUANT}" ]] && (( TP_SIZE > 1 )); then
        payload=$(jq -n -c \
            --arg id "${MODEL_ID}" \
-            '{model_id: $id, harness: "candle", devices: [0]}')
+            --argjson tp "${TP_SIZE}" \
+            --argjson devices "${devices_json}" \
+            '{model_id: $id, harness: "candle", tensor_parallel: $tp, devices: $devices}')
+    elif [[ -z "${QUANT}" ]]; then
+        payload=$(jq -n -c \
+            --arg id "${MODEL_ID}" \
+            --argjson devices "${devices_json}" \
+            '{model_id: $id, harness: "candle", devices: $devices}')
    else
        payload=$(jq -n -c \
            --arg id "${MODEL_ID}" \
            --arg q "${QUANT}" \
-            '{model_id: $id, harness: "candle", quant: $q, devices: [0]}')
+            --argjson devices "${devices_json}" \
+            '{model_id: $id, harness: "candle", quant: $q, devices: $devices}')
    fi
    # --write-out captures the response code on a separate line so we
    # can surface a real diagnostic instead of relying on --fail.