93421f48e2e21847143fa994b19fcf7211c3c1fb
4 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
93421f48e2
|
Stage 7b-ii: ColumnParallel + RowParallel sharded linear primitives
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 31s
CI / Clippy (push) Failing after 49s
build-prerelease / Build neuron-blackwell (push) Failing after 3m29s
build-prerelease / Build cortex binary (push) Successful in 4m41s
CI / Test (push) Successful in 5m6s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Failing after 5m1s
build-prerelease / Build neuron-ada (push) Failing after 4m53s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Adds harness/tp/sharded_linear.rs with ShardedLinear — a Megatron-LM
style sharded wrapper over candle_nn::Linear. Two constructors:
- load_column: splits the output dimension. Each rank holds rows
[r*out/N .. (r+1)*out/N] of the weight, plus its slice of the bias.
Forward = local matmul; output is naturally sharded; downstream
consumer either accepts the shard (next layer is column-parallel)
or merges via all-gather later.
- load_row: splits the input dimension. Each rank holds cols
[r*in/N .. (r+1)*in/N] of the weight; bias lives only on rank 0
so the post-all_reduce sum carries it exactly once. Forward
produces a partial output that the caller reduces via NCCL.
Both constructors bail with a clear error when divisibility doesn't
hold — the precondition mistral.rs's first qwen3-next-tp commit
made explicit. The path included in the error is the VarBuilder
prefix, so the operator sees exactly which projection failed
("column-parallel 'model.layers.0.self_attn.q_proj': out_features=...").
5 unit tests on CPU verify the math against an unsharded reference:
- column shard produces the expected slice of the full matmul
- row partials sum to the unsharded result
- row bias appears only on rank 0
- divisibility violations bail (column + row)
forward_with_comm() is stubbed for row-parallel (CUDA-only) — wiring
the actual cudarc::nccl all_reduce against candle's Tensor lands in
7b-iii alongside the model assembly, where the model holds the Comm
in scope. ColumnParallel's forward_with_comm just delegates to the
local matmul (no collective needed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
05e15f3597
|
Stage 7b-i: dense safetensors Qwen3 load path
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
CI / Format (push) Successful in 43s
build-prerelease / Resolve version stamps (push) Successful in 44s
CI / Clippy (push) Successful in 2m4s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
Adds the bf16/fp16 safetensors path alongside the existing GGUF quantized one. The harness now dispatches by ModelSpec.quant: - Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged). - None → safetensors dense (new). The dense path uses candle-transformers::models::qwen3::ModelForCausalLM verbatim, fed via VarBuilder::from_mmaped_safetensors over the files listed in `model.safetensors.index.json` (sharded layout) or the single `model.safetensors` fallback. dtype is bf16 to match the canonical Qwen3 HF distribution dtype. tokenizer.json is fetched from the same repo (no -GGUF suffix to strip). ModelArch gains a Qwen3Dense variant; the forward signature mirrors QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position logits), so run_inference / run_inference_streaming just add a parallel match arm — no shape changes downstream. This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on: because the source is dense safetensors that can be byte-sliced per rank, the TP work avoids the GGUF super-block alignment problem entirely. Vanilla GGUF inference keeps working unchanged. validate-neuron.sh learns the dense path: pass an empty third arg (quant) and the script omits the `quant` field from the load payload, triggering the dense dispatch. Example: script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B '' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
da068ded6d
|
Stage 7a-ii: real NCCL handshake behind the worker pool
Some checks failed
CI / Format (push) Failing after 38s
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Clippy (push) Successful in 2m18s
build-prerelease / Build neuron-blackwell (push) Failing after 3m33s
CI / Test (push) Successful in 4m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m31s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Failing after 4m19s
build-prerelease / Build neuron-ada (push) Failing after 4m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Wires cudarc::nccl into the TP worker lifecycle introduced in 7a-i.
With --features cuda the leader and its workers now establish a live
NCCL communicator end-to-end; without the feature the same code paths
return Error{kind="cuda_feature_not_enabled"} so a misconfigured
build is obvious instead of silently no-op.
NCCL state machine (harness/tp/nccl_state.rs) is shared between the
worker process and the leader's pool:
- generate_comm_id_hex() mints an Id::new() on the leader.
- NcclState::init parses 256 hex chars → [c_char; 128] → Id::uninit,
opens a CudaContext on the configured device, calls Comm::from_rank
with the supplied (rank, world_size, id). NCCL blocks until every
rank has joined.
- NcclState::sanity_check runs one all_reduce(1u32, Sum); the leader
asserts every rank reports observed_sum == world_size.
- NCCL handles serialised under Mutex; unsafe impl Send/Sync gates
the Comm across spawn_blocking boundaries (NCCL is move-safe; only
concurrent op issuance is unsafe).
WorkerPool::init_nccl orchestrates the rendezvous:
1. Write Init { comm_id } to every worker's stdin (no await yet).
2. Leader rank 0 calls its own Comm::from_rank in spawn_blocking,
concurrently with workers.
3. NCCL handshake completes for all ranks simultaneously.
4. Leader collects InitOk responses.
WorkerPool::nccl_sanity_check follows the same pattern over
all_reduce, validating world_size == observed_sum on every rank.
Worker.send_only / Worker.recv_only split out from the previous
monolithic Worker.request so the leader can interleave its own NCCL
work with the worker calls — required because NCCL blocks during
init.
Tests:
- 4 hex roundtrip unit tests for the wire encoding.
- The 7a-i "not implemented" expectation now reads
"cuda_feature_not_enabled" on the local dev box (no CUDA), or
accepts InitOk on a cuda-built test binary.
- New cuda-integration test in tp_worker_lifecycle_cuda.rs covers
the real init + sanity round-trip; gated on the cuda-integration
feature so default CI doesn't try to NCCL.
Verifiable on beast (2× RTX 5090):
cargo test -p neuron --features cuda-integration \
--test tp_worker_lifecycle_cuda
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
2a7ede0232
|
Stage 7a-i: TP worker lifecycle scaffolding
All checks were successful
CI / Format (push) Successful in 36s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m12s
CI / Test (push) Successful in 4m25s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m49s
build-prerelease / Build cortex binary (push) Successful in 4m22s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 5m9s
build-prerelease / Build neuron-ada (push) Successful in 4m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m38s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s
Leader → worker process plumbing for tensor parallelism. The neuron
binary picks up two modes: default (the existing daemon, axum + HTTP)
and `--worker` (a bare RPC loop driven over stdin/stdout). The leader
spawns one worker per non-zero NCCL rank via tokio::process::Command
on the same binary path (production: /proc/self/exe; tests:
env!("CARGO_BIN_EXE_neuron")) and talks to each over newline-
delimited JSON.
Protocol (harness/tp/rpc.rs) is serde-tagged from the start —
WorkerRequest::{Ping, Init, NcclSanityCheck, Shutdown} and
WorkerResponse::{Pong, InitOk, NcclSanityResult, Bye, Error}, both
`#[serde(tag = "op", rename_all = "snake_case")]`. Adding ops in 7b/7c
is purely additive; unknown ops on the wire fail to parse (verified
in unit tests).
7a-i scope:
- WorkerPool::spawn(binary, world_size, devices) forks ranks 1..N as
subprocesses, captures stdin/stdout, kills on drop.
- ping_all() round-trips a Ping to every worker and validates the
returned rank.
- shutdown() sends Shutdown to each worker, awaits Bye, reaps.
- Worker mode: parse Ping/Shutdown, return Pong/Bye; Init and
NcclSanityCheck return Error{kind="not_implemented_7a_i"} so a 7a-ii
binary speaking the same wire is a drop-in replacement (the kind
field signals "real NCCL lands in the next commit").
- CandleHarness::load_model refuses tensor_parallel > 1 with a clear
message until 7b is in.
Three integration tests in tests/tp_worker_lifecycle.rs cover spawn/
ping/shutdown for 2- and 3-worker pools, plus the
not_implemented_7a_i contract test for Init. Seven rpc serde unit
tests assert the wire shape (op tags, field names, unknown-op
rejection). All pass on the dev host; no CUDA required.
Stage 7a-ii (next): the real NCCL Comm::from_rank wiring behind the
existing Init/NcclSanityCheck op surface, CUDA-gated. Verifiable on
beast's 2×5090.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|