Reduce TP=2 Q6K cold-load time for Qwen3.6-27B (~5 min today) #1

Closed
opened 2026-05-27 10:31:42 +00:00 by grenade · 2 comments
Owner

Context

Cold-loading Qwen/Qwen3.6-27B with quant = "q6k" and tensor_parallel = 2 on beast currently takes ~5 minutes. The dominant phase is layer-by-layer Q6K in-situ quantization, single-threaded per rank: ~60–90 s per rank, run serially across 52 transformer layers plus the lm_head. Everything else is small change (subprocess spawn ~20 ms, NCCL init ~1 s, sanity check stubbed, mmap setup, disk read).

Cold-load latency matters because default_models in neuron.toml is what makes a host ready to serve after deploy.sh — and because a poisoned-thread restart pays the full cold-load cost again.

Findings from codebase analysis

  • candle_core::quantized::QTensor::quantize is not internally parallel. candle-core-0.10.2/src/quantized/k_quants.rs imports rayon but the Q6K from_float path (lines 1989–2059) is purely sequential per-block. Each 256-element block is independent — embarrassingly parallel — but the implementation processes them serially.
  • Layer loop is serial. tp_qwen3_5.rs:875–892 iterates 52 layers one at a time; no rayon::par_iter or tokio::spawn over the per-layer quantization step (MaybeQuantLinear::from_weight at tp_linear.rs:56–68).
  • lm_head is materialized and quantized on every rank. Both TpQwen3_5ForCausalLM::load() (tp_qwen3_5.rs:1003–1017) and worker.rs:298–306 call build_lm_head(cfg, vb, &base, quant). Workers then compute logits through it during forward and discard the result (worker.rs:383–386 comment). The Q6K encoding of the ~152K × 12288 ≈ 120M-weight lm_head on every worker is pure waste.
  • No per-phase wall-clock timing. log_construction_complete (tp_qwen3_5.rs:1154) brackets the whole load, but there's no per-phase elapsed_ms for NCCL init, mmap setup, each layer load, or lm_head. We're optimizing partly by inference.

Opportunities (ordered by impact-to-effort ratio)

1. Parallelize layer quantization within a rank via rayon::par_iter

Each layer's weight tensors are independent once sliced out of the mmap. Beast has many cores; on a multi-core host this should give 5–10× speedup over the 52 layers. Two implementation shapes:

  • Bigger payoff: pre-stage all per-layer byte slices off the mmap onto host buffers, then par_iter the quantization, then assemble the model on the worker thread.
  • Smaller payoff: build the model layer-by-layer but par_iter the quantization step within each layer's QKV/MLP set.

Likely the single biggest win.

2. Skip lm_head quantization on worker ranks

Workers don't need a quantized lm_head — they discard its output. Two graded approaches:

  • Cheap (~1-line edit): pass quant = None to build_lm_head on rank > 0 in tp_qwen3_5.rs:1052–1076. Keeps the matmul cost but eliminates the Q6K encoding on every worker. Saves ~2–3 s per worker.
  • Bigger: skip build_lm_head and the hidden.apply(&self.lm_head) entirely on rank > 0, returning a sentinel. Saves the matmul cost too. Requires touching the forward path.

3. Disk-cache post-ISQ weights

First cold load runs ISQ; subsequent loads memcpy a pre-quantized blob into VRAM. GGUF is exactly this, but the qwen3_5 (Qwen3-Next) arch has no upstream GGUF — hence the runtime ISQ. We could write our own per-rank .helexa-q6k cache keyed on (model_id, quant, rank, world_size, tp_topology). Significantly more code, but warm loads would drop to 30–60 s (just disk → VRAM).

4. Add per-phase timing instrumentation before optimizing more aggressively

Spend the half-hour to add tracing::info!(elapsed_ms = …) around: NCCL init, mmap setup, each layer load+quant, lm_head, post-construct. Without this we can't validate that #1 above captures ~80% vs ~30% of the total. Low-effort, high-information.

5. Upstream candle: rayon-ify Q6K from_float

par_chunks_mut(QK_K) over the block loop would give linear scaling. Benefits every model that uses Q6K ISQ. Local workaround: vendor candle and patch in place.

6. Skip log_vram per layer

tp_qwen3_5.rs:874, 892 queries the driver every layer for diagnostic VRAM. Gate behind cfg!(debug_assertions) or a per-load flag. Marginal.

Not worth pursuing

  • Shared mmap across ranks — OS page cache already absorbs the duplication; fd count is 2, not 200.
  • Subprocess spawn overhead — ~20 ms vs. minutes.
  • Parallel disk read across ranks — already happens.

Suggested sequence

  1. #4 first (instrument) so subsequent measurements are real.
  2. #2 cheap variant as the first one-line win.
  3. #1 (rayon over per-layer quantization) — the main lever.
  4. #3 (disk cache) for the long tail once everything else is exhausted.
  • crates/neuron/src/harness/tp/tp_qwen3_5.rs:875–892 — per-layer load loop
  • crates/neuron/src/harness/tp/tp_qwen3_5.rs:1052–1076build_lm_head
  • crates/neuron/src/harness/tp/tp_linear.rs:56–68MaybeQuantLinear::from_weight
  • crates/neuron/src/harness/tp/worker.rs:298–306, 383–386 — worker load + forward (logits discarded)
  • crates/neuron/src/harness/tp/mod.rs:236–296, 316–362, 487–556 — WorkerPool spawn, NCCL init, load_dense_shard
  • crates/neuron/src/harness/device_worker/dispatch.rs:594–673tp_load_shard_inner
## Context Cold-loading `Qwen/Qwen3.6-27B` with `quant = "q6k"` and `tensor_parallel = 2` on beast currently takes ~5 minutes. The dominant phase is **layer-by-layer Q6K in-situ quantization, single-threaded per rank**: ~60–90 s per rank, run serially across 52 transformer layers plus the lm_head. Everything else is small change (subprocess spawn ~20 ms, NCCL init ~1 s, sanity check stubbed, mmap setup, disk read). Cold-load latency matters because `default_models` in `neuron.toml` is what makes a host ready to serve after deploy.sh — and because a poisoned-thread restart pays the full cold-load cost again. ## Findings from codebase analysis - **`candle_core::quantized::QTensor::quantize` is not internally parallel.** `candle-core-0.10.2/src/quantized/k_quants.rs` imports `rayon` but the Q6K `from_float` path (lines 1989–2059) is purely sequential per-block. Each 256-element block is independent — embarrassingly parallel — but the implementation processes them serially. - **Layer loop is serial.** `tp_qwen3_5.rs:875–892` iterates 52 layers one at a time; no `rayon::par_iter` or `tokio::spawn` over the per-layer quantization step (`MaybeQuantLinear::from_weight` at `tp_linear.rs:56–68`). - **lm_head is materialized and quantized on every rank.** Both `TpQwen3_5ForCausalLM::load()` (tp_qwen3_5.rs:1003–1017) and `worker.rs:298–306` call `build_lm_head(cfg, vb, &base, quant)`. Workers then compute logits through it during forward and **discard the result** (`worker.rs:383–386` comment). The Q6K encoding of the ~152K × 12288 ≈ 120M-weight lm_head on every worker is pure waste. - **No per-phase wall-clock timing.** `log_construction_complete` (`tp_qwen3_5.rs:1154`) brackets the whole load, but there's no per-phase `elapsed_ms` for NCCL init, mmap setup, each layer load, or lm_head. We're optimizing partly by inference. ## Opportunities (ordered by impact-to-effort ratio) ### 1. Parallelize layer quantization within a rank via `rayon::par_iter` Each layer's weight tensors are independent once sliced out of the mmap. Beast has many cores; on a multi-core host this should give 5–10× speedup over the 52 layers. Two implementation shapes: - **Bigger payoff**: pre-stage all per-layer byte slices off the mmap onto host buffers, then `par_iter` the quantization, then assemble the model on the worker thread. - **Smaller payoff**: build the model layer-by-layer but `par_iter` the quantization step within each layer's QKV/MLP set. Likely the single biggest win. ### 2. Skip lm_head quantization on worker ranks Workers don't need a quantized lm_head — they discard its output. Two graded approaches: - **Cheap (~1-line edit)**: pass `quant = None` to `build_lm_head` on rank > 0 in `tp_qwen3_5.rs:1052–1076`. Keeps the matmul cost but eliminates the Q6K encoding on every worker. Saves ~2–3 s per worker. - **Bigger**: skip `build_lm_head` and the `hidden.apply(&self.lm_head)` entirely on rank > 0, returning a sentinel. Saves the matmul cost too. Requires touching the forward path. ### 3. Disk-cache post-ISQ weights First cold load runs ISQ; subsequent loads memcpy a pre-quantized blob into VRAM. GGUF is exactly this, but the qwen3_5 (Qwen3-Next) arch has no upstream GGUF — hence the runtime ISQ. We could write our own per-rank `.helexa-q6k` cache keyed on (model_id, quant, rank, world_size, tp_topology). Significantly more code, but warm loads would drop to 30–60 s (just disk → VRAM). ### 4. Add per-phase timing instrumentation **before** optimizing more aggressively Spend the half-hour to add `tracing::info!(elapsed_ms = …)` around: NCCL init, mmap setup, each layer load+quant, lm_head, post-construct. Without this we can't validate that #1 above captures ~80% vs ~30% of the total. Low-effort, high-information. ### 5. Upstream candle: rayon-ify Q6K `from_float` `par_chunks_mut(QK_K)` over the block loop would give linear scaling. Benefits every model that uses Q6K ISQ. Local workaround: vendor candle and patch in place. ### 6. Skip `log_vram` per layer `tp_qwen3_5.rs:874, 892` queries the driver every layer for diagnostic VRAM. Gate behind `cfg!(debug_assertions)` or a per-load flag. Marginal. ## Not worth pursuing - Shared mmap across ranks — OS page cache already absorbs the duplication; fd count is 2, not 200. - Subprocess spawn overhead — ~20 ms vs. minutes. - Parallel disk read across ranks — already happens. ## Suggested sequence 1. **#4 first** (instrument) so subsequent measurements are real. 2. **#2 cheap variant** as the first one-line win. 3. **#1** (rayon over per-layer quantization) — the main lever. 4. **#3** (disk cache) for the long tail once everything else is exhausted. ## Related code - `crates/neuron/src/harness/tp/tp_qwen3_5.rs:875–892` — per-layer load loop - `crates/neuron/src/harness/tp/tp_qwen3_5.rs:1052–1076` — `build_lm_head` - `crates/neuron/src/harness/tp/tp_linear.rs:56–68` — `MaybeQuantLinear::from_weight` - `crates/neuron/src/harness/tp/worker.rs:298–306, 383–386` — worker load + forward (logits discarded) - `crates/neuron/src/harness/tp/mod.rs:236–296, 316–362, 487–556` — WorkerPool spawn, NCCL init, load_dense_shard - `crates/neuron/src/harness/device_worker/dispatch.rs:594–673` — `tp_load_shard_inner`
Author
Owner

Still in scope, and elevated by the 2026-06-12 reframing (README positioning + the 7→8 milestone): cold-load latency is one of the operator-felt metrics the benchmark harness (#22) will measure and publish, and it is the recurring cost of #17-style auto-recovery (every poison→reload pays this in full). The parallelization findings in the body (per-block Q6K quantization is embarrassingly parallel; the 52-layer loop is serial) remain the right attack — CPU-side rayon work or a small candle patch, per the closure rationale on #2.

Still in scope, and elevated by the 2026-06-12 reframing (README positioning + the 7→8 milestone): cold-load latency is one of the operator-felt metrics the benchmark harness (#22) will measure and publish, and it is the recurring cost of #17-style auto-recovery (every poison→reload pays this in full). The parallelization findings in the body (per-block Q6K quantization is embarrassingly parallel; the 52-layer loop is serial) remain the right attack — CPU-side rayon work or a small candle patch, per the closure rationale on #2.
grenade added the p2-next label 2026-06-12 09:01:45 +00:00
Author
Owner

Closing numbers. Landed via #40 (parallel in-situ quantization + the per-phase timing this issue asked for first).

Before/after — Qwen3.6-27B Q6K TP=2 cold load on beast (service start → TP model loaded, journal-verified)

before (serial ISQ) after (#40)
cold load 221 s (23:07:11 → 23:10:52) 86 s (00:41:11 → 00:42:37) 2.6×

Phase breakdown after (from the new instrumentation): layer loop 79.0 s (uniform ~1.2 s/layer/rank × 64 layers, ranks in parallel), lm_head 5.2 s, everything else ~2 s. Every #17 auto-recovery now pays 86 s instead of 221 s.

What shipped

  • harness/tp/isq.rs: quantize_parallel — candle's per-block k-quant math (k_quants::GgmlType::from_float, public API, no fork) fanned across the rayon pool, byte-identical to QTensor::quantize (parity-tested for Q6K/Q5K/Q4K/Q8_0). Device↔host transfers stay on the context-owning thread per the worker discipline.
  • Per-layer debug + layer-loop/lm_head info timing on all ranks, so the next person optimizing this sees the phase split instead of inferring it.

Remaining lever (issue body item 3)

With ISQ compute now spread over 32 cores (~2–3 s total), the residual 79 s is dominated by reading the 54 GB bf16 safetensors (~700 MB/s effective with both ranks reading) plus slicing/upload. The post-ISQ disk cache idea would cut that to ~27 GB of pre-quantized blob per fleet-load and skip ISQ entirely (~30–40 s estimate) — worth a fresh issue only if 86 s still hurts in practice. Items 2 (lm_head on workers — now only ~5 s, and the quant=None variant costs ~1.5 GB VRAM per worker) and 6 (per-layer log_vram) are no longer worth their complexity at the new baseline.

Closing numbers. Landed via #40 (parallel in-situ quantization + the per-phase timing this issue asked for first). ## Before/after — Qwen3.6-27B Q6K TP=2 cold load on beast (service start → `TP model loaded`, journal-verified) | | before (serial ISQ) | after (#40) | | |---|---:|---:|---| | cold load | **221 s** (23:07:11 → 23:10:52) | **86 s** (00:41:11 → 00:42:37) | **2.6×** | Phase breakdown after (from the new instrumentation): layer loop **79.0 s** (uniform ~1.2 s/layer/rank × 64 layers, ranks in parallel), lm_head **5.2 s**, everything else ~2 s. Every #17 auto-recovery now pays 86 s instead of 221 s. ## What shipped - `harness/tp/isq.rs`: `quantize_parallel` — candle's per-block k-quant math (`k_quants::GgmlType::from_float`, public API, no fork) fanned across the rayon pool, **byte-identical** to `QTensor::quantize` (parity-tested for Q6K/Q5K/Q4K/Q8_0). Device↔host transfers stay on the context-owning thread per the worker discipline. - Per-layer debug + layer-loop/lm_head info timing on all ranks, so the next person optimizing this sees the phase split instead of inferring it. ## Remaining lever (issue body item 3) With ISQ compute now spread over 32 cores (~2–3 s total), the residual 79 s is dominated by reading the 54 GB bf16 safetensors (~700 MB/s effective with both ranks reading) plus slicing/upload. The post-ISQ disk cache idea would cut that to ~27 GB of pre-quantized blob per fleet-load and skip ISQ entirely (~30–40 s estimate) — worth a fresh issue only if 86 s still hurts in practice. Items 2 (lm_head on workers — now only ~5 s, and the `quant=None` variant costs ~1.5 GB VRAM per worker) and 6 (per-layer `log_vram`) are no longer worth their complexity at the new baseline.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: helexa/helexa#1