Reduce TP=2 Q6K cold-load time for Qwen3.6-27B (~5 min today) #1
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Context
Cold-loading
Qwen/Qwen3.6-27Bwithquant = "q6k"andtensor_parallel = 2on beast currently takes ~5 minutes. The dominant phase is layer-by-layer Q6K in-situ quantization, single-threaded per rank: ~60–90 s per rank, run serially across 52 transformer layers plus the lm_head. Everything else is small change (subprocess spawn ~20 ms, NCCL init ~1 s, sanity check stubbed, mmap setup, disk read).Cold-load latency matters because
default_modelsinneuron.tomlis what makes a host ready to serve after deploy.sh — and because a poisoned-thread restart pays the full cold-load cost again.Findings from codebase analysis
candle_core::quantized::QTensor::quantizeis not internally parallel.candle-core-0.10.2/src/quantized/k_quants.rsimportsrayonbut the Q6Kfrom_floatpath (lines 1989–2059) is purely sequential per-block. Each 256-element block is independent — embarrassingly parallel — but the implementation processes them serially.tp_qwen3_5.rs:875–892iterates 52 layers one at a time; norayon::par_iterortokio::spawnover the per-layer quantization step (MaybeQuantLinear::from_weightattp_linear.rs:56–68).TpQwen3_5ForCausalLM::load()(tp_qwen3_5.rs:1003–1017) andworker.rs:298–306callbuild_lm_head(cfg, vb, &base, quant). Workers then compute logits through it during forward and discard the result (worker.rs:383–386comment). The Q6K encoding of the ~152K × 12288 ≈ 120M-weight lm_head on every worker is pure waste.log_construction_complete(tp_qwen3_5.rs:1154) brackets the whole load, but there's no per-phaseelapsed_msfor NCCL init, mmap setup, each layer load, or lm_head. We're optimizing partly by inference.Opportunities (ordered by impact-to-effort ratio)
1. Parallelize layer quantization within a rank via
rayon::par_iterEach layer's weight tensors are independent once sliced out of the mmap. Beast has many cores; on a multi-core host this should give 5–10× speedup over the 52 layers. Two implementation shapes:
par_iterthe quantization, then assemble the model on the worker thread.par_iterthe quantization step within each layer's QKV/MLP set.Likely the single biggest win.
2. Skip lm_head quantization on worker ranks
Workers don't need a quantized lm_head — they discard its output. Two graded approaches:
quant = Nonetobuild_lm_headon rank > 0 intp_qwen3_5.rs:1052–1076. Keeps the matmul cost but eliminates the Q6K encoding on every worker. Saves ~2–3 s per worker.build_lm_headand thehidden.apply(&self.lm_head)entirely on rank > 0, returning a sentinel. Saves the matmul cost too. Requires touching the forward path.3. Disk-cache post-ISQ weights
First cold load runs ISQ; subsequent loads memcpy a pre-quantized blob into VRAM. GGUF is exactly this, but the qwen3_5 (Qwen3-Next) arch has no upstream GGUF — hence the runtime ISQ. We could write our own per-rank
.helexa-q6kcache keyed on (model_id, quant, rank, world_size, tp_topology). Significantly more code, but warm loads would drop to 30–60 s (just disk → VRAM).4. Add per-phase timing instrumentation before optimizing more aggressively
Spend the half-hour to add
tracing::info!(elapsed_ms = …)around: NCCL init, mmap setup, each layer load+quant, lm_head, post-construct. Without this we can't validate that #1 above captures ~80% vs ~30% of the total. Low-effort, high-information.5. Upstream candle: rayon-ify Q6K
from_floatpar_chunks_mut(QK_K)over the block loop would give linear scaling. Benefits every model that uses Q6K ISQ. Local workaround: vendor candle and patch in place.6. Skip
log_vramper layertp_qwen3_5.rs:874, 892queries the driver every layer for diagnostic VRAM. Gate behindcfg!(debug_assertions)or a per-load flag. Marginal.Not worth pursuing
Suggested sequence
Related code
crates/neuron/src/harness/tp/tp_qwen3_5.rs:875–892— per-layer load loopcrates/neuron/src/harness/tp/tp_qwen3_5.rs:1052–1076—build_lm_headcrates/neuron/src/harness/tp/tp_linear.rs:56–68—MaybeQuantLinear::from_weightcrates/neuron/src/harness/tp/worker.rs:298–306, 383–386— worker load + forward (logits discarded)crates/neuron/src/harness/tp/mod.rs:236–296, 316–362, 487–556— WorkerPool spawn, NCCL init, load_dense_shardcrates/neuron/src/harness/device_worker/dispatch.rs:594–673—tp_load_shard_innerStill in scope, and elevated by the 2026-06-12 reframing (README positioning + the 7→8 milestone): cold-load latency is one of the operator-felt metrics the benchmark harness (#22) will measure and publish, and it is the recurring cost of #17-style auto-recovery (every poison→reload pays this in full). The parallelization findings in the body (per-block Q6K quantization is embarrassingly parallel; the 52-layer loop is serial) remain the right attack — CPU-side rayon work or a small candle patch, per the closure rationale on #2.
Closing numbers. Landed via #40 (parallel in-situ quantization + the per-phase timing this issue asked for first).
Before/after — Qwen3.6-27B Q6K TP=2 cold load on beast (service start →
TP model loaded, journal-verified)Phase breakdown after (from the new instrumentation): layer loop 79.0 s (uniform ~1.2 s/layer/rank × 64 layers, ranks in parallel), lm_head 5.2 s, everything else ~2 s. Every #17 auto-recovery now pays 86 s instead of 221 s.
What shipped
harness/tp/isq.rs:quantize_parallel— candle's per-block k-quant math (k_quants::GgmlType::from_float, public API, no fork) fanned across the rayon pool, byte-identical toQTensor::quantize(parity-tested for Q6K/Q5K/Q4K/Q8_0). Device↔host transfers stay on the context-owning thread per the worker discipline.Remaining lever (issue body item 3)
With ISQ compute now spread over 32 cores (~2–3 s total), the residual 79 s is dominated by reading the 54 GB bf16 safetensors (~700 MB/s effective with both ranks reading) plus slicing/upload. The post-ISQ disk cache idea would cut that to ~27 GB of pre-quantized blob per fleet-load and skip ISQ entirely (~30–40 s estimate) — worth a fresh issue only if 86 s still hurts in practice. Items 2 (lm_head on workers — now only ~5 s, and the
quant=Nonevariant costs ~1.5 GB VRAM per worker) and 6 (per-layerlog_vram) are no longer worth their complexity at the new baseline.