fix(stage-8e-2c): cast bf16/f16 activations to f32 around QMatMul
All checks were successful
CI / Format (push) Successful in 33s
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Clippy (push) Successful in 2m18s
CI / Test (push) Successful in 4m26s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m41s
build-prerelease / Build cortex binary (push) Successful in 4m22s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 5m12s
build-prerelease / Build neuron-ada (push) Successful in 4m41s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
candle's QTensor::cuda_fwd requires f32 inputs because its on-the-fly GGUF dequantize accumulates in f32. The model dtype flowing into MaybeQuantLinear::forward is bf16, so QMatMul::forward errored with "unexpected dtype, expected: F32, got: BF16".

Wrap the Quant arm to cast the activation to f32 before the matmul and cast the result back to the input dtype. The cast is a single launch on the activation tensor (small relative to weight traffic); it's the price of in-situ GGUF-style quantization, and what mistralrs does inside its own Linear wrapper. The Plain arm is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -75,7 +75,26 @@ impl Module for MaybeQuantLinear {
     fn forward(&self, x: &Tensor) -> candle_core::Result<Tensor> {
         match self {
             Self::Plain(l) => l.forward(x),
-            Self::Quant(qm) => qm.forward(x),
+            Self::Quant(qm) => {
+                // candle's `QTensor::cuda_fwd` requires f32 inputs (the
+                // GGUF kernels dequantize on the fly and accumulate in
+                // f32). Our model dtype is bf16 (or f16) so we cast in
+                // and out at the matmul boundary. The cast itself is a
+                // single launch on the activation tensor, cheap vs the
+                // weight loads the matmul saves.
+                let in_dtype = x.dtype();
+                let x_f32 = if in_dtype == candle_core::DType::F32 {
+                    x.clone()
+                } else {
+                    x.to_dtype(candle_core::DType::F32)?
+                };
+                let y = qm.forward(&x_f32)?;
+                if y.dtype() == in_dtype {
+                    Ok(y)
+                } else {
+                    y.to_dtype(in_dtype)
+                }
+            }
         }
     }
 }
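The cast-in / cast-out pattern in the Quant arm can be illustrated without candle. The following is a minimal std-only sketch, not candle's or this repository's code: bf16 values are modelled as raw `u16` bit patterns, and `bf16_to_f32`, `f32_to_bf16`, `dot_f32`, and `dot_bf16` are hypothetical helpers introduced only for illustration. It assumes the usual bf16 layout (the upper 16 bits of an f32), which is why widening is exact and only the narrowing cast on the way out rounds.

```rust
// Hypothetical stand-ins: a bf16 value is stored as its raw u16 bit pattern.

/// Widen a bf16 bit pattern to f32. bf16 is the upper 16 bits of an f32,
/// so this direction loses nothing.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Narrow an f32 to bf16 using round-to-nearest-even.
/// NaN handling is simplified: the result is forced to stay a NaN.
fn f32_to_bf16(v: f32) -> u16 {
    let bits = v.to_bits();
    if v.is_nan() {
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Adding 0x7FFF plus the lowest kept bit rounds ties to even.
    let rounding_bias = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding_bias) >> 16) as u16
}

/// Hypothetical f32-only kernel standing in for the quantized matmul:
/// a plain dot product that accumulates in f32.
fn dot_f32(x: &[f32], w: &[f32]) -> f32 {
    x.iter().zip(w).map(|(a, b)| a * b).sum()
}

/// Same shape as the Quant arm: cast the activation to f32, run the
/// f32-only kernel, then cast the result back to the input dtype.
fn dot_bf16(x: &[u16], w: &[f32]) -> u16 {
    let x_f32: Vec<f32> = x.iter().copied().map(bf16_to_f32).collect();
    f32_to_bf16(dot_f32(&x_f32, w))
}
```

Under these assumptions the only precision cost of the wrapper is the final narrowing of the output, since the accumulation itself already happens in f32.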