## Description
Moonshine Base with int8 quantization on ARM64 (Raspberry Pi 5) produces severe hallucinations on short voice commands. The same audio transcribes correctly on x86 CPU with the non-quantized model.
## Environment
- Model: Moonshine Base (quantized ONNX, from the moonshine-voice pip package)
- Hardware: Raspberry Pi 5 (BCM2712, 8GB RAM, aarch64)
- Python: 3.13
- onnxruntime: 1.24.2 (ARM64 wheel from PyPI)
- moonshine-voice: 0.0.51
## Examples of hallucinations
| Actual speech | Quantized ARM64 output | Non-quantized x86 output |
| --- | --- | --- |
| "Dim the lights" | "Do you have a main bedroom" | "Dim the lights" |
| "Where is Jennie" | "It was home." | "Where is Jenny?" |
| "Where is Kinzleigh" | "Hinsulate at." | "Where's Kinsley at?" |
| (silence/noise) | "Thank you for watching" | "" |
| (silence/noise) | "Please subscribe" | "" |
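Mismatches like the ones above can be collected systematically. A minimal comparison-harness sketch (the two transcribe callables are placeholders for whichever STT backends are being compared; they are not part of the moonshine-voice API):

```python
from typing import Callable, Iterable


def collect_mismatches(
    clips: Iterable[tuple[str, str]],       # (clip_path, expected_text) pairs
    transcribe_a: Callable[[str], str],     # e.g. quantized ARM64 backend
    transcribe_b: Callable[[str], str],     # e.g. non-quantized x86 backend
) -> list[tuple[str, str, str, str]]:
    """Return (clip, expected, output_a, output_b) rows for every clip
    where at least one backend disagrees with the expected transcription."""
    mismatches = []
    for clip, expected in clips:
        out_a = transcribe_a(clip)
        out_b = transcribe_b(clip)
        if out_a != expected or out_b != expected:
            mismatches.append((clip, expected, out_a, out_b))
    return mismatches
```

Feeding both backends the same labeled clips and diffing the rows is how the table above could be regenerated for a regression suite.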
The hallucinations are not random — they're consistent patterns that suggest the int8 quantization lost precision in the decoder's attention layers, causing it to generate plausible-sounding but completely wrong text.
## Hallucination patterns we've cataloged
From ~1000 production voice interactions, these are the most common hallucination categories:
### On silence/noise (model generates text from nothing)
- "Thank you for watching/listening"
- "Please subscribe"
- "See you next time"
- "You..." / "Bye..." (repeated filler)
### On short utterances (model replaces the actual content)
- Short commands (2-4 words) get replaced with unrelated longer phrases
- Names are especially affected — "Kinzleigh" becomes completely unrecognizable
### Garbage fragments from background audio
- TV audio leaks through and produces sentence fragments
- Music playing produces "I love the music" / "Nice music" / "Good music"
## Our workaround
We maintain a multi-layer garbage filter in our voice pipeline:
- Regex patterns for known hallucination phrases (~10 patterns)
- Exact-match frozenset for short garbage tokens (~40 entries)
- Music-fragment regex for TV audio leak (~15 patterns)
This catches ~95% of hallucinations but doesn't fix the core issue — the int8 ARM64 model is producing wrong transcriptions for valid speech, not just hallucinating on silence.
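A condensed sketch of the three filter layers (the pattern lists here are abbreviated stand-ins; the production sets are larger, as noted above):

```python
import re

# Layer 1: regex patterns for known hallucination phrases (abbreviated)
HALLUCINATION_PATTERNS = [
    re.compile(r"thank you for (watching|listening)", re.IGNORECASE),
    re.compile(r"please subscribe", re.IGNORECASE),
    re.compile(r"see you next time", re.IGNORECASE),
]

# Layer 2: exact-match frozenset for short garbage tokens (abbreviated)
GARBAGE_TOKENS = frozenset({"you", "bye", "thanks"})

# Layer 3: music-fragment regex for TV audio leak (abbreviated)
MUSIC_PATTERNS = [
    re.compile(r"\b(i love the|nice|good) music\b", re.IGNORECASE),
]


def is_garbage(text: str) -> bool:
    """Return True when a transcription matches a known hallucination."""
    stripped = text.strip().strip(".!?").lower()
    if not stripped or stripped in GARBAGE_TOKENS:
        return True
    return any(p.search(text) for p in HALLUCINATION_PATTERNS + MUSIC_PATTERNS)
```

The layers run cheapest-first: empty/frozenset checks before the regex scans.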
## Our current architecture (workaround)
We moved STT to an Intel Ultra 7 CPU running the non-quantized Moonshine Base model via a Wyoming protocol wrapper. This produces accurate transcriptions in ~88–250 ms per utterance. The Pi 5 local Moonshine is kept as a fallback only.
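The primary/fallback routing amounts to something like the following sketch (the two callables are hypothetical stand-ins for the Wyoming client and the on-device Moonshine call, not real library APIs):

```python
from typing import Callable


def transcribe_with_fallback(
    audio: bytes,
    remote_stt: Callable[[bytes], str],  # Wyoming wrapper -> x86 Moonshine Base
    local_stt: Callable[[bytes], str],   # on-device quantized Moonshine (Pi 5)
) -> str:
    """Prefer the accurate remote backend; fall back to local on failure."""
    try:
        return remote_stt(audio)
    except (ConnectionError, TimeoutError):
        # Remote box unreachable: accept degraded local accuracy
        # (and rely on the garbage filter downstream).
        return local_stt(audio)
```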
## Questions
- Is the int8 quantization tested on ARM64? The accuracy degradation seems much worse on ARM than x86.
- Are there plans for an ARM64-optimized quantization (e.g., int8 with calibration data, or dynamic quantization instead of static)?
- Would a quality regression test suite for different platforms help? We can contribute our test cases.
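On the dynamic-quantization question above: a variant could be produced with onnxruntime's stock quantization tooling. A sketch, assuming access to the FP32 ONNX graph (the paths are hypothetical; the actual filenames shipped in the moonshine-voice package may differ):

```python
def requantize_dynamic(fp32_model_path: str, int8_model_path: str) -> None:
    """Dynamically quantize an FP32 ONNX model to int8 weights.

    Dynamic quantization keeps activations in FP32 and computes their
    scales at runtime, which often degrades accuracy less than static
    int8 on attention-heavy decoders.
    """
    # Imported lazily so the sketch is importable without onnxruntime.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_model_path,
        model_output=int8_model_path,
        weight_type=QuantType.QInt8,  # int8 weights, FP32 activations
    )


# Hypothetical paths for illustration only:
# requantize_dynamic("decoder_fp32.onnx", "decoder_int8_dynamic.onnx")
```

Comparing this output against the shipped static-int8 decoder on the test cases above would isolate whether the regression is in the quantization scheme or in the ARM64 int8 kernels themselves.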