
ARM64 int8 quantized model produces hallucinations on short utterances #167

@eckmannmiles91

Description

Moonshine Base with int8 quantization on ARM64 (Raspberry Pi 5) produces severe hallucinations on short voice commands. The same audio transcribes correctly on x86 CPU with the non-quantized model.

Environment

  • Model: Moonshine Base (quantized ONNX, from moonshine-voice pip package)
  • Hardware: Raspberry Pi 5 (BCM2712, 8GB RAM, aarch64)
  • Python: 3.13
  • onnxruntime: 1.24.2 (ARM64 wheel from PyPI)
  • moonshine-voice: 0.0.51

Examples of hallucinations

| Actual speech | Quantized ARM64 output | Non-quantized x86 output |
| --- | --- | --- |
| "Dim the lights" | "Do you have a main bedroom" | "Dim the lights" |
| "Where is Jennie" | "It was home." | "Where is Jenny?" |
| "Where is Kinzleigh" | "Hinsulate at." | "Where's Kinsley at?" |
| (silence/noise) | "Thank you for watching" | "" |
| (silence/noise) | "Please subscribe" | "" |

The hallucinations are not random — they're consistent patterns that suggest the int8 quantization lost precision in the decoder's attention layers, causing it to generate plausible-sounding but completely wrong text.

Hallucination patterns we've cataloged

From ~1000 production voice interactions, these are the most common hallucination categories:

On silence/noise (model generates text from nothing)

  • "Thank you for watching/listening"
  • "Please subscribe"
  • "See you next time"
  • "You..." / "Bye..." (repeated filler)

On short utterances (model replaces the actual content)

  • Short commands (2-4 words) get replaced with unrelated longer phrases
  • Names are especially affected — "Kinzleigh" becomes completely unrecognizable

Garbage fragments from background audio

  • TV audio leaks through and produces sentence fragments
  • Music playing produces "I love the music" / "Nice music" / "Good music"

Our workaround

We maintain a multi-layer garbage filter in our voice pipeline:

  1. Regex patterns for known hallucination phrases (~10 patterns)
  2. Exact-match frozenset for short garbage tokens (~40 entries)
  3. Music-fragment regex for TV audio leak (~15 patterns)

This catches ~95% of hallucinations but doesn't fix the core issue — the int8 ARM64 model is producing wrong transcriptions for valid speech, not just hallucinating on silence.
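For reference, a minimal sketch of the layered filter described above. The pattern lists here are illustrative examples drawn from this issue, not our full production set, and the function name `is_hallucination` is hypothetical:

```python
import re

# Layer 1 + 3: regex patterns for known hallucination phrases and
# music/TV-leak fragments (illustrative subset, not the full ~25 patterns).
HALLUCINATION_PATTERNS = [
    re.compile(r"^thank you for (watching|listening)[.!]*$", re.I),
    re.compile(r"^please subscribe[.!]*$", re.I),
    re.compile(r"^see you next time[.!]*$", re.I),
    re.compile(r"^(i love the|nice|good) music[.!]*$", re.I),
]

# Layer 2: exact-match frozenset for short garbage tokens
# (illustrative subset of the ~40 production entries).
GARBAGE_TOKENS = frozenset({"you", "bye", "uh", "um", "the"})

def is_hallucination(text: str) -> bool:
    """Return True if a transcription matches a known garbage pattern."""
    stripped = text.strip()
    if not stripped:
        return True
    # Layer 2: normalize trailing punctuation, then exact-match lookup
    if stripped.lower().strip(".!?") in GARBAGE_TOKENS:
        return True
    # Layers 1 and 3: regex match against the catalog
    return any(p.match(stripped) for p in HALLUCINATION_PATTERNS)
```

The frozenset lookup runs first because it is O(1) and catches the repeated-filler case ("You...", "Bye...") before any regex work.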

Our current architecture (workaround)

We moved STT to an Intel Ultra 7 CPU running the non-quantized Moonshine Base model behind a Wyoming protocol wrapper. This produces accurate transcriptions with ~88-250 ms inference latency. The Pi 5 local Moonshine is kept as a fallback only.

Questions

  1. Is the int8 quantization tested on ARM64? The accuracy degradation seems much worse on ARM than x86.
  2. Are there plans for an ARM64-optimized quantization (e.g., int8 with calibration data, or dynamic quantization instead of static)?
  3. Would a quality regression test suite for different platforms help? We can contribute our test cases.
