
ARM64 int8 quantized model produces hallucinations on short utterances #167

@eckmannmiles91

Description

Moonshine Base with int8 quantization on ARM64 (Raspberry Pi 5) produces severe hallucinations on short voice commands. The same audio transcribes correctly on x86 CPU with the non-quantized model.

Environment

  • Model: Moonshine Base (quantized ONNX, from moonshine-voice pip package)
  • Hardware: Raspberry Pi 5 (BCM2712, 8GB RAM, aarch64)
  • Python: 3.13
  • onnxruntime: 1.24.2 (ARM64 wheel from PyPI)
  • moonshine-voice: 0.0.51

Examples of hallucinations

| Actual speech | Quantized ARM64 output | Non-quantized x86 output |
| --- | --- | --- |
| "Dim the lights" | "Do you have a main bedroom" | "Dim the lights" |
| "Where is Jennie" | "It was home." | "Where is Jenny?" |
| "Where is Kinzleigh" | "Hinsulate at." | "Where's Kinsley at?" |
| (silence/noise) | "Thank you for watching" | "" |
| (silence/noise) | "Please subscribe" | "" |

The hallucinations are not random — they're consistent patterns that suggest the int8 quantization lost precision in the decoder's attention layers, causing it to generate plausible-sounding but completely wrong text.

Hallucination patterns we've cataloged

From ~1000 production voice interactions, these are the most common hallucination categories:

On silence/noise (model generates text from nothing)

  • "Thank you for watching/listening"
  • "Please subscribe"
  • "See you next time"
  • "You..." / "Bye..." (repeated filler)

On short utterances (model replaces the actual content)

  • Short commands (2-4 words) get replaced with unrelated longer phrases
  • Names are especially affected — "Kinzleigh" becomes completely unrecognizable

Garbage fragments from background audio

  • TV audio leaks through and produces sentence fragments
  • Music playing produces "I love the music" / "Nice music" / "Good music"

Our workaround

We maintain a multi-layer garbage filter in our voice pipeline:

  1. Regex patterns for known hallucination phrases (~10 patterns)
  2. Exact-match frozenset for short garbage tokens (~40 entries)
  3. Music-fragment regex for TV audio leak (~15 patterns)

This catches ~95% of hallucinations but doesn't fix the core issue — the int8 ARM64 model is producing wrong transcriptions for valid speech, not just hallucinating on silence.
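For reference, a minimal sketch of the layered filter described above. The pattern lists here are illustrative examples drawn from this issue, not our full production set, and the function name `is_hallucination` is hypothetical:

```python
import re

# Layer 1 + 3: regex patterns for known hallucination phrases and
# music/TV-leak fragments (illustrative subset, not the full ~25 patterns).
HALLUCINATION_PATTERNS = [
    re.compile(r"^thank you for (watching|listening)[.!]*$", re.I),
    re.compile(r"^please subscribe[.!]*$", re.I),
    re.compile(r"^see you next time[.!]*$", re.I),
    re.compile(r"^(i love the|nice|good) music[.!]*$", re.I),
]

# Layer 2: exact-match frozenset for short garbage tokens
# (illustrative subset of the ~40 production entries).
GARBAGE_TOKENS = frozenset({"you", "bye", "uh", "um", "the"})

def is_hallucination(text: str) -> bool:
    """Return True if a transcription matches a known garbage pattern."""
    stripped = text.strip()
    if not stripped:
        return True
    # Layer 2: normalize trailing punctuation, then exact-match lookup
    if stripped.lower().strip(".!?") in GARBAGE_TOKENS:
        return True
    # Layers 1 and 3: regex match against the catalog
    return any(p.match(stripped) for p in HALLUCINATION_PATTERNS)
```

The frozenset lookup runs first because it is O(1) and catches the repeated-filler case ("You...", "Bye...") before any regex work.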

Our current architecture (workaround)

We moved STT to an Intel Ultra 7 CPU running the non-quantized Moonshine Base model behind a Wyoming protocol wrapper. This produces accurate transcriptions with ~88-250 ms inference latency. The Pi 5 local Moonshine is kept as a fallback only.

Questions

  1. Is the int8 quantization tested on ARM64? The accuracy degradation seems much worse on ARM than x86.
  2. Are there plans for an ARM64-optimized quantization (e.g., int8 with calibration data, or dynamic quantization instead of static)?
  3. Would a quality regression test suite for different platforms help? We can contribute our test cases.
