Moonshine Voice is a free, open-source AI toolkit that supports Japanese and enables the development of real-time voice applications with higher accuracy than Whisper.



Moonshine Voice is an open-source AI toolkit for building applications that handle voice in real time.

GitHub - moonshine-ai/moonshine: Fast and accurate automatic speech recognition (ASR) for edge devices
https://github.com/moonshine-ai/moonshine

Moonshine Voice runs entirely on your device, so it's fast and private—no accounts, credit cards, or API keys required.

Additionally, the framework and models are optimized for live streaming apps, performing much of the processing while the user is still talking so they can respond with low latency.

All models are trained from scratch based on original cutting-edge research, and are said to be more accurate than OpenAI's Whisper Large V3 speech recognition model.

Below are the benchmark results for live speech processing, sorted by lowest word error rate (WER): 'Moonshine Medium Streaming' outperforms 'Whisper Large V3', 'Moonshine Small Streaming' outperforms 'Whisper Small', and 'Moonshine Tiny Streaming' outperforms 'Whisper Tiny'.

| Model name | WER | Parameters | Processing time (MacBook Pro) | Processing time (Linux x86) | Processing time (Raspberry Pi 5) |
|---|---|---|---|---|---|
| Moonshine Medium Streaming | 6.65% | 245 million | 107 ms | 269 ms | 802 ms |
| Whisper Large v3 | 7.44% | 1.5 billion | 11,286 ms | 16,919 ms | N/A |
| Moonshine Small Streaming | 7.84% | 123 million | 73 ms | 165 ms | 527 ms |
| Whisper Small | 8.59% | 244 million | 1,940 ms | 3,425 ms | 10,397 ms |
| Moonshine Tiny Streaming | 12.00% | 34 million | 34 ms | 69 ms | 237 ms |
| Whisper Tiny | 12.81% | 39 million | 277 ms | 1,141 ms | 5,863 ms |
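To put the table's latency figures in perspective, the following sketch computes the speedup of each Moonshine streaming model over the Whisper model it beats on WER, using the MacBook Pro column from the table above:

```python
# Benchmark figures from the table above (MacBook Pro column, milliseconds).
latency_ms = {
    "Moonshine Medium Streaming": 107,
    "Whisper Large v3": 11_286,
    "Moonshine Small Streaming": 73,
    "Whisper Small": 1_940,
    "Moonshine Tiny Streaming": 34,
    "Whisper Tiny": 277,
}

# Each pairing matches the WER comparisons called out in the article.
pairs = [
    ("Moonshine Medium Streaming", "Whisper Large v3"),
    ("Moonshine Small Streaming", "Whisper Small"),
    ("Moonshine Tiny Streaming", "Whisper Tiny"),
]

for moonshine, whisper in pairs:
    speedup = latency_ms[whisper] / latency_ms[moonshine]
    print(f"{moonshine}: {speedup:.0f}x faster than {whisper}")
```

On these numbers, Moonshine Medium Streaming is roughly 105 times faster than Whisper Large v3 on the same hardware while also posting a lower WER.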


Whisper was a major step forward in speech recognition technology: its largest model, Large V3, made accuracy that had previously been limited to major companies like Google and Apple available to everyone else. For this reason, the Moonshine developers say they are big fans of Whisper and of derived projects such as 'faster-whisper'. However, while building applications that require a live voice interface, they realized they needed features that Whisper does not offer.

First, Whisper always operates on a fixed 30-second input window. This isn't a problem when batch-processing long recordings: you simply split the audio into roughly 30-second chunks and process them sequentially. With a live audio interface, however, you can't look ahead in the input stream to build large chunks, and individual chunks rarely last longer than 5 to 10 seconds. Short inputs must therefore be zero-padded up to the full window, forcing unnecessary work in the encoder and decoder and long waits for results. Moonshine cites 'responsiveness', typically defined as latency of 200 milliseconds or less, as its most important requirement. Padding-induced delay detracts from the user experience even on platforms with ample computing power, and makes the system unusable on constrained devices.
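The cost of the fixed window is easy to quantify. The short sketch below, using Whisper's 30-second window and 16 kHz sample rate, computes what fraction of the model's input is zero-padding for a typical short live utterance:

```python
WINDOW_S = 30.0        # Whisper's fixed input window, in seconds
SAMPLE_RATE = 16_000   # Whisper resamples all audio to 16 kHz

def padding_fraction(chunk_seconds: float) -> float:
    """Fraction of the 30-second window filled with zeros
    when transcribing a chunk shorter than the window."""
    chunk_samples = int(chunk_seconds * SAMPLE_RATE)
    window_samples = int(WINDOW_S * SAMPLE_RATE)
    return (window_samples - chunk_samples) / window_samples

# For a typical 5-second live utterance, about 83% of what the
# encoder processes is silence added only to fill the window.
print(f"{padding_fraction(5.0):.0%}")
```

In other words, for short live chunks the bulk of the encoder's work goes into padding rather than speech, which is exactly the overhead a streaming-native model can avoid.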

The second point is that Whisper does not cache anything. A voice interface needs to display feedback while the user is speaking, which means calling the speech-to-text model repeatedly as audio arrives. Whisper, however, starts from scratch on every call, so even when most of the input is unchanged it redundantly re-processes audio it has already handled. This again adds unnecessary latency and degrades the user experience.
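The benefit of caching can be illustrated with a toy sketch. This is not Moonshine's actual implementation; the class and its fake decoder are hypothetical stand-ins that only show the idea: when a new call's audio extends the previous call's audio, reuse the cached result and decode only the new suffix.

```python
class IncrementalTranscriber:
    """Toy illustration of prefix caching for a live transcriber.
    Not Moonshine's real algorithm, just the general idea."""

    def __init__(self):
        self._cached_audio = []   # samples already processed
        self._cached_text = ""    # transcript for the cached prefix
        self.work_done = 0        # total samples actually (re)decoded

    def _expensive_decode(self, samples):
        # Stand-in for a real model call; cost scales with input length.
        self.work_done += len(samples)
        return "".join(chr(ord("a") + s % 26) for s in samples)

    def transcribe(self, audio):
        n = len(self._cached_audio)
        if audio[: n] == self._cached_audio:
            # Unchanged prefix: decode only the newly arrived samples.
            self._cached_text += self._expensive_decode(audio[n:])
        else:
            # Prefix changed: fall back to a full re-decode.
            self._cached_text = self._expensive_decode(audio)
        self._cached_audio = list(audio)
        return self._cached_text
```

Calling `transcribe` repeatedly as samples arrive does work proportional to the total audio length; a Whisper-style stateless model would instead re-decode the whole buffer on every call, so its cumulative work grows quadratically with the number of updates.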

The third issue is Whisper's limited language coverage. While Whisper can transcribe and translate many languages with a single model, only 33 of its 82 languages achieve a WER of 20% or less, and when run on a constrained device, only five languages stay below that threshold. A version available via a cloud API appears to offer better accuracy, but it's not available as an open model.

Additionally, while the Whisper ecosystem itself is thriving, the researchers point out that differences in interfaces, functionality, and levels of optimization across edge platforms make it unnecessarily difficult to build applications that need to run on a variety of devices.

For this reason, Moonshine set out to create its own family of models that better meet the needs of live voice interfaces.

The library runs on Python, iOS, Android, macOS, Linux, Windows, Raspberry Pi, IoT devices, and wearables, making cross-platform integration easy.

GitHub - moonshine-ai/moonshine: Fast and accurate automatic speech recognition (ASR) for edge devices
https://github.com/moonshine-ai/moonshine?tab=readme-ov-file#quickstart

The high-level API can handle common tasks such as transcription, speaker identification, and command recognition, allowing even non-experts to build voice applications.

Supported languages include English, Spanish, Chinese (Mandarin), Japanese, Korean, Vietnamese, Ukrainian, and Arabic.

Future plans include reducing binary size for mobile deployment, implementing more languages, more streaming models, improved speaker identification, and lightweight domain customization.

in AI, Posted by logc_nt