NEW · FEATURED · r/LocalLLaMA · rss · EN · 08:33 · 06·13
Zyphra open-sources ZONOS2: an 8B-param, 900M-active real-time TTS with high-fidelity voice cloning
Zyphra released ZONOS2, an open-source real-time TTS model with 8B total params and only 900M active at inference. It uses a sparse MoE design to balance speed and expressiveness, with zero-shot voice cloning as the headline feature. The model reads raw UTF-8 bytes instead of using a phonemizer, which helps with Chinese, Korean, Japanese, and mid-sentence code-switching. Audio runs through a 44.1kHz Descript Audio Codec for studio-quality output, and training data scaled from 200K to over 6M hours. On the TTSDS prosody benchmark it scores 88.7, ahead of Qwen 3 TTS, Cartesia Sonic 3.5, and ElevenLabs V3. Weights are Apache 2.0, and inference is also available on Zyphra Cloud running AMD hardware.
#Zyphra #ZONOS2 #Qwen 3 TTS
why featured
ZONOS2 combines real-time inference, high-fidelity cloning, and Apache 2.0 licensing in local TTS, and the engineering choices are worth a look. Score stays at 74 because we only have the Reddit post: no independent benchmarks or side-by-side comparisons yet, so real-world clonin...
editor take
Zyphra open-sourced an 8B-param real-time TTS with 900M active params and zero-shot voice cloning under Apache 2.0.
sharp
The reason to click: Zyphra tackles the classic TTS quality-vs-speed tradeoff with a sparse MoE: 8B total params, only 900M active at inference, real-time 44.1kHz output. Zero-shot voice cloning is the headline: feed it a reference clip and it mimics the speaker's timbre and style, no fine-tuning needed.
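For readers new to sparse MoE, here is a minimal sketch of why only a slice of the parameters runs per token. The post gives only the 8B total and 900M active figures; the expert count, the top-k value, and the router logits below are hypothetical, and `top_k_route` is an illustrative helper, not ZONOS2's actual router.

```python
import math


def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the model's parameters that actually run for one token."""
    return active_params / total_params


# Figures from the post: 8B total params, 900M active at inference.
frac = active_fraction(900e6, 8e9)

# Hypothetical router: 8 experts, 2 selected per token (assumed values).
routes = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.9], k=2)

print(f"active share: {frac:.2%}")
print("selected experts:", routes)
```

With the post's numbers, roughly 11% of the weights are used per token, which is where the speed side of the tradeoff comes from: compute scales with the active parameters while capacity scales with the total.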
Text handling skips the phonemizer entirely. It reads raw UTF-8 bytes, which helps with Chinese, Korean, Japanese, and mid-sentence code-switching. Training data scaled from 200K to over 6M hours, with staged filtering to cut down hallucinations and repetitions.
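To make the byte-level input concrete, here is a minimal sketch, assuming the model's input IDs are simply the UTF-8 byte values of the text. The post does not describe ZONOS2's actual preprocessing (special tokens, normalization, and so on), and `text_to_byte_ids` and `byte_ids_to_text` are hypothetical helpers.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as a list of UTF-8 byte values, each in the range 0-255."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Inverse of text_to_byte_ids: rebuild the string from byte values."""
    return bytes(ids).decode("utf-8")


# Mid-sentence code-switching: English, Chinese, Korean, Japanese.
mixed = "Hello 你好 안녕 こんにちは"
ids = text_to_byte_ids(mixed)

print(len(mixed), "characters ->", len(ids), "byte IDs")
print(byte_ids_to_text(ids) == mixed)
```

The vocabulary is fixed at 256 values regardless of language, so no per-language phonemizer or lexicon is needed. The cost is longer sequences: each CJK character in this example takes three bytes.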
The TTSDS prosody score of 88.7 edges out Qwen 3 TTS and ElevenLabs V3, but don't lean on one benchmark. Zyphra also released ZTTS1-Eval, a new eval suite covering 17 languages with updated evaluation models. Weights are Apache 2.0, inference and eval code are public, and hosted inference runs on Zyphra Cloud's AMD hardware.
What's missing: real-world latency numbers from local setups and cloning consistency across accents. The Reddit thread doesn't have community benchmarks yet. I'd treat this as the strongest open-source TTS candidate for expressiveness right now, and wait for community latency reports before calling it production-ready.
HKR breakdown
hook ✓ · knowledge ✓ · resonance ✓