MiniMax Speech-02-HD: Zero-Shot, Multilingual, Cost-Effective TTS

What sets MiniMax apart:

→ Zero-shot speaker cloning using raw audio (no transcripts required)

→ Flow-VAE model: generates speech from learned latent audio features instead of mel-spectrograms, enabling faster and more natural speech

→ Multilingual and cross-lingual synthesis (supports Thai, Vietnamese, Cantonese, etc.)

→ Ranked #1 on Artificial Arena’s public TTS leaderboard

→ Emotion control via LoRA + T2V (text-to-voice from plain textual descriptions)

→ Cost-effective: ElevenLabs Multilingual v2 costs over 4× as much as MiniMax Speech-02-HD

Check out the full report on GitHub:
