Expressive speech generation
Produces natural-sounding speech with expressive control, including emotion tags such as sad, whispered, and excited.
MAI-Voice-2 is Microsoft AI’s text-to-speech model for natural, expressive speech in assistants, support, narration, and accessibility. Available in Microsoft Foundry.
MAI-Voice-2 is Microsoft AI’s text-to-speech model for generating natural, expressive speech for products and services where voice quality affects the user experience. Microsoft positions it for assistants, customer support, audiobooks, accessibility experiences, and other long-form or brand-sensitive voice workflows.
The model is available in Microsoft Foundry and is also being integrated into VS Code and Dynamics 365 Contact Center. Microsoft says it supports 15 languages/locales, emotion control through tags, zero-shot voice prompting from short reference audio, and code-switching for select language pairs, while keeping speaker identity consistent across longer generations.
Extends coverage from English-only to 15 languages/locales while aiming to keep the same naturalness and expressiveness.
Uses 5–60 seconds of reference audio to create a custom voice without retraining or fine-tuning.
Maintains speaker identity across long-form output such as audiobooks, podcasts, and lectures.
Supports code-switching for select language pairs such as Hindi-English and Spanish-English.
Includes consent guardrails so only authorized, licensed voices can be synthesized in production.
Use MAI-Voice-2 to give assistants or customer support products a branded, consistent voice that matches the experience users hear from your product.
Generate narration for long-form audio such as audiobooks, podcasts, and lectures, where stable speaker identity matters over extended output.
Create accessible voice interfaces for visually impaired users or people who rely on speech output as their primary way to interact with software.
Build character voices for games, AR/VR, or scripted media, with control over emotion and delivery style.
Use short reference audio to create a custom voice in Microsoft Foundry for product teams that want their own voice without training a separate model.
MAI-Voice-2 is available in Microsoft Foundry, and Microsoft says it is also being integrated into VS Code and Dynamics 365 Contact Center.
The page describes MAI-Voice-2 as a text-to-speech model with support for 15 languages/locales, emotion tags, zero-shot voice prompting from 5–60 seconds of reference audio, code-switching for select language pairs, and stable speaker identity across long-form output.
Microsoft says custom voices can be created in Microsoft Foundry with a short reference clip and without retraining or fine-tuning, but only authorized, licensed voices can be synthesized in production.
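As a rough illustration of the 5–60 second reference-audio window, the sketch below checks the duration of a WAV clip before it would be submitted as a voice prompt. It uses only Python's standard `wave` module. The function names and the idea of a client-side pre-check are assumptions made for illustration; the page does not describe the upload API, the accepted audio formats, or how Microsoft Foundry enforces the limit.

```python
import io
import wave

# Bounds taken from the page: 5-60 seconds of reference audio.
MIN_SECONDS = 5.0
MAX_SECONDS = 60.0


def clip_duration_seconds(wav_file) -> float:
    """Return the duration of a PCM WAV clip (file path or file-like object)."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference_clip(wav_file) -> bool:
    """Hypothetical pre-check: is the clip inside the stated 5-60 s window?"""
    return MIN_SECONDS <= clip_duration_seconds(wav_file) <= MAX_SECONDS


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit silent WAV, used here only for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for seconds in (2, 10, 61):
        ok = is_valid_reference_clip(make_silent_wav(seconds))
        print(f"{seconds:>2} s clip accepted: {ok}")
```

A check like this only covers clip length. It says nothing about the consent and licensing requirements the page mentions, which would still apply to any voice used in production.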
The launch page lists supported languages/locales, including the following 18 entries, which cover 15 distinct languages when regional variants are counted together: English (US), English (Australia), Italian, French, German, Hindi, Spanish (Spain), Spanish (Mexico), Portuguese (Brazil), Portuguese (Portugal), Korean, Chinese (Simplified), Turkish, Russian, Thai, Dutch, Romanian, and Hungarian.
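The list above has 18 entries, while the page cites 15 languages/locales. One way to reconcile the two figures, assuming regional variants such as English (US) and English (Australia) count as a single language, is to strip the parenthetical qualifier and count what remains. The snippet below is only a worked count over the names as listed; the grouping rule is an assumption, not something the page states.

```python
import re

# Entries exactly as named in the list above.
LOCALES = [
    "English (US)", "English (Australia)", "Italian", "French", "German",
    "Hindi", "Spanish (Spain)", "Spanish (Mexico)", "Portuguese (Brazil)",
    "Portuguese (Portugal)", "Korean", "Chinese (Simplified)", "Turkish",
    "Russian", "Thai", "Dutch", "Romanian", "Hungarian",
]


def base_language(locale: str) -> str:
    """Drop a trailing parenthetical such as ' (US)' to get the language name."""
    return re.sub(r"\s*\(.*\)$", "", locale)


# Assumption: entries that share a base name are variants of one language.
languages = sorted({base_language(name) for name in LOCALES})

print(len(LOCALES), "listed entries")      # 18 listed entries
print(len(languages), "distinct languages")  # 15 distinct languages
```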
Wallie is an open-source AI streamer that watches your screen, hears chat, and delivers live commentary in a configurable persona. Runs locally with your own keys.
BeFreed is a personalized audio learning app that turns books and other knowledge sources into narrated listening experiences. It helps people learn on demand through interactive audio, voice selection, and built-in learning tools.
Gemini 3.1 Flash TTS is Google’s preview text-to-speech model for generating expressive AI speech with fine-grained control over style and delivery. It is available across the Gemini API, Google AI Studio, Vertex AI, and Google Vids.
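蓝藻AI is an online AI voiceover and speech synthesis product that converts text to speech and supports self-service voice cloning. According to its page, it targets content that needs voiceover, such as short videos and audiobooks.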
Ondoku is a browser-based text-to-speech tool that turns text into downloadable .mp3 audio, with free and paid plans, multilingual reading, image reading, and commercial use options.
Typecast is an online AI voice generator that turns text into life-like speech with emotional delivery and a selection of hyper-realistic voices. It is a browser-based tool for creating spoken audio from written content.