MAI-Voice-2

MAI-Voice-2 is Microsoft AI’s text-to-speech model for natural, expressive speech in assistants, support, narration, and accessibility. Available in Microsoft Foundry.

Text to Speech

Overview

MAI-Voice-2 is Microsoft AI’s text-to-speech model for generating natural, expressive speech for products and services where voice quality affects the user experience. Microsoft positions it for assistants, customer support, audiobooks, accessibility experiences, and other long-form or brand-sensitive voice workflows.

The model is available in Microsoft Foundry and is also being integrated into VS Code and Dynamics 365 Contact Center. Microsoft says it supports 15 languages/locales, emotion control through tags, zero-shot voice prompting from short reference audio, and code-switching for select language pairs, while keeping speaker identity consistent across longer generations.

Features and capabilities

Expressive speech generation

Produces natural-sounding speech with expressive control, including emotion tags such as sad, whispered, and excited.
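
As a rough illustration of how an application might attach one of these emotion tags to a line of text before sending it for synthesis, here is a minimal Python sketch. The listing names the tags but does not document their syntax, so the bracketed "[tag] text" form and the function name tag_text are placeholders assumed for this example, not MAI-Voice-2's actual format.

```python
# Tags named in this listing; the model may accept others.
KNOWN_EMOTION_TAGS = {"sad", "whispered", "excited"}


def tag_text(text: str, emotion: str) -> str:
    """Prefix text with an emotion tag after checking it against the known set.

    ASSUMPTION: the "[tag] text" bracket syntax is a placeholder. The listing
    does not document the real tag format used by MAI-Voice-2.
    """
    normalized = emotion.strip().lower()
    if normalized not in KNOWN_EMOTION_TAGS:
        raise ValueError(f"unknown emotion tag: {emotion!r}")
    return f"[{normalized}] {text}"
```

Rejecting unknown tags before the request is sent keeps typos from being read aloud as literal text, which is a common failure mode with inline speech markup.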

Multilingual support

Extends coverage from English-only to 15 languages/locales while aiming to keep the same naturalness and expressiveness.

Zero-shot voice prompting

Uses 5–60 seconds of reference audio to create a custom voice without retraining or fine-tuning.
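
Because the reference clip has to fall inside a stated duration window, a client could check the length locally before submitting it. The sketch below uses only Python's standard wave module and assumes the clip is an uncompressed PCM WAV file; the function names and the idea of a pre-upload check are illustrative and are not part of any documented MAI-Voice-2 workflow.

```python
import io
import wave

# Bounds taken from the listing's "5-60 seconds" claim.
MIN_SECONDS = 5.0
MAX_SECONDS = 60.0


def wav_duration_seconds(wav_file) -> float:
    """Return the duration of a PCM WAV file (path or binary file object)."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference_clip(wav_file) -> bool:
    """Hypothetical pre-upload check: is the clip within the 5-60 s window?"""
    return MIN_SECONDS <= wav_duration_seconds(wav_file) <= MAX_SECONDS


def make_silent_wav(seconds: float, rate: int = 8000) -> io.BytesIO:
    """Build an in-memory mono 16-bit silent WAV, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for length in (2, 10, 61):
        clip = make_silent_wav(length)
        print(length, "seconds ->", is_valid_reference_clip(clip))
```

A duration check says nothing about audio quality or consent; it only catches clips that are clearly too short or too long.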

Stable speaker consistency

Maintains speaker identity across long-form output such as audiobooks, podcasts, and lectures.

Mixed-language speech

Supports code-switching for select language pairs such as Hindi-English and Spanish-English.

Consent controls

Includes consent guardrails so only authorized, licensed voices can be synthesized in production.

Use cases

Branded assistants and support

Give assistants or customer support products a consistent, branded voice that matches the rest of the product experience.

Long-form narration

Generate narration for long-form audio such as audiobooks, podcasts, and lectures, where stable speaker identity matters over extended output.

Accessibility experiences

Create accessible voice interfaces for visually impaired users and for people who rely on speech output as their primary way to interact with software.

Entertainment and character audio

Build character voices for games, AR/VR, or scripted media, with control over emotion and delivery style.

Custom brand voice creation

Use short reference audio to create a custom voice in Microsoft Foundry. This suits product teams that want their own voice without training a separate model.

Pros and Cons

Pros

Supports 15 languages/locales, not just English.
Offers emotion tags for finer speech direction.
Can create a custom voice from a short reference clip without retraining or fine-tuning.
Maintains speaker identity across long-form audio.
Available in Microsoft Foundry and being integrated into VS Code and Dynamics 365 Contact Center.

Cons

Pricing is not disclosed on the product page, and the linked pricing page does not provide MAI-Voice-2 pricing details.
Some capabilities are limited to select language pairs, such as Hindi-English and Spanish-English, rather than all supported languages.
Custom voice access is gated by an application flow for authorized, licensed voices.

FAQ

Where can I use MAI-Voice-2?

MAI-Voice-2 is available in Microsoft Foundry, and Microsoft says it is also being integrated into VS Code and Dynamics 365 Contact Center.

What does MAI-Voice-2 do?

The page describes MAI-Voice-2 as a text-to-speech model with support for 15 languages/locales, emotion tags, zero-shot voice prompting from 5–60 seconds of reference audio, code-switching for select language pairs, and stable speaker identity across long-form output.

Can I create a custom voice with MAI-Voice-2?

Microsoft says custom voices can be created in Microsoft Foundry with a short reference clip and without retraining or fine-tuning, but only authorized, licensed voices can be synthesized in production.

Which languages does MAI-Voice-2 support?

The launch page lists supported languages/locales, including English (US), English (Australia), Italian, French, German, Hindi, Spanish (Spain), Spanish (Mexico), Portuguese (Brazil), Portuguese (Portugal), Korean, Chinese (Simplified), Turkish, Russian, Thai, Dutch, Romanian, and Hungarian. That is 18 entries covering 15 distinct languages, because English, Spanish, and Portuguese each appear with two regional variants.
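
The list above mixes languages and regional variants, so it helps to tally it explicitly. The Python sketch below records each entry as a (language, variant) pair using the names exactly as they appear in this listing, then counts entries and distinct languages. The data structure and function names are illustrative only and do not correspond to any MAI-Voice-2 locale identifiers.

```python
from collections import defaultdict

# (language, variant) pairs as named in the listing; variant is None where
# the listing gives no regional or script qualifier.
LISTED_LOCALES = [
    ("English", "US"), ("English", "Australia"),
    ("Italian", None), ("French", None), ("German", None), ("Hindi", None),
    ("Spanish", "Spain"), ("Spanish", "Mexico"),
    ("Portuguese", "Brazil"), ("Portuguese", "Portugal"),
    ("Korean", None), ("Chinese", "Simplified"),
    ("Turkish", None), ("Russian", None), ("Thai", None),
    ("Dutch", None), ("Romanian", None), ("Hungarian", None),
]


def distinct_languages(locales):
    """Return the sorted list of unique language names."""
    return sorted({language for language, _ in locales})


def multi_variant_languages(locales):
    """Map each language that appears more than once to its variants."""
    grouped = defaultdict(list)
    for language, variant in locales:
        grouped[language].append(variant)
    return {lang: vs for lang, vs in grouped.items() if len(vs) > 1}


if __name__ == "__main__":
    print("entries:", len(LISTED_LOCALES))
    print("languages:", len(distinct_languages(LISTED_LOCALES)))
    print("multi-variant:", multi_variant_languages(LISTED_LOCALES))
```

Counting this way shows how a figure of 15 can coexist with a longer list: the count tracks languages, while the list enumerates variants.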

Quick Facts

Category: Text-to-speech
Product: MAI-Voice-2
Platform: Microsoft Foundry
Also integrated into: VS Code; Dynamics 365 Contact Center
Supported languages/locales: 15
Source domain: microsoft.ai

MAI-Voice-2 Alternatives

Wallie

Wallie is an open-source AI streamer that watches your screen, hears chat, and delivers live commentary in a configurable persona. Runs locally with your own keys.

BeFreed

BeFreed is a personalized audio learning app that turns books and other knowledge sources into narrated listening experiences. It helps people learn on demand through interactive audio, voice selection, and built-in learning tools.

Gemini 3.1 Flash TTS

Gemini 3.1 Flash TTS is Google’s preview text-to-speech model for generating expressive AI speech with fine-grained control over style and delivery. It is available across the Gemini API, Google AI Studio, Vertex AI, and Google Vids.

蓝藻AI

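蓝藻AI is an online AI voice-over and speech synthesis product that converts text to speech and supports self-service voice cloning. According to its page, it targets content that needs voice-over, such as short videos and audiobooks.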

Ondoku

Ondoku is a browser-based text-to-speech tool that turns text into downloadable .mp3 audio, with free and paid plans, multilingual reading, image reading, and commercial use options.

Typecast

Typecast is an online AI voice generator that turns text into life-like speech with emotional delivery and a selection of hyper-realistic voices. It is a browser-based tool for creating spoken audio from written content.