MAI-Voice-2

MAI-Voice-2 is Microsoft AI’s text-to-speech model for natural, expressive speech in assistants, support experiences, long-form narration, and accessibility use cases. It is available in Microsoft Foundry and supports 15 languages/locales, emotion control, and custom voice creation from a short reference clip.

Overview

MAI-Voice-2 is Microsoft AI’s text-to-speech model for generating natural, expressive speech for products and services where voice quality affects the user experience. Microsoft positions it for assistants, customer support, audiobooks, accessibility experiences, and other long-form or brand-sensitive voice workflows.

The model is available in Microsoft Foundry and is being integrated into VS Code and Dynamics 365 Contact Center. According to Microsoft, it supports 15 languages/locales, emotion control through tags, zero-shot voice prompting from short reference audio, and code-switching for select language pairs. Microsoft also says it keeps speaker identity consistent across longer generations.

Features and capabilities

Expressive speech generation

Produces natural-sounding speech with expressive control, including emotion tags such as sad, whispered, and excited.

Multilingual support

Extends coverage from English-only to 15 languages/locales while aiming to keep the same naturalness and expressiveness.

Zero-shot voice prompting

Uses 5–60 seconds of reference audio to create a custom voice without retraining or fine-tuning.
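
A team preparing reference audio may want to check clip length before submitting it. The sketch below is a minimal pre-check, not part of any Microsoft API: the function names are hypothetical, and it assumes the clip is a WAV file and that the 5–60 second range is inclusive at both ends, neither of which the page confirms.

```python
import wave

# Bounds quoted on the product page; treating them as inclusive is an assumption.
MIN_SECONDS = 5.0
MAX_SECONDS = 60.0


def reference_clip_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference_length(seconds: float) -> bool:
    """Check a duration against the 5-60 second range stated on the page."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

A local check like this only catches clips that are too short or too long. It says nothing about audio quality or about the consent requirements that apply to custom voices.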

Stable speaker consistency

Maintains speaker identity across long-form output such as audiobooks, podcasts, and lectures.

Mixed-language speech

Supports code-switching for select language pairs such as Hindi-English and Spanish-English.

Consent controls

Includes consent guardrails so only authorized, licensed voices can be synthesized in production.

Use cases

  • Branded assistants and support

    Use MAI-Voice-2 to give assistants or customer support products a consistent, branded voice that matches the rest of your product experience.

  • Long-form narration

    Generate narration for long-form audio such as audiobooks, podcasts, and lectures, where stable speaker identity matters over extended output.

  • Accessibility experiences

    Create accessible voice interfaces for visually impaired users or people who rely on speech output as their primary way to interact with software.

  • Entertainment and character audio

    Build character voices for games, AR/VR, or scripted media, with control over emotion and delivery style.

  • Custom brand voice creation

    Use short reference audio to create a custom voice in Microsoft Foundry for product teams that want their own voice without training a separate model.

Pros and cons

Pros

  • Supports 15 languages/locales, not just English.
  • Offers emotion tags for finer speech direction.
  • Can create a custom voice from a short reference clip without retraining or fine-tuning.
  • Maintains speaker identity across long-form audio.
  • Available in Microsoft Foundry and being integrated into VS Code and Dynamics 365 Contact Center.

Cons

  • Pricing is not disclosed: neither the product page nor the linked pricing page lists MAI-Voice-2 pricing details.
  • Code-switching is limited to select language pairs, such as Hindi-English and Spanish-English, rather than all supported languages.
  • Custom voice access is gated by an application flow for authorized, licensed voices.

FAQ

Where can I use MAI-Voice-2?

MAI-Voice-2 is available in Microsoft Foundry, and Microsoft says it is also being integrated into VS Code and Dynamics 365 Contact Center.

What does MAI-Voice-2 do?

The page describes MAI-Voice-2 as a text-to-speech model with support for 15 languages/locales, emotion tags, zero-shot voice prompting from 5–60 seconds of reference audio, code-switching for select language pairs, and stable speaker identity across long-form output.

Can I create a custom voice with MAI-Voice-2?

Microsoft says custom voices can be created in Microsoft Foundry with a short reference clip and without retraining or fine-tuning, but only authorized, licensed voices can be synthesized in production.

Which languages does MAI-Voice-2 support?

The launch page lists 18 locales covering 15 languages: English (US), English (Australia), Italian, French, German, Hindi, Spanish (Spain), Spanish (Mexico), Portuguese (Brazil), Portuguese (Portugal), Korean, Chinese (Simplified), Turkish, Russian, Thai, Dutch, Romanian, and Hungarian.
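
The list in this answer has more entries than the headline figure of 15, because English, Spanish, and Portuguese each appear in two regional variants. The short sketch below counts both ways. The locale names are copied from the answer as display strings; the `base_language` helper is hypothetical, and no official locale codes are assumed.

```python
# Locale names exactly as listed in the answer above.
LOCALES = [
    "English (US)", "English (Australia)", "Italian", "French", "German",
    "Hindi", "Spanish (Spain)", "Spanish (Mexico)", "Portuguese (Brazil)",
    "Portuguese (Portugal)", "Korean", "Chinese (Simplified)", "Turkish",
    "Russian", "Thai", "Dutch", "Romanian", "Hungarian",
]


def base_language(locale: str) -> str:
    """Drop a trailing '(Region)' or '(Script)' qualifier, if present."""
    return locale.split(" (")[0]


# Collapse regional variants into their base language.
languages = sorted({base_language(name) for name in LOCALES})

print(len(LOCALES), "locales,", len(languages), "languages")
# prints: 18 locales, 15 languages
```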

Quick facts

Category: Text-to-speech
Product: MAI-Voice-2
Platform: Microsoft Foundry
Also integrated into: VS Code; Dynamics 365 Contact Center
Supported languages/locales: 15
Source domain: microsoft.ai