Voxtral TTS

Voxtral TTS is Mistral’s text-to-speech model for lifelike multilingual speech, voice agents, and enterprise voice workflows with low latency.

AI Speech Synthesis

AI Voice Assistants

Text to Speech

Overview

Voxtral TTS is Mistral’s first text-to-speech model, announced as an open-weights system for multilingual voice generation. It is designed to turn text into lifelike speech for voice agents and other speech interfaces, with a focus on naturalness, low latency, and easy adaptation to new voices.

Mistral positions the model for enterprise voice workflows where both quality and speed matter. The announcement highlights support for nine languages, emotionally expressive speech, custom voice adaptation from short references, and access through Mistral Studio, Le Chat, the API, and open weights on Hugging Face.

Features

Multilingual speech generation

Generates realistic, emotionally expressive speech across nine supported languages.

Instant voice adaptation

Supports custom voice adaptation from short reference audio, including accent, intonation, pauses, and other speaking nuances.

Low-latency output

Designed for low-latency streaming, with a reported model latency of 70 ms for a typical input of a 10-second voice sample and 500 characters of text.

Compact model size

Uses a compact 4B-parameter model, which Mistral says helps keep voice-agent deployments natural and cost-effective at scale.

Cross-lingual voice prompting

Supports cross-lingual voice adaptation and can generate speech in one language using a voice prompt from another language.

Studio and API access

Can be tested in Mistral Studio. According to the source, the API also includes preset voices and the option to extend to an in-house voice library.

Use Cases

Voice agents

Generate spoken responses for assistants and agents that need natural, expressive voice output rather than a flat readout of text.

Multilingual localization

Localize customer-facing audio into supported languages while keeping the delivered speech consistent with a reference voice or accent.

Cross-lingual translation

Create speech-to-speech translation flows where the generated output should retain the character of a source voice while changing language.

Voice prototyping

Prototype or refine a branded in-house voice by testing voice references in Mistral Studio before wiring the model into production systems.

Enterprise speech pipelines

Use the API or open weights to add speech output to existing LLM or speech-to-text pipelines without replacing the rest of the stack.

Pros and Cons

Pros

Supports nine major languages and several dialects, making it suitable for multilingual voice generation.
Can adapt to a custom voice from a short reference sample and preserve speaking style details like rhythm and intonation.
Emphasizes low latency for voice-agent use cases and streaming output.
Offers multiple access paths, including Mistral Studio, Le Chat, API usage, and open weights on Hugging Face.
Built with cross-lingual voice adaptation in mind, which can support speech-to-speech translation workflows.

Cons

The product page gives limited public detail about integration patterns beyond Mistral Studio, Le Chat, API, and Hugging Face availability.
The announcement does not provide a full pricing tier breakdown for Voxtral TTS beyond an API rate.
The open-weights release is described as available under CC BY-NC 4.0, a license that restricts use to non-commercial purposes, so commercial deployments need to check the terms first.

FAQ

How can I access Voxtral TTS?

Voxtral TTS is available now via API, and Mistral also says it can be tried in Mistral Studio and in Le Chat.

Which languages does Voxtral TTS support?

The source says it supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.

How much reference audio does Voxtral TTS need?

The model is described as taking a text prompt together with a voice prompt of about 5 to 25 seconds. Mistral also says it can adapt to a custom voice from as little as 3 seconds of reference audio.
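
As a rough illustration, a reference clip can be checked against the stated 5 to 25 second range before it is sent anywhere. The sketch below uses only Python's standard wave module and assumes the clip is an uncompressed WAV file. The function names and the idea of treating the range as a pre-flight check are assumptions for illustration, not part of Mistral's API, and the 3-second minimum mentioned above can be passed in as a lower bound instead.

```python
import io
import wave


def wav_duration_seconds(source) -> float:
    """Return the duration in seconds of a WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_voice_prompt(source, min_seconds: float = 5.0, max_seconds: float = 25.0) -> str:
    """Classify a reference clip as 'too_short', 'ok', or 'too_long'.

    The default bounds come from the 5 to 25 second range quoted in the text;
    they are guidance for this sketch, not documented hard limits.
    """
    duration = wav_duration_seconds(source)
    if duration < min_seconds:
        return "too_short"
    if duration > max_seconds:
        return "too_long"
    return "ok"


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # A 10-second clip falls inside the default range.
    print(check_voice_prompt(make_silent_wav(10)))
    # A 3-second clip passes only if the lower bound is relaxed.
    print(check_voice_prompt(make_silent_wav(3), min_seconds=3.0))
```

Compressed formats such as MP3 would need a different decoder, which is outside the scope of this sketch.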

Can Voxtral TTS generate long audio clips?

The announcement says the API handles arbitrarily long generations with smart interleaving, while the model itself natively generates up to two minutes of audio.
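
When running the open weights directly, long scripts may need to be split on the client side so each request stays within the model's native limit. The sketch below shows one simple way to do that, by grouping sentences under a character budget. It is not Mistral's interleaving method, which the source does not describe; the function name and the 500-character default are assumptions chosen for illustration, and the right budget for a two-minute clip would have to be found by testing.

```python
import re


def split_into_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    A single sentence longer than max_chars is kept whole rather than cut
    mid-word, so a chunk can exceed the budget in that one case.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    script = "Welcome to the demo. This is the second sentence. And a third!"
    for index, chunk in enumerate(split_into_chunks(script, max_chars=40)):
        print(index, chunk)
```

Each chunk would then be synthesized separately and the audio segments joined in order. Splitting at sentence boundaries keeps pauses natural at the joins, though prosody may still differ slightly between segments.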

Is Voxtral TTS open weights?

Mistral says a model with several reference voices is available as open weights on Hugging Face under the CC BY-NC 4.0 license.

Quick Facts

Category: Text to speech
Product: Voxtral TTS
Vendor: Mistral AI
Source domain: mistral.ai
Languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic
Access: API, Mistral Studio, Le Chat, open weights on Hugging Face

Voxtral TTS Alternatives

Wallie

Wallie is an open-source AI streamer that watches your screen, hears chat, and delivers live commentary in a configurable persona. Runs locally with your own keys.

Gemini 3.1 Flash TTS

Gemini 3.1 Flash TTS is Google’s preview text-to-speech model for expressive AI speech with fine-grained style and delivery control across Gemini API, Google AI Studio, Vertex AI, and Google Vids.

蓝藻AI

蓝藻AI is an online AI voice generation and dubbing platform that turns text into speech and supports self-service voice cloning for short videos and audiobooks.

Ondoku

Ondoku is a browser-based text-to-speech tool that turns text into downloadable .mp3 audio, with free and paid plans, multilingual reading, image reading, and commercial use options.

PXZ AI

PXZ AI is an all-in-one AI platform that combines tools for image, video, voice, writing, and chat to enhance creativity and collaboration.

Gemma AI

Gemma AI is a phone call reminder app that calls you with scheduled reminders instead of push notifications, with Google Calendar sync and natural voice interaction.