Voxtral Transcribe 2

Voxtral Transcribe 2 is Mistral AI’s speech-to-text family for batch and live transcription. It combines diarization, timestamps, multilingual support, and an audio playground in Mistral Studio for testing before integration.

AI Speech Recognition

Transcription

Speech to Text

Overview

Voxtral Transcribe 2 is Mistral AI’s speech-to-text offering, introduced as two next-generation models for transcription: Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live applications. The launch focuses on transcription quality, speaker diarization, low latency, and language coverage rather than on a broader conversation or agent platform.

The product page says Voxtral Mini Transcribe V2 provides state-of-the-art transcription with diarization, context biasing, and word-level timestamps in 13 languages, while Voxtral Realtime is designed for streaming audio with latency configurable down to sub-200ms. Mistral also says the Realtime model is open-weights under Apache 2.0, and that an audio playground in Mistral Studio lets users test transcription with diarization and timestamps before building against the API.

The source positions Voxtral for workflows such as meeting transcription, voice agents, contact center automation, media subtitling, and compliance documentation. Pricing details in the article indicate API access for both models, with Mini Transcribe V2 at $0.003 per minute and Realtime at $0.006 per minute, plus an open-weights release for Realtime on Hugging Face.

Key capabilities

Speaker diarization

Voxtral Mini Transcribe V2 generates speaker-labeled transcripts with precise start and end times, which is useful when you need to know who said what and when.
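
As an illustration of working with speaker-labeled output, the sketch below merges consecutive segments from the same speaker into turns and prints them with time ranges. The Segment shape (speaker, start, end, text) and both helper names are assumptions made for this example; the launch article does not document the response schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical diarized segment; field names are assumptions."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments from the same speaker into one turn."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Build a new Segment so the caller's input objects are not mutated.
            turns[-1] = Segment(last.speaker, last.start,
                                max(last.end, seg.end),
                                f"{last.text} {seg.text}")
        else:
            turns.append(seg)
    return turns


def format_turn(turn: Segment) -> str:
    """Render one turn as '[MM:SS-MM:SS] speaker: text'."""
    def clock(t: float) -> str:
        minutes, seconds = divmod(int(t), 60)
        return f"{minutes:02d}:{seconds:02d}"
    return f"[{clock(turn.start)}-{clock(turn.end)}] {turn.speaker}: {turn.text}"
```

Merging by speaker keeps meeting notes readable when the model emits many short segments for one continuous turn.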

Context biasing

You can provide up to 100 words or phrases to bias the model toward names, technical terms, and other vocabulary that standard transcription systems may miss.
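
A bias list can be cleaned up before it is submitted. The sketch below trims whitespace, drops blanks and case-insensitive duplicates, and enforces the 100-entry limit quoted above. The function name is hypothetical, and how the list is actually passed to the API is not covered by the article, so this handles only local preparation.

```python
MAX_BIAS_TERMS = 100  # limit stated in the launch article


def prepare_bias_terms(terms: list[str]) -> list[str]:
    """Trim, drop blanks, and de-duplicate case-insensitively, keeping first-seen order."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for term in terms:
        normalized = " ".join(term.split())  # collapse runs of whitespace
        key = normalized.casefold()
        if normalized and key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    if len(cleaned) > MAX_BIAS_TERMS:
        raise ValueError(
            f"{len(cleaned)} terms exceed the limit of {MAX_BIAS_TERMS}"
        )
    return cleaned
```

Raising an error instead of silently truncating makes it obvious when a vocabulary list needs to be prioritized by hand.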

Word-level timestamps

The model can return timestamps for each word, supporting subtitle creation, searchable archives, and time-aligned content workflows.
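
To show how word-level timing feeds a subtitle workflow, the sketch below groups words into fixed-size cues and renders them in SRT format. The Word shape and the seven-words-per-cue default are assumptions for illustration, not part of the documented output.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word-level timestamp entry; field names are assumptions."""
    text: str
    start: float  # seconds
    end: float    # seconds


def srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group words into fixed-size cues and render them as an SRT document."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        index = i // max_words + 1  # SRT cue numbers start at 1
        timing = f"{srt_time(chunk[0].start)} --> {srt_time(chunk[-1].end)}"
        text = " ".join(w.text for w in chunk)
        cues.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(cues)
```

A production subtitler would more likely break cues on pauses or punctuation; fixed-size grouping is used here only to keep the example short.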

Expanded multilingual support

Both models support 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.

Batch and live modes

Voxtral Realtime is built for live audio and supports latency configurable down to sub-200ms, while Mini Transcribe V2 is positioned for batch transcription.

Audio playground in Studio

The product description highlights an audio playground in Mistral Studio for immediate testing with diarization, timestamps, and audio file uploads.

Common use cases

Meeting notes and recaps

Transcribe recurring meetings with speaker labels and timestamps so teams can review decisions, assignments, and discussion flow after the call.

Voice agents and assistants

Power conversational agents and assistant experiences that need transcription latency low enough to keep voice interactions responsive.

Contact center workflows

Process customer support or sales calls as they happen, using diarization to separate agent and customer speech for later analysis or CRM entry.

Media and subtitles

Generate live or near-live subtitles for multilingual media, where low latency and word-level timing help align speech with on-screen captions.

Compliance and audit records

Record regulated or sensitive conversations with diarization and timestamps to create clearer audit trails for review and documentation.

Pros and Cons

Pros

Offers both batch and low-latency transcription options under one product family.
Includes speaker diarization and word-level timestamps for more structured transcripts.
Supports 13 languages, including major global languages across Europe and Asia.
Provides an audio playground in Mistral Studio for quick testing before integration.
Voxtral Realtime is available as open weights under Apache 2.0 for edge or private deployment scenarios.

Cons

The public source is a launch article, so setup instructions, SDK details, and deployment examples are limited.
Context biasing is described as optimized for English, with support for other languages marked experimental.
The article notes that with overlapping speech the model typically transcribes one speaker, which may be a limitation in dense multi-party audio.

FAQ

What is Voxtral Transcribe 2?

Voxtral Transcribe 2 is a speech-to-text product family with two model options: Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live applications. The article also mentions an audio playground in Mistral Studio for testing transcription directly.

How do the two models differ?

The source describes Voxtral Mini Transcribe V2 as a batch transcription model and Voxtral Realtime as a streaming model for live applications where low latency matters. Beyond the model names and the Mistral Studio playground, the article does not document a full API or SDK workflow.

Can I try it in Mistral Studio?

According to the source, the audio playground in Mistral Studio supports uploading up to 10 audio files, toggling diarization, choosing timestamp granularity, and adding context bias terms. It accepts .mp3, .wav, .m4a, .flac, and .ogg files up to 1GB each.
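
The playground limits above can be checked locally before uploading. This is a minimal sketch with a hypothetical helper name; it reads "1GB" as 10**9 bytes, which is an assumption because the article does not say whether the limit is decimal or binary.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
MAX_FILES = 10
MAX_BYTES = 1_000_000_000  # assumption: "1GB" interpreted as 10**9 bytes


def check_batch(files: dict[str, int]) -> list[str]:
    """Return human-readable problems for a batch of {filename: size_in_bytes}."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"{len(files)} files exceed the limit of {MAX_FILES}")
    for name, size in files.items():
        ext = Path(name).suffix.lower()  # compare extensions case-insensitively
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported format {ext or '(none)'}")
        if size > MAX_BYTES:
            problems.append(f"{name}: {size} bytes is over the 1GB limit")
    return problems
```

An empty list means the batch fits the stated limits; the check looks only at file names and sizes, not at the audio content itself.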

How is Voxtral Transcribe 2 priced?

The article states that Voxtral Mini Transcribe V2 is available via API at $0.003 per minute, while Voxtral Realtime is available via API at $0.006 per minute and as open weights on Hugging Face. Mistral's general pricing page confirms API usage and a Studio dashboard but adds no Voxtral-specific packaging details.
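
For budgeting, the quoted per-minute rates translate into a simple estimate. The sketch below assumes cost scales linearly with audio duration and ignores any rounding, minimums, or tiers, none of which the article describes. The dictionary keys are labels chosen for this example, not API model identifiers.

```python
from decimal import Decimal

# Per-minute API prices quoted in the launch article (USD).
PRICE_PER_MINUTE = {
    "mini-transcribe-v2": Decimal("0.003"),
    "realtime": Decimal("0.006"),
}


def estimate_cost(model: str, audio_minutes: float) -> Decimal:
    """Estimate API cost in USD, assuming purely linear per-minute billing."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    # Convert via str so float inputs like 0.5 do not carry binary rounding noise.
    return PRICE_PER_MINUTE[model] * Decimal(str(audio_minutes))
```

Under these assumptions, 10 hours of audio (600 minutes) comes to $1.80 with Mini Transcribe V2 and $3.60 with Realtime.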

Can Voxtral be self-hosted or deployed privately?

The source says Voxtral Realtime is open-weights under the Apache 2.0 license and can be deployed on edge devices. It also says both models support secure on-premise or private cloud setups for GDPR- and HIPAA-compliant deployments, but the article does not provide implementation steps.

Quick Facts

Category: Speech-to-text
Product family: Voxtral Mini Transcribe V2 and Voxtral Realtime
Primary workflows: Batch transcription and live transcription
Languages: 13 languages
Studio access: Audio playground in Mistral Studio
Pricing: API rates from the launch article are $0.003/min for Mini Transcribe V2 and $0.006/min for Realtime

Alternatives to Voxtral Transcribe 2

QuickQuill

QuickQuill is a macOS dictation and transcription app that runs locally on the device. It helps users record meetings, transcribe audio, generate summaries, and export notes without using a cloud service.

Speech to Text Converter

Speech to Text Converter is a browser-based transcription tool for live dictation and uploaded audio or video files. It offers a free tier for short tasks and a Pro plan for unlimited transcription, AI summaries, translation, speaker identification, and advanced exports.

Dictato

Dictato is a Mac dictation app that transcribes speech into text in any app using an on-device, offline workflow. It supports multiple transcription engines, optional cleanup and translation, and a one-time purchase license.

Sanota

Sanota is an app that turns spoken memories, reflections, and interviews into clear written stories. It supports personal storytelling, family history, and shared memories, with guided prompts and subscription pricing.

Carbon Voice

Carbon Voice is an asynchronous voice messaging app for teams and individuals, with transcripts, AI catch-up, and cross-device access. It helps people and agents communicate without needing a live call.

Realtime and audio

An OpenAI API guide for choosing the right speech architecture for live audio, translation, transcription, speech generation, and audio-capable chat. It helps developers map each speech application to the appropriate session type, endpoint, and connection method.