Voxtral

Voxtral is an advanced speech-to-text platform offering real-time and batch transcription with diarization, multilingual support, and low latency, suited to enterprises and developers.

What is Voxtral?

Voxtral is a speech-to-text solution developed by Mistral AI for high-accuracy real-time and batch transcription. It pairs high transcription quality with speaker diarization and low-latency processing, which makes it suitable for a wide range of voice-driven applications. The suite includes separate batch and live transcription models, each optimized for its use case, and is built with privacy and efficiency in mind.

The platform is distinguished by multilingual transcription across 13 languages, support for audio recordings of up to three hours, and open weights for the Realtime model under the Apache 2.0 license. It also includes an audio playground within Mistral Studio, where users can test transcription features immediately. Whether for enterprise deployment, media production, or real-time voice applications, Voxtral aims to change how organizations use voice data.

Key Features

  • Voxtral Mini Transcribe V2: State-of-the-art batch transcription with speaker diarization, context biasing, and word-level timestamps in 13 languages.
  • Voxtral Realtime: Purpose-built for live transcription with configurable latency down to sub-200ms, ideal for voice agents and real-time applications.
  • Industry-leading Accuracy: Achieves the lowest word error rates across multiple languages and domains, outperforming competitors like GPT-4o mini Transcribe and Deepgram Nova.
  • Open-weights Model: Realtime model available under Apache 2.0 license, deployable on edge devices for privacy-sensitive applications.
  • Multilingual Support: Strong transcription performance in 13 languages including English, Chinese, Hindi, Spanish, Arabic, and more.
  • Efficient and Cost-effective: Delivers high accuracy at approximately $0.003 per minute of audio for Voxtral Mini Transcribe V2, with processing speeds approximately three times faster than some competitors.
  • Enterprise Features: Includes speaker diarization, context biasing for domain-specific vocabulary, and precise word-level timestamps.
  • Robust Noise Handling: Maintains accuracy in challenging acoustic environments such as factories, call centers, and outdoor recordings.
  • Long Audio Processing: Capable of transcribing recordings up to 3 hours in a single request.
  • Audio Playground: An interactive tool within Mistral Studio to upload, test, and customize transcription settings instantly.
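
The accuracy claim in the list above is expressed as word error rate (WER). For readers who want to check such figures on their own audio, here is a minimal sketch of the standard WER calculation: the word-level edit distance between a reference transcript and a model transcript, divided by the number of reference words. The function name `word_error_rate` and the lowercase, whitespace-based tokenization are illustrative assumptions and not part of Voxtral; published benchmarks usually apply more careful text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: simple lowercase + whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

A lower value is better, and the value can exceed 1.0 when the hypothesis contains many extra words.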

How to Use Voxtral

Getting started with Voxtral is straightforward. Users access the platform via Mistral Studio, where they can upload audio files in formats such as MP3, WAV, M4A, FLAC, or OGG, with each file up to 1 GB. For batch transcription, upload your audio, select the desired language, and choose options such as diarization, timestamps, and context biasing. The system processes the audio and returns a transcription with speaker labels, timestamps, and domain-specific vocabulary if configured.
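
Before uploading, it can help to check files against the limits described above. The following is a minimal sketch of such a pre-upload check, plus a small helper that gathers the batch options into one dictionary. Everything named here (`check_upload`, `build_options`, and the dictionary keys) is hypothetical and chosen for illustration; it is not Voxtral's API. The format list is the one quoted above, which the text presents as examples rather than an exhaustive list, and "1 GB" is read as 10^9 bytes.

```python
from pathlib import Path

# Limits taken from the text above; names are illustrative, not Voxtral's API.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
MAX_FILE_BYTES = 1_000_000_000  # assumption: "1 GB" means 10**9 bytes


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_FILE_BYTES}")
    return problems


def build_options(language: str, diarize: bool = True,
                  word_timestamps: bool = True,
                  context_terms: tuple[str, ...] = ()) -> dict:
    """Collect the batch options described above (hypothetical key names)."""
    return {
        "language": language,
        "diarize": diarize,
        "word_timestamps": word_timestamps,
        "context_bias": list(context_terms),
    }
```

For example, `check_upload("call.MP3", 10)` returns an empty list, while a `.txt` file or a 2 GB file yields one problem message each.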

For real-time applications, developers can integrate Voxtral Realtime into their voice-enabled systems. The model's streaming architecture delivers transcriptions with latency configurable down to under 200 milliseconds. Deployment can be in the cloud or on edge devices, thanks to the open weights, which enables privacy-focused solutions.
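
On the client side, streaming audio is usually sent in small fixed-duration chunks, and the chunk duration adds to end-to-end latency because a chunk cannot be sent until it has been recorded. The sketch below shows how chunk size relates to chunk duration for raw PCM audio. The sample rate and sample width are assumptions made for the example (16 kHz, 16-bit mono); the text above does not state which input format Voxtral Realtime expects, and `chunk_size_bytes` and `iter_chunks` are hypothetical helper names.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # assumption: not stated in the text
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM


def chunk_size_bytes(chunk_ms: int) -> int:
    """Bytes of raw PCM that cover chunk_ms milliseconds of audio."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000


def iter_chunks(pcm: bytes, chunk_ms: int) -> Iterator[bytes]:
    """Yield consecutive chunks of the buffer; the last one may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

Under these assumptions an 80 ms chunk is 2,560 bytes. Smaller chunks reduce buffering delay at the cost of more messages per second, so `chunk_ms` is a tuning knob to weigh against the latency target.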

The audio playground in Mistral Studio lets users test the models immediately by uploading sample files, toggling features, and adjusting settings to see results in real time. This makes it easy for developers and enterprises to evaluate the technology before integration.

Use Cases

  • Meeting and Conference Transcription: Automatically transcribe meetings, webinars, and conferences with speaker diarization and timestamps for easy reference.
  • Customer Support and Call Centers: Enable real-time transcription of customer calls for better analysis, quality assurance, and agent support.
  • Media and Content Production: Generate subtitles, captions, and searchable audio content for videos, podcasts, and broadcasts.
  • Voice Assistants and Voice-Enabled Devices: Power voice agents with low-latency, accurate speech recognition for seamless user interaction.
  • Legal and Medical Documentation: Transcribe interviews, depositions, and medical consultations with high accuracy and privacy compliance.
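
Several of these use cases (meeting transcripts, captions) come down to combining speaker labels with word-level timestamps. As a minimal sketch, the code below merges consecutive words from the same speaker into turns and renders them as SRT caption cues. The input shape, a list of dictionaries with `word`, `start`, `end`, and `speaker` keys, is an assumption made for illustration; the actual structure of Voxtral's output is not described in this text.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def group_turns(words: list[dict]) -> list[dict]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: list[dict] = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            turns[-1]["text"] += " " + w["word"]
            turns[-1]["end"] = w["end"]
        else:
            turns.append({"speaker": w["speaker"], "start": w["start"],
                          "end": w["end"], "text": w["word"]})
    return turns


def to_srt(words: list[dict]) -> str:
    """Render speaker turns as numbered SRT cues separated by blank lines."""
    cues = []
    for index, turn in enumerate(group_turns(words), start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(turn['start'])} --> {srt_timestamp(turn['end'])}\n"
            f"{turn['speaker']}: {turn['text']}\n"
        )
    return "\n".join(cues)
```

For real captions, long turns would also need to be split by duration or character count; that step is left out to keep the sketch short.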

FAQ

Q1: What languages does Voxtral support? A1: Voxtral supports 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.

Q2: Is the Voxtral Realtime model open-source? A2: Yes, the Realtime model weights are available under the Apache 2.0 license on the Hugging Face Hub, allowing for deployment on edge devices.

Q3: How much does Voxtral cost? A3: Pricing varies with usage; Voxtral Mini Transcribe V2 costs approximately $0.003 per minute of audio.
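
To turn the per-minute figure into a budget estimate, the arithmetic can be sketched as follows. The rate is the approximate figure quoted in the answer above; billing on exact duration, without rounding partial minutes up, is an assumption, and `estimate_cost_usd` is a hypothetical helper name.

```python
from decimal import Decimal

PRICE_PER_MINUTE_USD = Decimal("0.003")  # approximate figure quoted above


def estimate_cost_usd(duration_seconds: float) -> Decimal:
    """Estimated cost of transcribing one recording.

    Assumption: cost scales with exact duration (no per-minute rounding).
    """
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    return (minutes * PRICE_PER_MINUTE_USD).quantize(Decimal("0.0001"))
```

At this rate a full three-hour recording (180 minutes) comes to about $0.54, and 1,000 hours of audio to about $180.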

Q4: Can Voxtral handle long recordings? A4: Yes, it can process recordings up to 3 hours in a single request.
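
Recordings longer than three hours would need to be split across several requests. A minimal sketch of planning those splits is shown below; `plan_segments` is a hypothetical helper, not part of Voxtral. Cutting at fixed offsets is a simplification: a cut can land mid-word, and speaker labels from separate requests may not correspond to each other, so a real pipeline would likely cut at silences and reconcile speakers afterwards.

```python
MAX_REQUEST_SECONDS = 3 * 60 * 60  # 3-hour limit quoted above


def plan_segments(total_seconds: float,
                  max_seconds: float = MAX_REQUEST_SECONDS) -> list[tuple[float, float]]:
    """Return (start, end) offsets covering the recording, each at most max_seconds long."""
    if total_seconds <= 0 or max_seconds <= 0:
        raise ValueError("durations must be positive")
    segments = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments
```

A two-hour recording fits in a single segment, while an eight-hour recording is planned as three segments of 3, 3, and 2 hours.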

Q5: What are the system requirements for deploying Voxtral models? A5: The models are efficient, with a 4B parameter footprint, suitable for deployment on edge devices and cloud environments, depending on your infrastructure.