Grok Speech to Text and Text to Speech APIs
Grok Speech to Text and Text to Speech APIs by xAI convert between audio and text through low-latency REST and WebSocket endpoints, with multilingual support and speaker diarization.
What is Grok Speech to Text (STT) and Text to Speech (TTS)?
Grok Speech to Text (STT) and Grok Text to Speech (TTS) are standalone audio APIs from xAI for converting speech to text and text to speech. They’re designed so developers can add voice capabilities to their own applications using REST and WebSocket endpoints.
The goal of Grok STT is to produce accurate transcripts with structured output options. Grok TTS focuses on turning text into speech with natural, expressive delivery and fine-grained control over prosody through speech tags.
Key Features
- High-accuracy, low-latency transcription: Generate transcripts from large audio files using the REST API and transcribe speech in real time using a WebSocket API.
- Word-level timestamps and speaker diarization: Provides word-level timestamps and assigns a speaker ID to each word through diarization, separating and identifying speakers in both pre-recorded and streaming audio.
- Multichannel support: Transcribe multichannel audio files with speaker separation handled through the same API.
- Inverse Text Normalization (with formatting enabled): Converts spoken language into structured, properly formatted outputs for items such as numbers, dates, and currencies (e.g., transforming “my phone number is …” into the expected formatted form).
- Multilingual speech recognition: Supports 25+ languages and allows seamless switching between languages.
- Speech tags for expressive TTS: Use inline and wrapping speech tags such as [laugh], [sigh], and [whisper] to control delivery.
- REST and WebSocket generation for TTS: Create speech from text with REST for batch-style generation and use WebSocket for real-time speech output.
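To make the word-level speaker IDs concrete, here is a minimal Python sketch that merges consecutive words from the same speaker into turns. The `Word` record and its field names (`text`, `start`, `end`, `speaker`) are assumptions made for illustration; this page does not describe the actual response schema, so adapt them to whatever the API returns.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Word:
    """Hypothetical word-level record; field names are assumed, not from the xAI docs."""
    text: str
    start: float  # seconds from the beginning of the audio
    end: float
    speaker: int  # speaker ID assigned by diarization


def speaker_turns(words):
    """Merge consecutive words by the same speaker into (speaker, start, end, text) turns."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w.speaker):
        chunk = list(group)
        text = " ".join(w.text for w in chunk)
        turns.append((speaker, chunk[0].start, chunk[-1].end, text))
    return turns


# Made-up sample data, only to exercise the function.
words = [
    Word("Hi", 0.0, 0.2, 0),
    Word("there", 0.25, 0.5, 0),
    Word("Hello", 0.8, 1.1, 1),
    Word("Thanks", 1.4, 1.8, 0),
]

for turn in speaker_turns(words):
    print(turn)
```

Grouping only consecutive words keeps the turns in conversation order, so a speaker who talks twice produces two separate turns.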
How to Use Grok Speech to Text (STT) and Text to Speech (TTS)
- Start with the xAI API console and use the provided endpoints for either STT or TTS.
- For transcription, choose REST when you want to transcribe large audio files and WebSocket when you need low-latency, real-time transcription.
- For TTS, submit text via REST to generate speech, or use WebSocket if you need real-time speech output.
- If you require structured transcripts, enable formatting to use inverse text normalization. For TTS expressiveness, add speech tags to control prosody.
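The decisions in the steps above can be sketched as a small planning helper. Everything named here is hypothetical: `SttPlan`, `plan_stt`, and `with_tag` are illustration-only helpers, not part of any xAI SDK, and placing a tag directly before the text it affects is an assumption about tag syntax rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SttPlan:
    """Hypothetical summary of how a transcription job would be submitted."""
    transport: str    # "rest" for large files, "websocket" for real-time audio
    formatting: bool  # True to request inverse text normalization


def plan_stt(realtime: bool, structured: bool) -> SttPlan:
    """Pick a transport and formatting setting following the guidance above."""
    return SttPlan("websocket" if realtime else "rest", structured)


# Only the tags named on this page; the full tag set is not listed here.
KNOWN_TAGS = {"laugh", "sigh", "whisper"}


def with_tag(text: str, tag: str) -> str:
    """Prefix text with a speech tag (assumed placement), rejecting unknown tags."""
    if tag not in KNOWN_TAGS:
        raise ValueError(f"unknown speech tag: {tag}")
    return f"[{tag}] {text}"


print(plan_stt(realtime=True, structured=False))
print(with_tag("That was close.", "sigh"))
```

Keeping the choice in one function makes it easy to swap in the real endpoint names and request fields once you have them from the API console.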
Use Cases
- Voice agents and interactive assistants: Transcribe user speech in real time and feed the resulting text into your dialog or workflow logic.
- Real-time transcription for meetings or support calls: Use diarization and word-level speaker IDs to attribute parts of the conversation to the correct speaker.
- Accessibility tools: Convert spoken language into properly structured text (including numbers, dates, and currency) and optionally support multiple languages.
- Podcasts and audio production workflows: Generate transcripts from longer recordings (batch transcription) and use TTS to turn scripts or structured text back into audio.
- Interactive audio experiences: Combine controlled TTS (speech tags for emphasis, pauses, and expressive cues) with transcription to support two-way voice interactions.
FAQ
What endpoints are available for transcription and speech generation? Both Grok STT and Grok TTS offer REST endpoints for batch-style requests and WebSocket endpoints for low-latency or real-time use.
Does Grok STT support speaker identification? Yes. The API includes speaker diarization and word-level speaker IDs for both pre-recorded and real-time streaming audio.
Is formatting or structured output available for transcriptions? Yes. With formatting enabled, Grok STT applies Inverse Text Normalization to convert spoken language into structured output for items such as numbers, dates, and currencies.
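As a rough illustration of what inverse text normalization does, the toy Python sketch below rewrites spoken number words as digits and attaches a dollar sign when the number is followed by "dollars". It is not xAI's implementation; it assumes lowercase, unpunctuated input, handles only numbers below one thousand, and will also convert a word like "one" when it is used as a pronoun.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
NUMBER_WORDS = set(UNITS) | set(TENS) | {"hundred"}


def words_to_int(tokens):
    """Convert a run of number words such as ['one', 'hundred', 'five'] to an int."""
    total = 0
    for token in tokens:
        if token in UNITS:
            total += UNITS[token]
        elif token in TENS:
            total += TENS[token]
        else:  # "hundred": a bare "hundred" counts as one hundred
            total = max(total, 1) * 100
    return total


def normalize(text):
    """Replace runs of number words with digits; '<number> dollars' becomes '$<number>'."""
    out, run = [], []
    for token in text.split() + [""]:  # empty sentinel flushes a trailing run
        if token in NUMBER_WORDS:
            run.append(token)
            continue
        if run:
            value = words_to_int(run)
            run = []
            if token in ("dollar", "dollars"):
                out.append(f"${value}")
                continue
            out.append(str(value))
        if token:
            out.append(token)
    return " ".join(out)


print(normalize("the total is one hundred twenty five dollars"))
print(normalize("call me in twenty minutes"))
```

A production system also covers dates, phone numbers, and other currencies, which is why this step is usually left to the API's formatting option.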
How many languages does Grok STT support? Grok STT supports 25+ languages and allows seamless switching between them.
How can I control TTS delivery style? Grok TTS provides speech tags (for example [laugh], [sigh], and [whisper]) that you add to the input text to control delivery and prosody.
Alternatives
- Speech-to-text APIs (general category): Other STT providers offer REST/WebSocket transcription with options like diarization and punctuation/formatting. Compare them based on latency, diarization quality, and how they handle inverse text normalization.
- Text-to-speech APIs with markup/tags (general category): Many TTS APIs support SSML-like or custom tagging to influence prosody. Compare tag expressiveness, supported controls, and whether you need REST vs real-time WebSocket generation.
- Building custom audio pipelines (general category): Some teams may assemble ASR and formatting components themselves (separate transcription + normalization). This can increase integration complexity but may offer more control over each step.
- Using a conversational voice platform vs standalone APIs: Instead of standalone STT/TTS endpoints, you can adopt an end-to-end voice agent platform. This typically trades the flexibility of standalone APIs for a more integrated workflow.
Alternatives
Sanota
Sanota turns your voice into clear, beautiful text—capture memories and ideas easily, then start for free.
Speech to Text Converter Online
A free online tool that converts audio and video files into accurate text transcripts in over 45 languages. It supports numerous file formats and requires no downloads or sign-ups.
MiniCPM-o 4.5
MiniCPM-o 4.5 is a highly capable multimodal AI model designed for vision, speech, and full-duplex live streaming, offering advanced visual understanding, speech synthesis, and real-time interactive capabilities in a compact 9B parameter architecture.
Dictato
Dictato is an offline voice-to-text dictation app for macOS that transcribes on-device and inserts into any app you type in. No cloud.
CAMB.AI
Turn a single live stream into a multilingual broadcast with real-time AI audio dubbing for YouTube, Twitch, X and more.
Tavus
Tavus builds AI systems for real-time, face-to-face interactions that can see, hear, and respond, with APIs for video agents, twins & companions.