Grok Speech to Text and Text to Speech APIs

What is Grok Speech to Text (STT) and Text to Speech (TTS)?

Grok Speech to Text (STT) and Grok Text to Speech (TTS) are standalone audio APIs from xAI for converting speech to text and text to speech. They’re designed so developers can add voice capabilities to their own applications using REST and WebSocket endpoints.

Grok STT aims to produce accurate transcripts with structured output options. Grok TTS turns text into speech with natural, expressive delivery and gives fine-grained control over prosody through speech tags.

Key Features

High-accuracy, low-latency transcription: Generate transcripts from large audio files using the REST API and transcribe speech in real time using a WebSocket API.
Word-level timestamps and speaker diarization: Provides word-level timestamps and, through diarization, word-level speaker IDs that separate and identify speakers in both pre-recorded and streaming audio.
Multichannel support: Transcribe multichannel audio files with speaker separation handled through the same API.
Inverse Text Normalization (with formatting enabled): Converts spoken language into structured, properly formatted outputs for items such as numbers, dates, and currencies (e.g., transforming “my phone number is …” into the expected formatted form).
Multilingual speech recognition: Supports 25+ languages and allows seamless switching between languages.
Speech tags for expressive TTS: Use inline and wrapping speech tags such as [laugh], [sigh], and [whisper] to control delivery.
REST and WebSocket generation for TTS: Create speech from text with REST for batch-style generation and use WebSocket for real-time speech output.

How to Use Grok Speech to Text (STT) and Text to Speech (TTS)

Start with the xAI API console and use the provided endpoints for either STT or TTS.
For transcription, choose REST when you want to transcribe large audio files and WebSocket when you need low-latency, real-time transcription.
For TTS, submit text via REST to generate speech, or use WebSocket if you need real-time speech output.
If you require structured transcripts, enable formatting to use inverse text normalization. For TTS expressiveness, add speech tags to control prosody.
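As a rough illustration of the REST versus WebSocket choice above, the sketch below picks a transport and splits raw audio into small frames, which is how audio is commonly fed to a streaming endpoint. The names pick_transport and pcm_frames, the 16 kHz 16-bit mono format, and the 100 ms frame size are assumptions made for this example; they are not taken from the xAI API, whose actual audio requirements should be checked in its documentation.

```python
from typing import Iterator


def pick_transport(realtime: bool) -> str:
    """Hypothetical helper: WebSocket for live audio, REST for batch files."""
    return "websocket" if realtime else "rest"


def pcm_frames(
    audio: bytes,
    sample_rate: int = 16_000,   # assumed sample rate, not an xAI requirement
    bytes_per_sample: int = 2,   # assumed 16-bit mono PCM
    frame_ms: int = 100,         # assumed frame duration
) -> Iterator[bytes]:
    """Yield fixed-duration frames of raw mono PCM audio.

    The final frame may be shorter than the others.
    """
    frame_bytes = sample_rate * bytes_per_sample * frame_ms // 1000
    if frame_bytes <= 0:
        raise ValueError("frame size must be positive")
    for offset in range(0, len(audio), frame_bytes):
        yield audio[offset:offset + frame_bytes]


if __name__ == "__main__":
    one_second_of_silence = bytes(16_000 * 2)
    frames = list(pcm_frames(one_second_of_silence))
    print(pick_transport(realtime=True), len(frames), len(frames[0]))
```

In a real client, each frame would be sent over the open WebSocket connection as it becomes available, while a batch job would upload the whole file in a single REST request.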

Use Cases

Voice agents and interactive assistants: Transcribe user speech in real time and feed the resulting text into your dialog or workflow logic.
Real-time transcription for meetings or support calls: Use diarization and word-level speaker IDs to attribute parts of the conversation to the correct speaker.
Accessibility tools: Convert spoken language into properly structured text (including numbers, dates, and currency) and optionally support multiple languages.
Podcasts and audio production workflows: Generate transcripts from longer recordings (batch transcription) and use TTS to turn scripts or structured text back into audio.
Interactive audio experiences: Combine controlled TTS (speech tags for emphasis, pauses, and expressive cues) with transcription to support two-way voice interactions.

FAQ

What endpoints are available for transcription and speech generation? Both Grok STT and Grok TTS provide REST endpoints for batch-style requests and WebSocket endpoints for low-latency or real-time use.

Does Grok STT support speaker identification? Yes. The API includes speaker diarization and word-level speaker IDs for both pre-recorded and real-time streaming audio.
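Word-level speaker IDs are usually easier to read once consecutive words from the same speaker are merged into turns. The sketch below shows one way to do that. The Word and Turn shapes (text, start, end, speaker) are assumptions made for this example and may not match the fields the API actually returns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Assumed shape of one transcribed word (hypothetical field names)."""
    text: str
    start: float   # seconds
    end: float     # seconds
    speaker: int   # speaker ID assigned by diarization


@dataclass
class Turn:
    speaker: int
    start: float
    end: float
    text: str


def group_turns(words: List[Word]) -> List[Turn]:
    """Merge consecutive words spoken by the same speaker into one turn."""
    turns: List[Turn] = []
    for word in words:
        if turns and turns[-1].speaker == word.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = word.end
            turns[-1].text += " " + word.text
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(word.speaker, word.start, word.end, word.text))
    return turns


if __name__ == "__main__":
    sample = [
        Word("hi", 0.0, 0.3, 0),
        Word("there", 0.3, 0.6, 0),
        Word("hello", 0.7, 1.0, 1),
    ]
    for turn in group_turns(sample):
        print(f"speaker {turn.speaker} [{turn.start:.1f}-{turn.end:.1f}]: {turn.text}")
```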

Is formatting or structured output available for transcriptions? Yes. With formatting enabled, Grok STT applies Inverse Text Normalization to convert spoken language into structured output for items such as numbers, dates, and currencies.
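To make the idea of Inverse Text Normalization concrete, here is a deliberately tiny toy version that rewrites spoken cardinal numbers from 0 to 99 as digits. It is not how Grok STT implements the feature; the function name normalize_numbers and the word tables are made up for this illustration, and a production system would also handle dates, currencies, punctuation, and context (for example, "one" used as a pronoun).

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def normalize_numbers(text: str) -> str:
    """Toy ITN: replace spoken numbers 0-99 with digits, word by word."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        word = tokens[i].lower()
        if word in TENS:
            value = TENS[word]
            following = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
            # "twenty five" -> 25: absorb a following unit from one to nine.
            if following in UNITS and 1 <= UNITS[following] <= 9:
                value += UNITS[following]
                i += 1
            out.append(str(value))
        elif word in UNITS:
            out.append(str(UNITS[word]))
        else:
            out.append(tokens[i])
        i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(normalize_numbers("I have twenty five apples and three pears"))
```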

How many languages does Grok STT support? Grok STT supports more than 25 languages and handles switching between languages seamlessly.

How can I control TTS delivery style? Grok TTS provides speech tags (for example [laugh], [sigh], and [whisper]) that you can include in text to control prosody and emotion.
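When scripts contain bracketed speech tags, it can help to check them before sending the text for synthesis. The sketch below finds bracketed tags, flags any that are not in a known set, and produces a tag-free version for display. The helper names are hypothetical, and KNOWN_TAGS lists only the three tags named above; the full set of supported tags is likely larger and should be taken from the official documentation.

```python
import re
from typing import List

# Matches inline tags written as a lowercase word in square brackets.
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")

# Only the tags mentioned in this document; an assumption, not the full list.
KNOWN_TAGS = {"laugh", "sigh", "whisper"}


def find_tags(text: str) -> List[str]:
    """Return the bracketed tag names in the order they appear."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text: str) -> List[str]:
    """Return tags that are not in KNOWN_TAGS, for example typos."""
    return [tag for tag in find_tags(text) if tag not in KNOWN_TAGS]


def strip_tags(text: str) -> str:
    """Remove bracketed tags and tidy whitespace, e.g. for on-screen captions."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()


if __name__ == "__main__":
    script = "Well [sigh] that was close [laugh]"
    print(find_tags(script))
    print(strip_tags(script))
```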

Alternatives

Speech-to-text APIs (general category): Other STT providers offer REST/WebSocket transcription with options like diarization and punctuation/formatting. Compare them based on latency, diarization quality, and how they handle inverse text normalization.
Text-to-speech APIs with markup/tags (general category): Many TTS APIs support SSML-like or custom tagging to influence prosody. Compare tag expressiveness, supported controls, and whether you need REST vs real-time WebSocket generation.
Building custom audio pipelines (general category): Some teams may assemble ASR and formatting components themselves (separate transcription + normalization). This can increase integration complexity but may offer more control over each step.
Using a conversational voice platform vs standalone APIs: Instead of standalone STT/TTS endpoints, you can adopt end-to-end voice agent platforms. This typically trades the flexibility of standalone APIs for a more integrated workflow.