Grok Speech to Text and Text to Speech APIs

xAI’s Grok Speech to Text and Text to Speech APIs let developers add transcription and speech generation to apps through REST and WebSocket endpoints. The product supports multilingual STT, expressive TTS, and usage-based pricing.

Overview

Grok Speech to Text and Text to Speech APIs are standalone audio endpoints from xAI for developers who need speech recognition and speech synthesis in their applications. The product provides two separate APIs: Grok STT for transcription and Grok TTS for generated speech.

According to xAI, the APIs are built on the same stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. xAI positions them for voice agents, real-time transcription tools, accessibility solutions, podcasts, and other interactive audio workflows.

Core capabilities

Batch and streaming transcription

Generate transcripts from large audio files through a REST workflow, or transcribe speech as it happens with a low-latency WebSocket API.

Structured transcription output

Improve transcript usability with word-level timestamps, speaker diarization, multichannel handling, and inverse text normalization for numbers, dates, currencies, and similar text.
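As an illustration of how word-level timestamps and speaker labels might be consumed, the sketch below groups consecutive words from the same speaker into readable turns. The input shape (a list of dicts with `text`, `start`, `end`, and `speaker` keys) is an assumption made for this example, not the documented Grok STT response schema, and the sample data is invented.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def group_into_turns(words):
    """Merge consecutive words by the same speaker into one turn.

    `words` uses a hypothetical schema: dicts with
    "text", "start", "end", and "speaker" keys.
    """
    turns = []
    for w in words:
        if turns and turns[-1].speaker == w["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = w["end"]
            turns[-1].text += " " + w["text"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(w["speaker"], w["start"], w["end"], w["text"]))
    return turns


def format_turn(turn):
    """Render a turn as a single transcript line."""
    return f"[{turn.start:.2f}-{turn.end:.2f}] {turn.speaker}: {turn.text}"


# Invented sample data in the assumed schema.
sample = [
    {"text": "Hello", "start": 0.0, "end": 0.4, "speaker": "S1"},
    {"text": "there.", "start": 0.4, "end": 0.8, "speaker": "S1"},
    {"text": "Hi!", "start": 1.0, "end": 1.3, "speaker": "S2"},
]

for turn in group_into_turns(sample):
    print(format_turn(turn))
```

The field names would need to be adapted to whatever the actual transcription response contains.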

Multilingual support

Work across 25+ languages and switch languages without interrupting the transcription flow.

Text-to-speech generation

Create speech from text with natural, expressive voices through REST or real-time WebSocket endpoints.

Fine-grained voice control

Control delivery with inline and wrapping speech tags such as `[laugh]`, `[sigh]`, `[whisper]`, `<emphasis>`, `<slow>`, and `<pause>`.
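A minimal sketch of how tagged input text could be assembled before it is sent for synthesis. The helper names `inline_tag` and `wrap_tag` are hypothetical, and the closing-tag form `</name>` for wrapping tags is an assumption; the exact tag syntax should be checked against xAI's documentation.

```python
def inline_tag(name: str) -> str:
    """Return a bracketed inline tag such as [laugh]."""
    return f"[{name}]"


def wrap_tag(name: str, text: str) -> str:
    """Wrap text in an angle-bracket tag pair.

    Assumption: wrapping tags close with </name>.
    """
    return f"<{name}>{text}</{name}>"


# Build one line of speech text that mixes both tag styles.
script = " ".join([
    "That was",
    wrap_tag("emphasis", "really"),
    "unexpected.",
    inline_tag("laugh"),
    wrap_tag("slow", "Let me say that again."),
])

print(script)
```

Keeping tag construction in small helpers makes it easier to adjust the syntax in one place if the real format differs from this assumption.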

Simple usage-based billing

Billing is usage-based, with separate rates for batch STT, streaming STT, and TTS.

Practical use cases

  • Transcription pipelines

    Add speech recognition to customer-facing or internal tools that need batch transcription of uploaded audio or live transcription during a conversation.

  • Voice agents and assistants

    Build assistants that listen to spoken input and return structured text with timestamps, speaker labels, and normalized entities.

  • Text-to-speech output

    Generate spoken output for narrated content, interactive experiences, or accessibility features where natural delivery matters.

  • Meetings and call analysis

    Handle multi-speaker recordings such as meetings, interviews, or support calls where speaker separation and multichannel input improve readability.

  • Multilingual audio workflows

    Support teams that work across languages and need a transcription system that can switch among many languages without changing the workflow.

Pros and Cons

Pros

  • Provides separate STT and TTS APIs for different audio workflows.
  • Supports both REST and WebSocket access for batch and real-time use.
  • Includes transcript features such as timestamps, diarization, multichannel support, and inverse text normalization.
  • Offers multilingual STT across 25+ languages.
  • Gives TTS users speech tags for inline control over emphasis, pauses, whispering, and other delivery details.

Cons

  • The page does not publish a single all-in-one price; billing is split across STT and TTS usage types.
  • Full rate limits are not listed on the announcement page; it points to the xAI API console for them.
  • The announcement page does not spell out SDKs, supported programming languages, or integration partners.

FAQ

What are the Grok STT and TTS APIs for?

The APIs are standalone audio endpoints for developers who want to add speech recognition or speech synthesis to an application. The page highlights use cases such as voice agents, real-time transcription, accessibility tools, podcasts, and interactive audio experiences.

Which interfaces do the APIs support?

Speech to Text is available through both a batch REST API and a real-time WebSocket API. Text to Speech is also available through REST and WebSocket endpoints, so developers can choose batch or streaming workflows.

What transcription features does Grok STT include?

Grok STT supports multilingual transcription across 25+ languages. The page also highlights word-level timestamps, speaker diarization, multichannel support, and inverse text normalization for structured transcript output.

How does Grok TTS handle voice style and expression?

Text to Speech supports natural, expressive voices and speech tags for fine-grained control over prosody and emotion. The page lists inline and wrapping tags such as `[laugh]`, `[sigh]`, `[whisper]`, `<emphasis>`, `<slow>`, and `<pause>` as examples.

How is pricing structured?

Pricing is usage-based. The page states Speech to Text is priced per hour for batch and streaming usage, and Text to Speech is priced per million characters. Current rate limits and full details are available in the xAI API console.
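Because the units differ (hours for STT, millions of characters for TTS), a rough cost estimate has to combine them explicitly. The sketch below shows the arithmetic only. The `Rates` values are placeholders chosen for easy math, not xAI's actual prices, and the names `Rates` and `estimate_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    batch_stt_per_hour: float       # currency units per audio hour
    streaming_stt_per_hour: float   # currency units per audio hour
    tts_per_million_chars: float    # currency units per 1,000,000 characters


def estimate_cost(rates, batch_hours=0.0, streaming_hours=0.0, tts_chars=0):
    """Sum the three usage types, each billed in its own unit."""
    stt_batch = batch_hours * rates.batch_stt_per_hour
    stt_stream = streaming_hours * rates.streaming_stt_per_hour
    tts = (tts_chars / 1_000_000) * rates.tts_per_million_chars
    return stt_batch + stt_stream + tts


# Placeholder rates for illustration only; look up real rates before use.
PLACEHOLDER_RATES = Rates(
    batch_stt_per_hour=1.0,
    streaming_stt_per_hour=2.0,
    tts_per_million_chars=10.0,
)

total = estimate_cost(
    PLACEHOLDER_RATES,
    batch_hours=10,
    streaming_hours=5,
    tts_chars=500_000,
)
print(f"Estimated cost: {total:.2f}")
```

With real rates substituted in, the same function can compare a batch-heavy workload against a streaming-heavy one.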

Quick Facts

Category: Developer Tool
Source domain: x.ai
Primary products: Grok Speech to Text and Grok Text to Speech
Access patterns: REST API and WebSocket API
Pricing model: Usage-based billing
Language support: 25+ languages for STT

Alternatives to Grok Speech to Text and Text to Speech APIs