Grok Speech to Text and Text to Speech APIs

xAI’s Grok Speech to Text and Text to Speech APIs let developers add transcription and speech generation to apps through REST and WebSocket endpoints. The product supports multilingual STT, expressive TTS, and usage-based pricing.

AI Speech Recognition

AI Speech Synthesis

Speech to Text

Overview

Grok Speech to Text and Text to Speech APIs are standalone audio endpoints from xAI for developers who need speech recognition and speech synthesis in their applications. The product provides two separate APIs: Grok STT for transcription and Grok TTS for generated speech.

The APIs are built on the same stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. xAI positions them for voice agents, real-time transcription tools, accessibility solutions, podcasts, and other interactive audio workflows.

Core capabilities

Batch and streaming transcription

Generate transcripts from large audio files through a REST workflow, or transcribe speech as it happens with a low-latency WebSocket API.
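A streaming client usually sends audio in small frames rather than as a single upload. The sketch below shows only that client-side framing step. The function name `iter_pcm_chunks`, the 16 kHz 16-bit mono PCM format, and the 100 ms frame length are illustrative assumptions, not values documented by xAI; the actual endpoint, audio format, and message framing should be taken from the xAI API documentation.

```python
from typing import Iterator


def iter_pcm_chunks(
    pcm: bytes,
    sample_rate: int = 16_000,  # assumed sample rate in Hz
    sample_width: int = 2,      # assumed 16-bit mono samples (2 bytes each)
    chunk_ms: int = 100,        # assumed frame length per outgoing message
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of raw PCM audio."""
    if min(sample_rate, sample_width, chunk_ms) <= 0:
        raise ValueError("sample_rate, sample_width and chunk_ms must be positive")
    # Whole samples per frame first, then bytes, so frames never split a sample.
    chunk_bytes = (sample_rate * chunk_ms // 1000) * sample_width
    if chunk_bytes == 0:
        raise ValueError("chunk_ms is too short for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        # The final slice may be shorter than chunk_bytes.
        yield pcm[start:start + chunk_bytes]


# One second of silence at the assumed format is 32,000 bytes.
frames = list(iter_pcm_chunks(bytes(32_000)))
```

Each yielded frame would then be written to the open WebSocket connection; a batch workflow skips this step and uploads the file as a whole.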

Structured transcription output

Improve transcript usability with word-level timestamps, speaker diarization, multichannel handling, and inverse text normalization for numbers, dates, currencies, and similar text.
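As a rough illustration of why word-level timestamps and speaker labels are useful together, the sketch below merges consecutive words from the same speaker into readable turns. The `Word` and `Turn` record shapes and the `group_into_turns` helper are hypothetical; the real response schema is not described on this page and may use different field names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    text: str
    start: float   # seconds from the start of the audio
    end: float
    speaker: str   # diarization label (hypothetical field name)


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: Iterable[Word]) -> List[Turn]:
    """Merge consecutive words spoken by the same speaker into one turn."""
    turns: List[Turn] = []
    for word in words:
        if turns and turns[-1].speaker == word.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = word.end
            turns[-1].text += " " + word.text
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(word.speaker, word.start, word.end, word.text))
    return turns


sample = [
    Word("Hello", 0.0, 0.4, "A"),
    Word("there", 0.4, 0.8, "A"),
    Word("Hi", 1.0, 1.2, "B"),
]
turns = group_into_turns(sample)
```

With the sample input, speaker A's two words collapse into a single turn spanning 0.0 to 0.8 seconds, followed by a separate turn for speaker B.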

Multilingual support

Work across 25+ languages and switch languages without interrupting the transcription flow.

Text-to-speech generation

Create speech from text with natural, expressive voices through REST or real-time WebSocket endpoints.

Fine-grained voice control

Control delivery with inline and wrapping speech tags such as `[laugh]`, `[sigh]`, `[whisper]`, `<emphasis>`, `<slow>`, and `<pause>`.
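A small helper can keep tagged scripts tidy before they are sent for synthesis. The sketch below assumes that square-bracket tags are inline markers and that `<emphasis>` and `<slow>` wrap text with an XML-style closing tag such as `</emphasis>`. That closing syntax is an assumption, as are the helper names `inline`, `wrap`, and `is_balanced`; this page does not specify how each tag is written or closed.

```python
import re
from typing import Iterable

# Matches <name> and </name>; the closing form is an assumption.
WRAP_TAG = re.compile(r"<(/?)([a-z]+)>")


def inline(tag: str) -> str:
    """Format an inline marker such as [laugh]."""
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Wrap text in an opening and (assumed) closing tag."""
    return f"<{tag}>{text}</{tag}>"


def is_balanced(text: str, wrapping: Iterable[str] = ("emphasis", "slow")) -> bool:
    """Check that the listed wrapping tags open and close in nested order."""
    names = set(wrapping)
    stack = []
    for closing, name in WRAP_TAG.findall(text):
        if name not in names:
            continue  # tags outside the assumed wrapping set are ignored
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


script = f"{inline('sigh')} {wrap('emphasis', 'Fine')}, let's go."
```

Running `is_balanced(script)` before a request catches mismatched or unclosed wrapping tags early, under the closing-tag assumption above.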

Simple usage-based billing

Pay only for what you use, with separate rates for batch STT, streaming STT, and TTS.

Practical use cases

Transcription pipelines

Add speech recognition to customer-facing or internal tools that need batch transcription of uploaded audio or live transcription during a conversation.

Voice agents and assistants

Build assistants that listen to spoken input and return structured text with timestamps, speaker labels, and normalized entities.

Text-to-speech output

Generate spoken output for narrated content, interactive experiences, or accessibility features where natural delivery matters.

Meetings and call analysis

Handle multi-speaker recordings such as meetings, interviews, or support calls where speaker separation and multichannel input improve readability.

Multilingual audio workflows

Support teams that work across languages and need a transcription system that can switch among many languages without changing the workflow.

Pros and Cons

Pros

Provides separate STT and TTS APIs for different audio workflows.
Supports both REST and WebSocket access for batch and real-time use.
Includes transcript features such as timestamps, diarization, multichannel support, and inverse text normalization.
Offers multilingual STT across 25+ languages.
Gives TTS users speech tags for inline control over emphasis, pauses, whispering, and other delivery details.

Cons

The announcement page does not publish a single all-in-one price; billing is split across STT and TTS usage types.
Full rate limits are not listed on the announcement page, which points to the xAI API console for them.
The announcement page does not spell out SDKs, supported programming languages, or integration partners.

FAQ

What are Grok STT and TTS APIs for?

The APIs are standalone audio endpoints for developers who want to add speech recognition or speech synthesis to an application. The announcement page highlights use cases such as voice agents, real-time transcription, accessibility tools, podcasts, and interactive audio experiences.

Which interfaces do the APIs support?

Speech to Text is available through both a batch REST API and a real-time WebSocket API. Text to Speech is also available through REST and WebSocket endpoints, so developers can choose batch or streaming workflows.

What transcription features does Grok STT include?

Grok STT supports multilingual transcription across 25+ languages. The page also highlights word-level timestamps, speaker diarization, multichannel support, and inverse text normalization for structured transcript output.

How does Grok TTS handle voice style and expression?

Text to Speech supports natural, expressive voices and speech tags for fine-grained control over prosody and emotion. The page lists inline and wrapping tags such as `[laugh]`, `[sigh]`, `[whisper]`, `<emphasis>`, `<slow>`, and `<pause>` as examples.

How is pricing structured?

Pricing is usage-based. The page states Speech to Text is priced per hour for batch and streaming usage, and Text to Speech is priced per million characters. Current rate limits and full details are available in the xAI API console.
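Because the three usage types are metered in different units, a rough monthly estimate adds hours of audio multiplied by an hourly rate to characters divided by one million multiplied by a per-million rate. The sketch below shows that arithmetic only. The function `estimate_cost` is hypothetical, and the rates in the example are placeholders, not xAI's prices; substitute the current rates from the xAI API console.

```python
def estimate_cost(
    batch_hours: float,
    streaming_hours: float,
    tts_characters: int,
    *,
    batch_rate_per_hour: float,
    streaming_rate_per_hour: float,
    tts_rate_per_million_chars: float,
) -> float:
    """Combine per-hour STT usage and per-million-character TTS usage."""
    stt_cost = (
        batch_hours * batch_rate_per_hour
        + streaming_hours * streaming_rate_per_hour
    )
    tts_cost = tts_characters / 1_000_000 * tts_rate_per_million_chars
    return stt_cost + tts_cost


# Placeholder rates for illustration only (not real prices).
total = estimate_cost(
    batch_hours=10,
    streaming_hours=2,
    tts_characters=500_000,
    batch_rate_per_hour=1.0,
    streaming_rate_per_hour=2.0,
    tts_rate_per_million_chars=4.0,
)
```

Keeping the rates as required keyword arguments avoids baking in numbers that may change.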

Quick Facts

Category: Developer Tool
Source domain: x.ai
Primary products: Grok Speech to Text and Grok Text to Speech
Access patterns: REST API and WebSocket API
Pricing model: Usage-based billing
Language support: 25+ languages for STT

Alternatives to Grok Speech to Text and Text to Speech APIs

Sanota

Sanota is an app that turns spoken memories, reflections, and interviews into clear written stories. It supports personal storytelling, family history, and shared memories, with guided prompts and subscription pricing.

Carbon Voice

Carbon Voice is an asynchronous voice messaging app for teams and individuals, with transcripts, AI catch-up, and cross-device access. It helps people and agents communicate without needing a live call.

Talkpal

Talkpal is an AI-powered language learning web and mobile app for practicing speaking, listening, writing, and pronunciation. It offers guided courses, roleplays, and call-style conversation practice across 130+ languages.

Speech to Text Converter

Speech to Text Converter is a browser-based transcription tool for live dictation and uploaded audio or video files. It offers a free tier for short tasks and a Pro plan for unlimited transcription, AI summaries, translation, speaker identification, and advanced exports.

MiniCPM-o 4.5

MiniCPM-o 4.5 is a multimodal AI model on Hugging Face for vision, speech, text, and full-duplex live streaming. It supports local and server-side inference paths, including PyTorch, llama.cpp, Ollama, vLLM, SGLang, and quantized formats.

Dictato

Dictato is a Mac dictation app that transcribes speech into text in any app using an on-device, offline workflow. It supports multiple transcription engines, optional cleanup and translation, and a one-time purchase license.