Grok Speech to Text and Text to Speech APIs are standalone audio endpoints from xAI for developers who need speech recognition and speech synthesis in their applications. The product provides two separate APIs: Grok STT for transcription and Grok TTS for generated speech.
The APIs are built on the same stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. xAI positions them for voice agents, real-time transcription tools, accessibility solutions, podcasts, and other interactive audio workflows.
Generate transcripts from large audio files through a REST workflow, or transcribe speech as it happens with a low-latency WebSocket API.
Improve transcript usability with word-level timestamps, speaker diarization, multichannel handling, and inverse text normalization for numbers, dates, currencies, and similar text.
Work across 25+ languages and switch languages without interrupting the transcription flow.
Create speech from text with natural, expressive voices through REST or real-time WebSocket endpoints.
Control delivery with inline and wrapping speech tags such as `[laugh]`, `[sigh]`, `[whisper]`, `<emphasis>`, `<slow>`, and `<pause>`.
Use usage-based pricing with separate rates for batch STT, streaming STT, and TTS.
Add speech recognition to customer-facing or internal tools that need batch transcription of uploaded audio or live transcription during a conversation.
Build assistants that listen to spoken input and return structured text with timestamps, speaker labels, and normalized entities.
Generate spoken output for narrated content, interactive experiences, or accessibility features where natural delivery matters.
Handle multi-speaker recordings such as meetings, interviews, or support calls where speaker separation and multichannel input improve readability.
Support teams that work across languages and need a transcription system that can switch among many languages without changing the workflow.
The APIs are standalone audio endpoints for developers who want to add speech recognition or speech synthesis to an application. xAI highlights use cases such as voice agents, real-time transcription, accessibility tools, podcasts, and interactive audio experiences.
Speech to Text is available through both a batch REST API and a real-time WebSocket API. Text to Speech is also available through REST and WebSocket endpoints, so developers can choose batch or streaming workflows.
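To make the batch versus streaming distinction concrete, here is a minimal sketch of the client-side step that streaming usually requires: cutting raw audio into small frames before sending them over a socket. It does not call the xAI API. The audio format (16 kHz, 16-bit mono PCM), the 100 ms frame length, and the function names are assumptions for illustration, not values taken from the xAI documentation.

```python
from typing import Iterator

# Assumed audio format (not stated in the source): 16 kHz, 16-bit mono PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2


def frame_size_bytes(frame_ms: int) -> int:
    """Return the number of raw PCM bytes in one frame of frame_ms milliseconds."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000


def iter_frames(pcm: bytes, frame_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive frames of the buffer; the final frame may be shorter."""
    size = frame_size_bytes(frame_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


# One second of silence stands in for microphone input.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
frames = list(iter_frames(one_second, frame_ms=100))
print(len(frames), "frames of", len(frames[0]), "bytes")
```

In a batch workflow the whole file would be uploaded in one request instead, so no framing step is needed.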
Speech to Text supports multilingual transcription across 25+ languages. It also provides word-level timestamps, speaker diarization, multichannel support, and inverse text normalization for structured transcript output.
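As an illustration of how word-level timestamps and speaker labels can be combined downstream, the sketch below groups consecutive words from the same speaker into turns. The `Word` and `Turn` shapes are hypothetical; the actual response schema of the xAI API is not described in this text and may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    """Hypothetical word-level record: text, start/end in seconds, speaker label."""
    text: str
    start: float
    end: float
    speaker: str


@dataclass
class Turn:
    """A run of consecutive words spoken by one speaker."""
    speaker: str
    start: float
    end: float
    text: str


def group_turns(words: Iterable[Word]) -> List[Turn]:
    """Merge consecutive words with the same speaker label into turns."""
    turns: List[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = w.end
            turns[-1].text += " " + w.text
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(w.speaker, w.start, w.end, w.text))
    return turns


words = [
    Word("Hi", 0.0, 0.3, "A"),
    Word("there", 0.3, 0.6, "A"),
    Word("Hello", 0.8, 1.1, "B"),
]
for turn in group_turns(words):
    print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

Grouping by turn rather than by word is what makes a meeting or interview transcript readable as a dialogue.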
Text to Speech supports natural, expressive voices and speech tags for fine-grained control over prosody and emotion. Examples of inline and wrapping tags include `[laugh]`, `[sigh]`, `[whisper]`, `<emphasis>`, `<slow>`, and `<pause>`.
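The following sketch shows one way to assemble tagged input text before sending it to a synthesis endpoint. It assumes that bracketed tags are inserted inline at a point in the text and that angle-bracket tags wrap a span with an XML-style closing tag; the closing-tag syntax and the helper names `inline` and `wrap` are assumptions, not confirmed xAI syntax.

```python
def inline(name: str) -> str:
    """Build an inline tag such as [laugh] (bracket form taken from the tag list above)."""
    return f"[{name}]"


def wrap(name: str, text: str) -> str:
    """Wrap a span in a tag such as <emphasis>...</emphasis>.

    The XML-style closing tag is an assumption made for this sketch.
    """
    return f"<{name}>{text}</{name}>"


line = " ".join([
    "That was",
    wrap("emphasis", "not"),
    "what I expected.",
    inline("laugh"),
    wrap("slow", "Let me explain."),
])
print(line)
```

Building tags through small helpers keeps the markup consistent and makes it easy to adjust if the real syntax differs from the assumption here.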
Pricing is usage-based. Speech to Text is priced per hour, with separate rates for batch and streaming usage, and Text to Speech is priced per million characters. Current rate limits and full pricing details are available in the xAI API console.
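The pricing units above translate into simple arithmetic: hours of audio times an hourly rate for each STT mode, plus characters divided by one million times the TTS rate. The sketch below works through that calculation. The rates in it are placeholders chosen only to make the numbers easy to follow; they are not xAI's actual prices, which should be read from the API console.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Rates:
    """Placeholder rates in USD; NOT real xAI prices."""
    stt_batch_per_hour: Decimal
    stt_streaming_per_hour: Decimal
    tts_per_million_chars: Decimal


def estimate_cost(batch_hours: Decimal, streaming_hours: Decimal,
                  tts_characters: int, rates: Rates) -> Decimal:
    """Sum the usage-based charges for batch STT, streaming STT, and TTS."""
    stt_batch = batch_hours * rates.stt_batch_per_hour
    stt_streaming = streaming_hours * rates.stt_streaming_per_hour
    tts = Decimal(tts_characters) * rates.tts_per_million_chars / Decimal(1_000_000)
    return stt_batch + stt_streaming + tts


# Hypothetical rates for illustration only.
example_rates = Rates(
    stt_batch_per_hour=Decimal("0.10"),
    stt_streaming_per_hour=Decimal("0.20"),
    tts_per_million_chars=Decimal("4.00"),
)

total = estimate_cost(Decimal("10"), Decimal("2"), 500_000, example_rates)
print(f"Estimated cost: ${total}")
```

`Decimal` is used instead of floats so that currency amounts are not affected by binary rounding.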
Sanota is an app that turns spoken memories, reflections, and interviews into clear written stories. It supports personal storytelling, family history, and shared memories, with guided prompts and subscription pricing.
Carbon Voice is an asynchronous voice messaging app for teams and individuals, with transcripts, AI catch-up, and cross-device access. It helps people and agents communicate without needing a live call.
Talkpal is an AI-powered language learning web and mobile app for practicing speaking, listening, writing, and pronunciation. It offers guided courses, roleplays, and call-style conversation practice across 130+ languages.
Speech to Text Converter is a browser-based transcription tool for live dictation and uploaded audio or video files. It offers a free tier for short tasks and a Pro plan for unlimited transcription, AI summaries, translation, speaker identification, and advanced exports.
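MiniCPM-o 4.5 is a multimodal AI model on Hugging Face that supports vision, speech, text, and full-duplex live streaming. It is suited to local and server inference and is compatible with PyTorch, llama.cpp, Ollama, vLLM, SGLang, and quantized formats.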
Dictato is a Mac dictation app that transcribes speech into text in any app using an on-device, offline workflow. It supports multiple transcription engines, optional cleanup and translation, and a one-time purchase license.