Realtime and audio

An OpenAI API guide for choosing the right speech architecture for live audio, translation, transcription, speech generation, and audio-capable chat. It helps developers map each speech application to the appropriate session type, endpoint, and connection method.

Overview

Realtime and audio is an OpenAI API guide for choosing the right speech architecture for a specific application. It distinguishes between Realtime sessions for live, low-latency audio and request-based audio APIs for file-based, bounded, or generated speech workflows.

The guide covers voice agents, live translation, realtime transcription, speech generation, and audio-capable chat models. It also explains session types, transport choices, safety identifiers, and the changes needed when migrating a beta Realtime integration to the GA interface.

Core capabilities

Session types for different speech workflows

Choose between voice-agent, translation, and transcription session types based on whether the app needs responses, live translation, or transcript-only output.

Persistent live audio connections

Keep a Realtime session open while the client sends audio, receives events, and updates session state in real time.

Browser-ready voice-agent path

Build browser voice agents with the Agents SDK and WebRTC, with the option to connect to server-side tools.

Dedicated realtime translation flow

Use a dedicated translation endpoint for continuous speech translation instead of the standard assistant turn lifecycle.

Configurable realtime transcription latency

Tune realtime transcription with gpt-realtime-whisper latency controls so you can trade off earlier partial text against transcript quality.

Transport options matched to audio source

Select WebRTC, WebSocket, or SIP based on where audio is captured and played, from browser clients to telephony systems.

Common use cases

  • Voice agents

    Build an assistant that listens to live audio, responds to the user, calls tools, and maintains conversation state in the same session.

  • Live translation

    Translate speech as it is spoken using a dedicated realtime translation session that streams translated audio and transcript deltas.

  • Transcription

    Turn streaming audio into transcript deltas, or process audio files into text when you do not need model-generated spoken responses.

  • Speech generation

    Generate natural-sounding spoken audio from text with request-based speech generation models.

  • Audio-capable chat

    Add audio to an existing Chat Completions app using audio-capable chat models when you want to extend a text-first workflow.

Pros and Cons

Pros

  • Helps developers choose between voice agents, translation, transcription, and request-based audio paths.
  • Explains which endpoint or pattern fits each session type.
  • Covers browser, server, mobile, and telephony connection methods.
  • Includes migration guidance from beta Realtime integrations to the GA interface.
  • Adds practical notes on safety identifiers and latency tuning.

Cons

  • The guide is scoped to architecture and workflow selection, so it does not provide pricing or performance benchmarks.
  • Some connection methods and models require checking support before use, especially for SIP with translation or transcription.

FAQ

When should I use the Realtime guide versus request-based audio APIs?

Use a Realtime session for live audio that needs low latency. Use a request-based audio API for files, bounded requests, or generated speech that does not need a live session. The Realtime and audio guide helps you decide between the two.

What kind of app should use a voice-agent session?

Use a voice-agent session when the application should respond to the user, call tools, and manage conversation state. For browser voice agents, the guide points to the Voice agents guide, which uses the Agents SDK with WebRTC for browser audio and can connect to server-side tools.

What is the difference between translation and transcription sessions?

Use a translation session when the app should continuously translate speech as it arrives, and use a transcription session when the app needs live transcript deltas from streaming audio without model-generated spoken responses.
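The selection rules in the answers above can be written as a small decision helper. This is only a sketch of the reasoning, not part of the OpenAI API: the `SpeechPath` enum, the `choose_speech_path` function, and its flag names are hypothetical, and the order in which the flags are checked is an assumption.

```python
from enum import Enum


class SpeechPath(Enum):
    """Hypothetical labels for the paths the guide distinguishes."""
    VOICE_AGENT = "voice-agent session"
    TRANSLATION = "translation session"
    TRANSCRIPTION = "transcription session"
    REQUEST_BASED = "request-based audio API"


def choose_speech_path(
    live_audio: bool,
    needs_responses: bool = False,
    needs_translation: bool = False,
) -> SpeechPath:
    """Map application requirements to a speech path.

    Assumption: an app that needs model responses is treated as a voice
    agent even if it also translates; the guide does not state a priority.
    """
    if not live_audio:
        # Files, bounded requests, or generated speech: no live session.
        return SpeechPath.REQUEST_BASED
    if needs_responses:
        # The app responds, calls tools, and keeps conversation state.
        return SpeechPath.VOICE_AGENT
    if needs_translation:
        # Continuous translation of speech as it arrives.
        return SpeechPath.TRANSLATION
    # Live transcript deltas only, with no spoken model responses.
    return SpeechPath.TRANSCRIPTION
```

For example, `choose_speech_path(live_audio=True, needs_translation=True)` returns `SpeechPath.TRANSLATION`, while a batch job over recorded files falls through to the request-based path.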

Which connection method should I choose?

WebRTC is for browser and mobile clients that capture or play audio directly. WebSocket is for server-side media pipelines, call systems, or workers that already receive raw audio, and SIP is for telephony voice agents.
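The mapping from audio source to transport can be captured in a lookup table. The sketch below is illustrative only: the `choose_transport` function and the source labels (`"browser"`, `"mobile"`, `"server"`, `"telephony"`) are hypothetical names, not identifiers from the API.

```python
# Hypothetical labels for where audio is captured or played,
# mapped to the transport described for each case.
TRANSPORT_BY_SOURCE = {
    "browser": "WebRTC",     # client captures or plays audio directly
    "mobile": "WebRTC",      # same reasoning as browser clients
    "server": "WebSocket",   # pipelines or workers that already hold raw audio
    "telephony": "SIP",      # telephony voice agents
}


def choose_transport(audio_source: str) -> str:
    """Return the transport for a given audio source label.

    Raises ValueError for labels outside the table so a typo does not
    silently fall back to a default transport.
    """
    key = audio_source.strip().lower()
    try:
        return TRANSPORT_BY_SOURCE[key]
    except KeyError:
        known = ", ".join(sorted(TRANSPORT_BY_SOURCE))
        raise ValueError(
            f"unknown audio source {audio_source!r}; expected one of: {known}"
        ) from None
```

Note that the table covers only the transport choice. Whether a given transport supports a particular session type, such as SIP with translation or transcription, still needs to be checked separately.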

Do Realtime sessions support safety identifiers?

The guide recommends adding a stable, privacy-preserving safety identifier for Realtime API requests when your application identifies individual end users. It should be sent in the OpenAI-Safety-Identifier header and kept stable across sessions for the same user.
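One way to obtain an identifier that is both stable and privacy-preserving is to derive it from the internal user ID with a keyed hash. The sketch below assumes that approach; the guide does not prescribe a derivation method. The function names, the `secret` parameter, and the choice of HMAC-SHA256 are all assumptions made for illustration. Only the `OpenAI-Safety-Identifier` header name comes from the guide.

```python
import hashlib
import hmac


def derive_safety_identifier(user_id: str, secret: bytes) -> str:
    """Derive a stable pseudonymous identifier from an internal user ID.

    Assumption: HMAC-SHA256 with an application-held secret. The same
    user_id and secret always yield the same value, and the raw user ID
    is not recoverable from the result without the secret.
    """
    digest = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def build_realtime_headers(api_key: str, user_id: str, secret: bytes) -> dict[str, str]:
    """Build request headers that include the safety identifier.

    This function only assembles a dict; it does not open a connection.
    """
    return {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Safety-Identifier": derive_safety_identifier(user_id, secret),
    }
```

Because the identifier depends only on the user ID and the secret, it stays the same across sessions for the same user as long as the secret is not rotated.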

Quick Facts

Category: Developer Tool
Product area: OpenAI API
Primary focus: Realtime speech and audio workflows
Source domain: developers.openai.com
Main session types: Voice-agent, translation, and transcription sessions
Related transport options: WebRTC, WebSocket, and SIP
