Realtime and audio

An OpenAI API guide for choosing the right speech architecture for live audio, translation, transcription, speech generation, and audio-capable chat. It helps developers map each speech application to the appropriate session type, endpoint, and connection method.

AI Speech Recognition

AI Voice Assistant

Speech-to-Text

Overview

Realtime and audio is an OpenAI API guide for choosing the right speech architecture for a specific application. It distinguishes between Realtime sessions for live, low-latency audio and request-based audio APIs for file-based, bounded, or generated speech workflows.

The guide covers voice agents, live translation, realtime transcription, speech generation, and audio-capable chat models. It also explains session types, transport choices, safety identifiers, and the changes needed when migrating a beta Realtime integration to the GA interface.

Core capabilities

Session types for different speech workflows

Choose between voice-agent, translation, and transcription session types based on whether the app needs responses, live translation, or transcript-only output.

Persistent live audio connections

Keep a Realtime session open while the client sends audio, receives events, and updates session state in real time.

Browser-ready voice-agent path

Build browser voice agents with the Agents SDK and WebRTC, with the option to connect to server-side tools.

Dedicated realtime translation flow

Use a dedicated translation endpoint for continuous speech translation instead of the standard assistant turn lifecycle.

Configurable realtime transcription latency

Tune realtime transcription with gpt-realtime-whisper latency controls so you can trade off earlier partial text against transcript quality.

Transport options matched to audio source

Select WebRTC, WebSocket, or SIP based on where audio is captured and played, from browser clients to telephony systems.

Common use cases

Voice agents: Build an assistant that listens to live audio, responds to the user, calls tools, and maintains conversation state in the same session.
Live translation: Translate speech as it is spoken using a dedicated realtime translation session that streams translated audio and transcript deltas.
Transcription: Turn streaming audio into transcript deltas, or process audio files into text when you do not need model-generated spoken responses.
Speech generation: Generate natural-sounding spoken audio from text with request-based speech generation models.
Audio-capable chat: Add audio to an existing Chat Completions app using audio-capable chat models when you want to extend a text-first workflow.

Pros and Cons

Pros

Helps developers choose between voice agents, translation, transcription, and request-based audio paths.
Explains which endpoint or pattern fits each session type.
Covers browser, server, mobile, and telephony connection methods.
Includes migration guidance from beta Realtime integrations to the GA interface.
Adds practical notes on safety identifiers and latency tuning.

Cons

The guide is scoped to architecture and workflow selection, so it does not provide pricing or performance benchmarks.
Support varies by connection method and model, so confirm it before building, especially when pairing SIP with translation or transcription sessions.

FAQ

When should I use the Realtime guide versus request-based audio APIs?

Use a Realtime session for live audio that needs low latency. Use a request-based audio API for files, bounded requests, or generated speech that does not need a live session.

What kind of app should use a voice-agent session?

Use a voice-agent session when the application should respond to the user, call tools, and manage conversation state. The guide also points browser voice agents toward the Voice agents guide, which uses the Agents SDK with WebRTC for browser audio and can connect to server-side tools.

What is the difference between translation and transcription sessions?

Use a translation session when the app should continuously translate speech as it arrives, and use a transcription session when the app needs live transcript deltas from streaming audio without model-generated spoken responses.
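
The choice between a request-based API and the three session types can be sketched as a small decision helper. This is only an illustration of the guidance summarized here: the `AudioNeeds` fields, the `choose_audio_path` function, and the returned labels are hypothetical names, not part of the OpenAI API. The sketch also assumes that an app needing both responses and translation is treated as a voice agent, which the guide does not state.

```python
from dataclasses import dataclass


@dataclass
class AudioNeeds:
    """Hypothetical description of what an app needs from audio."""
    live_low_latency: bool   # audio is streamed live and latency matters
    responds_to_user: bool   # the model should answer, call tools, keep state
    translates_speech: bool  # speech should be translated as it arrives


def choose_audio_path(needs: AudioNeeds) -> str:
    """Return an illustrative label for the architecture to consider."""
    if not needs.live_low_latency:
        # Files, bounded requests, or generated speech: no live session.
        return "request-based audio API"
    if needs.responds_to_user:
        # Assumption: responding takes priority over translating.
        return "voice-agent session"
    if needs.translates_speech:
        return "translation session"
    # Live transcript deltas without model-generated spoken responses.
    return "transcription session"
```

For example, `choose_audio_path(AudioNeeds(True, False, False))` selects the transcription session, because the audio is live but the app neither responds nor translates.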

Which connection method should I choose?

WebRTC is for browser and mobile clients that capture or play audio directly. WebSocket is for server-side media pipelines, call systems, or workers that already receive raw audio, and SIP is for telephony voice agents.
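
The transport mapping in this answer can be written as a lookup table. This is a minimal sketch: the `AudioSource` categories and the `choose_transport` function are hypothetical names chosen for illustration, and the table only restates the pairing given above.

```python
from enum import Enum


class AudioSource(Enum):
    """Hypothetical categories for where audio is captured or played."""
    BROWSER = "browser"
    MOBILE = "mobile"
    SERVER_PIPELINE = "server_pipeline"  # media pipelines, call systems, workers
    TELEPHONY_VOICE_AGENT = "telephony_voice_agent"


# Pairing taken from the answer above; the keys are illustrative names.
TRANSPORT_BY_SOURCE = {
    AudioSource.BROWSER: "WebRTC",
    AudioSource.MOBILE: "WebRTC",
    AudioSource.SERVER_PIPELINE: "WebSocket",
    AudioSource.TELEPHONY_VOICE_AGENT: "SIP",
}


def choose_transport(source: AudioSource) -> str:
    """Look up the connection method suggested for an audio source."""
    return TRANSPORT_BY_SOURCE[source]
```

The table covers SIP only for telephony voice agents. SIP with translation or transcription sessions is left out because support for those combinations needs to be checked first.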

Do Realtime sessions support safety identifiers?

The guide recommends adding a stable, privacy-preserving safety identifier for Realtime API requests when your application identifies individual end users. It should be sent in the OpenAI-Safety-Identifier header and kept stable across sessions for the same user.
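
One way to produce a stable, privacy-preserving identifier is to hash an internal user ID with a server-side secret. The sketch below assumes that approach: the header name comes from the answer above, but the HMAC-SHA256 derivation and the `safety_identifier` and `realtime_headers` helpers are illustrative assumptions, not a method the guide prescribes.

```python
import hashlib
import hmac


def safety_identifier(user_id: str, secret: bytes) -> str:
    """Derive a stable identifier that does not expose the raw user ID.

    Assumption: a keyed hash is an acceptable privacy-preserving scheme.
    """
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def realtime_headers(api_key: str, user_id: str, secret: bytes) -> dict:
    """Build request headers that carry the safety identifier."""
    return {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Safety-Identifier": safety_identifier(user_id, secret),
    }
```

Because the same user ID and secret always yield the same digest, the identifier stays stable across sessions for the same user, as long as the secret is not rotated.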

Quick Facts

Category: Developer Tool
Product area: OpenAI API
Primary focus: Realtime speech and audio workflows
Source domain: developers.openai.com
Main session types: Voice-agent, translation, and transcription sessions
Related transport options: WebRTC, WebSocket, and SIP

Realtime and audio alternatives

Lemon

Lemon is a Mac voice assistant that turns spoken instructions into finished writing tasks and other actions. It offers a free Basic plan, a paid Pro plan, and a workflow centered on pressing fn, speaking, and staying in the same tab.

QuickQuill

QuickQuill is a macOS dictation and transcription app that runs locally on the device. It helps users record meetings, transcribe audio, generate summaries, and export notes without using a cloud service.

Speech to Text Converter

Speech to Text Converter is a browser-based transcription tool for live dictation and uploaded audio or video files. It offers a free tier for short tasks and a Pro plan for unlimited transcription, AI summaries, translation, speaker identification, and advanced exports.

Pewbeam

Pewbeam is a church presentation app that listens to sermons, detects Bible verse references in real time, and displays the matching passage on screen. It is built for pastors, projection teams, and church media volunteers who want to reduce manual slide control during live services.

PXZ AI

An all-in-one AI platform that integrates image, video, voice, writing, and chat tools to enhance creativity and collaboration.

Gemma AI

Gemma AI is a phone call reminder app that calls you with scheduled reminders instead of push notifications. It helps people who want a more direct way to stay on schedule, with Google Calendar sync and conversational call interactions.