OpenAI Realtime API

What is OpenAI Realtime API?

The OpenAI Realtime API provides low-latency communication between your application and models that natively support speech-to-speech interactions. It also supports multimodal inputs—audio, images, and text—and multimodal outputs—audio and text—making it suitable for interactive voice experiences.

Beyond voice agents, the Realtime API can be used for real-time audio transcription by streaming audio over a WebSocket connection. The documentation also highlights recommended starting points (such as the Agents SDK for TypeScript) for browser-based voice agent workflows.

Key Features

Low-latency speech-to-speech interactions: Designed for real-time, conversational audio experiences rather than request/response only.
Multimodal inputs (audio, images, text): Lets a single session accept different input types depending on the application flow.
Multimodal outputs (audio and text): Supports returning either audio, text, or both as part of the interaction.
Multiple connection methods: Choose between WebRTC (browser/client-side), WebSocket (middle-tier server-side with consistent low latency), and SIP (VoIP telephony).
Session and conversation tooling guides: Includes guidance on prompting, conversation lifecycle events, and managing session behavior on the server.
Realtime transcription over WebSocket: Provides a path for transcribing audio streams in real time.

How to Use OpenAI Realtime API

Pick a connection method based on where your app runs: WebRTC for browser/client use, WebSocket for server/middle-tier, or SIP for VoIP telephony.
Start with a session. For browser voice agents, the docs recommend using the Agents SDK for TypeScript, which uses WebRTC in the browser and WebSocket on the server.
Create and initialize a session in your code, then connect using a client API key (example shown uses RealtimeAgent and RealtimeSession with session.connect).
Interact with the model using events. After connecting, use the provided guides for prompting/steering, conversation lifecycle management, and (when needed) server-side control via webhooks.

The documentation also notes GA migration details (see FAQ) that affect how you authenticate Realtime requests.

Use Cases

Browser-based voice agent with speech-to-speech: Use WebRTC (often via the Agents SDK for TypeScript) to connect a microphone and audio output for interactive conversation.
Server-backed realtime assistant: Use a WebSocket connection from a middle tier when you want consistent low-latency networking and centralized session handling.
VoIP/telephony integration: Connect via SIP when your target deployment is a telephony environment rather than a web browser.
Real-time audio transcription: Stream audio to a Realtime transcription flow over WebSocket to receive transcription results while audio is being sent.
Multimodal interaction: Accept audio alongside images and text in a single realtime session, then return either audio, text, or both.

FAQ

Do I need the beta header when using the GA Realtime API?

For GA requests, the documentation states the OpenAI-Beta: realtime=v1 header should be removed. If you want to retain beta behavior, you should continue to include that header.

How do I generate credentials for client-side (browser) Realtime sessions?

In the GA interface, the docs describe a single REST endpoint—POST /v1/realtime/client_secrets—to generate keys used for initializing a WebRTC or WebSocket connection from clients. The example shows creating a session configuration and posting it to that endpoint.

How do WebRTC and WebSocket differ in where they run?

The documentation positions WebRTC as ideal for browser/client-side interactions, while WebSocket is ideal for middle-tier server-side applications with consistent low-latency network connections.

What URL change applies to WebRTC SDP retrieval?

When initializing a WebRTC session in the browser, the docs state the URL for obtaining remote session information via SDP is now /v1/realtime/calls.

Can I use the Realtime API for transcription without full voice-agent behavior?

Yes. The documentation specifically calls out realtime audio transcription by transcribing audio streams in real time over a WebSocket connection.

Alternatives

Use the Agents SDK for TypeScript without building everything directly on Realtime primitives: This keeps you focused on voice agent orchestration while still leveraging Realtime under the hood for browser (WebRTC) and server (WebSocket) connectivity.
Build a request/response transcription pipeline instead of streaming: If your app doesn’t require real-time audio handling, a non-realtime transcription workflow avoids the event-driven session approach described for Realtime.
Other realtime communication approaches for voice: If you need telephony-specific flows, SIP-based integration is one option within the Realtime connection methods; otherwise, choose between WebRTC (browser) and WebSocket (server) depending on deployment.
Multimodal chat with non-realtime endpoints: If latency requirements are less strict than “low-latency communication,” a non-realtime multimodal chat approach may fit, though it won’t follow the same streaming/event session workflow described in Realtime docs.