AssemblyAI Voice Agent API

What is AssemblyAI Voice Agent API?

AssemblyAI Voice Agent API is an API for building voice agents: an application streams audio in and receives audio back in real time. The page positions it as a way to add speech understanding and task completion to a voice experience, handling the core voice processing so that developers can focus on the agent’s product logic.

The accompanying examples indicate that transcripts can be shaped by prompts written for different purposes (e.g., capturing clinical history details, producing output suitable for conversational analysis, and handling proper nouns), and that the output can carry richer structure such as audio tags, verbatim disfluencies, and speaker-role labels.

Key Features

Real-time audio streaming (audio in, audio out): The page describes the API as “stream audio in, get audio back,” which suits voice-agent workflows where the agent responds while the interaction is still in progress.
Accurate transcription for task-critical entities: Example text highlights correct handling of items like emails, phone numbers, order IDs, and names, which are commonly needed for task completion.
Context-aware prompting for transcripts: Supports prompting that changes how the transcript is produced (e.g., a clinical history evaluation prompt that requires medication names and dosages to be captured accurately).
Control over transcript detail (verbatim, disfluencies, and keyterms): Examples show options to include disfluencies (fillers, repetitions, restarts, stutters, informal speech) and to request key terms.
Audio tagging and event labeling: Shows “non-speech audio event” output and includes an example of adding tags such as “beep,” distinguishing sounds from spoken content.
Speaker roles in transcripts: Supports labeling each speaker turn with a role (e.g., formatting like [Speaker:NURSE] / [Speaker:PATIENT]).
Language detection and code-switching preservation: Includes an example where English/Spanish code-switching is preserved “as-is.” The same example also indicates language detection.
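
The speaker-role feature above lends itself to simple downstream parsing. The following is a minimal sketch, assuming that each turn in the returned text begins with a tag shaped like the page's [Speaker:NURSE] example; the actual output format may differ, and the sample transcript, the Turn type, and the split_turns helper are all hypothetical names introduced for illustration.

```python
import re
from typing import List, NamedTuple


class Turn(NamedTuple):
    role: str
    text: str


# Assumed tag shape, modeled on the "[Speaker:NURSE]" example; not a confirmed format.
SPEAKER_TAG = re.compile(r"\[Speaker:([A-Za-z_ ]+)\]")


def split_turns(transcript: str) -> List[Turn]:
    """Split a role-labeled transcript into (role, text) turns."""
    matches = list(SPEAKER_TAG.finditer(transcript))
    turns = []
    for i, match in enumerate(matches):
        # A turn runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        text = transcript[match.end():end].strip()
        turns.append(Turn(role=match.group(1).strip().upper(), text=text))
    return turns


# Made-up sample text, not real API output.
sample = "[Speaker:NURSE] Any medications? [Speaker:PATIENT] Um, metformin, 500 milligrams."
for turn in split_turns(sample):
    print(f"{turn.role}: {turn.text}")
```

Keeping the role and the text as separate fields makes it straightforward to filter by speaker or to feed a single speaker's turns into later processing.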

How to Use AssemblyAI Voice Agent API

Get an API key: The page includes a “Get your API Key” callout.
Try the live Voice Agent API demo: Use the provided “Try the Voice Agent API live” support agent to experience real-time behavior.
Build your voice agent around streamed audio: Integrate the API into your application so the agent can send audio input and receive transcription/output during the call.
Adjust transcription output with prompting and structured requests: Choose the level of transcript detail you need (e.g., verbatim disfluencies, audio tags, speaker-role labeling, language/code-switching handling) based on the task.
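
Streaming audio usually means sending it in small fixed-duration frames rather than as one file. The sketch below shows only that generic chunking step and does not call the Voice Agent API; the 16 kHz, 16-bit mono format and the 50 ms frame length are assumptions chosen for illustration, and frame_size_bytes and iter_frames are hypothetical helper names.

```python
from typing import Iterator


def frame_size_bytes(sample_rate_hz: int, frame_ms: int, bytes_per_sample: int = 2) -> int:
    """Bytes in one mono PCM frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000 * bytes_per_sample


def iter_frames(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive frames of `size` bytes; the last frame may be shorter."""
    if size <= 0:
        raise ValueError("frame size must be positive")
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# One second of silence as 16 kHz, 16-bit mono PCM (assumed format).
pcm = bytes(16000 * 2)
frames = list(iter_frames(pcm, frame_size_bytes(16000, 50)))
print(len(frames), len(frames[0]))
```

In a real integration, each frame would be handed to whatever streaming client the API documentation specifies, which is outside the scope of this sketch.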

Use Cases

Clinical intake or clinical history evaluation support: Configure the transcript output to capture medication names and dosages and to include disfluency data (fillers, repetitions, restarts, stutters, informal speech) for more meaningful evaluation.
Conversational analysis transcripts: Produce transcripts “suitable for conversational analysis,” optionally adding tags for non-speech events (e.g., a beep) and controlling whether disfluencies are included.
Automated support lines that need reliable entity capture: Use transcript accuracy for operational details such as phone numbers, order IDs, and names so the agent can complete common customer requests.
Role-based call summaries: Label each speaker turn with roles (like nurse/patient) to make downstream processing easier for workflows that depend on who said what.
Bilingual voice interactions: Preserve natural code-switching between English and Spanish so the transcript reflects what was spoken without forcing a single language.
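
For the support-line use case above, an agent typically pulls task-critical entities out of the transcript text before acting on them. The following is a rough sketch using regular expressions; the patterns are deliberately simple, the order ID format (letters, a hyphen, then digits) is a hypothetical convention, and the sample sentence is made up rather than taken from the API.

```python
import re
from typing import Dict, List

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
# Hypothetical order ID shape such as "ORD-48213"; adjust to your own scheme.
ORDER_ID = re.compile(r"\b[A-Z]{2,4}-\d{4,}\b")


def extract_entities(transcript: str) -> Dict[str, List[str]]:
    """Collect emails, phone numbers, and order IDs found in transcript text."""
    return {
        "emails": EMAIL.findall(transcript),
        "phones": PHONE.findall(transcript),
        "order_ids": ORDER_ID.findall(transcript),
    }


# Made-up sample text, not real API output.
text = "My email is jane.doe@example.com, my number is 415-555-0134, and the order is ORD-48213."
print(extract_entities(text))
```

Pattern matching of this kind only works when the transcript already renders these items in written form, which is why transcription accuracy on such entities matters for task completion.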

FAQ

Is the live demo agent the same one I can build with the API?

Yes. The page notes that the support agent shown in the live demo is built on the Voice Agent API, the same API you can ship with.

Does the demo agent provide support for other products?

No. The page states the agent provides customer support for AssemblyAI products only.

Can the agent return transcripts with disfluencies included?

Yes, according to the examples. They indicate that transcript generation can be prompted to include disfluencies such as fillers, repetitions, restarts, stutters, and informal speech.

Can transcripts include non-speech audio tags?

Yes. The examples show “audio tags” and a case where a beep is included as a tag during transcript generation.

Can it handle multiple languages or code-switching?

The page includes an example of language detection and preserving natural code-switching between English and Spanish.

Alternatives

Speech-to-text APIs with configurable punctuation/diarization: If you mainly need transcription, a standard speech-to-text API with speaker diarization can be an alternative; however, you may need additional work to replicate the same transcript prompting controls and audio-tagging behavior shown here.
Generic voice agent frameworks (LLM orchestration + speech models): You can also use a voice-agent framework that combines streaming ASR/TTS and an LLM. This may shift the burden of prompt-driven transcript formatting and structured outputs to your own pipeline.
Customer support IVR/voice platforms: For support-line automation, IVR-style platforms can handle common call flows, but they may not offer the same transcript-level control (e.g., verbatim disfluencies, audio tags, speaker-role labels) intended for downstream analysis.
Meeting/call transcription tools with speaker labels: These tools can produce transcripts with speaker attribution. Compare them on whether they support the same level of disfluency capture and the configurable transcription behaviors demonstrated in the API examples.