Gemini 3.1 Flash TTS

Gemini 3.1 Flash TTS is Google’s preview text-to-speech model for generating expressive AI speech with fine-grained control over style and delivery. It is available across the Gemini API, Google AI Studio, Vertex AI, and Google Vids.

AI Speech Synthesis

Text-to-Speech


Overview

Gemini 3.1 Flash TTS is Google’s text-to-speech model for generating expressive AI speech with tighter control over how audio sounds. The launch announcement emphasizes improved naturalness, clearer pacing control, and new audio tags that let developers direct vocal style and delivery through natural-language instructions.

The model is rolling out in preview for developers through the Gemini API and Google AI Studio, for enterprises through Vertex AI, and for Workspace users through Google Vids. It supports 70+ languages, native multi-speaker dialogue, and SynthID watermarking for every generated audio output.

Features

Improved speech quality

Google presents the model as its most natural and expressive text-to-speech model to date, citing improvements in both speech quality and controllability.

Granular audio tags

Audio tags let users direct vocal style, pace, delivery, tone, and accent with natural-language instructions embedded in the text input.
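As a rough illustration of inline tagging, the sketch below joins (tag, text) segments into a single text input. The square-bracket `[tag]` syntax, the tag names, and the `compose_tagged_script` helper are all hypothetical. Check the Gemini API speech-generation documentation for the tag format the model actually accepts.

```python
from typing import Iterable, Optional, Tuple

# Each segment is (tag, text). Use None as the tag for untagged text.
# ASSUMPTION: tags are written as "[tag]" directly before the text they
# modify. This syntax is hypothetical, so verify it against the Gemini API docs.
Segment = Tuple[Optional[str], str]


def compose_tagged_script(segments: Iterable[Segment]) -> str:
    """Join (tag, text) segments into one text input with inline tags."""
    parts = []
    for tag, text in segments:
        text = text.strip()
        if not text:
            continue  # skip empty segments rather than emitting a bare tag
        if tag:
            parts.append(f"[{tag.strip()}] {text}")
        else:
            parts.append(text)
    return " ".join(parts)


script = compose_tagged_script([
    ("excited", "We just shipped the new release!"),
    (None, "Here is what changed."),
    ("whispering", "And one secret feature."),
])
print(script)
```

The resulting string becomes the text input of a speech request. Keeping segments as data instead of hand-editing one long string makes it easier to adjust delivery for a single line.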

Studio-directed performance controls

Google AI Studio adds configurable controls for scene direction, speaker-level specificity, and inline tags, helping developers shape multi-turn performances.

Seamless export to API code

Developers can export the exact voice parameters from Google AI Studio as Gemini API code for consistent reuse across projects and platforms.
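Exported settings end up as Gemini API calls. Below is a minimal sketch of an equivalent single-speaker request that uses only the Python standard library and the REST `generateContent` endpoint. The model ID is a hypothetical placeholder. `Kore` is a prebuilt voice from the existing Gemini TTS voice list and may not be available for this model. The helper names are illustrative, and code exported from Google AI Studio may look different. `send` needs network access and a `GEMINI_API_KEY` environment variable, so the usage line only builds the request.

```python
import json
import os
import urllib.request

API_ROOT = "https://generativelanguage.googleapis.com/v1beta/models"
# ASSUMPTION: hypothetical model ID. Look up the real Gemini 3.1 Flash TTS
# identifier in the Gemini API model list before sending requests.
MODEL_ID = "gemini-3.1-flash-tts-preview"


def build_tts_request(text: str, voice_name: str, model: str = MODEL_ID):
    """Return (url, body) for a single-speaker generateContent TTS call."""
    url = f"{API_ROOT}/{model}:generateContent"
    body = {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],  # ask for audio, not text
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name}}
            },
        },
    }
    return url, body


def send(url: str, body: dict) -> dict:
    """POST the request using the key in GEMINI_API_KEY (needs network)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-goog-api-key": os.environ["GEMINI_API_KEY"],
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Build (but do not send) a request with a natural-language style hint.
url, body = build_tts_request("Say warmly: Welcome back!", voice_name="Kore")
print(url)
print(json.dumps(body, indent=2))
```

Storing the voice name and prompt text as plain data means the same parameters can be reused across projects, which is the point of the export feature.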

Multi-speaker and multilingual support

The model supports native multi-speaker dialogue and over 70 languages, which makes it suitable for localized and conversational speech experiences.
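For multi-speaker dialogue, the Gemini API's speech configuration assigns a voice to each speaker name used in the transcript. The sketch below builds a request body of that kind. The speaker names, voices, and the `build_multi_speaker_body` helper are hypothetical. The field layout follows the existing Gemini TTS REST format and is assumed to carry over to this model.

```python
import json
from typing import Dict, List, Tuple


def build_multi_speaker_body(turns: List[Tuple[str, str]],
                             voices: Dict[str, str]) -> dict:
    """Build a generateContent body for a dialogue between named speakers."""
    missing = {speaker for speaker, _ in turns} - set(voices)
    if missing:
        raise ValueError(f"no voice assigned for: {sorted(missing)}")
    # Each transcript line is prefixed with the speaker name, which must
    # match the "speaker" field in speakerVoiceConfigs.
    transcript = "\n".join(f"{speaker}: {line}" for speaker, line in turns)
    return {
        "contents": [{"parts": [{"text": transcript}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {
                    "speakerVoiceConfigs": [
                        {"speaker": name,
                         "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}}
                        for name, voice in voices.items()
                    ]
                }
            },
        },
    }


body = build_multi_speaker_body(
    turns=[("Joe", "How's the launch going?"), ("Jane", "Better than expected!")],
    voices={"Joe": "Kore", "Jane": "Puck"},  # hypothetical voice choices
)
print(json.dumps(body, indent=2))
```

Checking for speakers without an assigned voice before sending the request catches transcript typos locally instead of at the API.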

SynthID watermarking

All generated audio is watermarked with SynthID to support detection of AI-generated content.

Use Cases

Developer speech apps
Build applications that need synthesized speech with controlled delivery, such as character voices, narrated experiences, or interactive assistants.
Voice workflow prototyping
Prototype voice experiences in Google AI Studio, refine pacing and tone with tags and notes, and export the resulting settings into Gemini API code.
Multilingual content production
Create localized speech experiences for audiences across multiple languages while keeping style and accent control consistent.
Workspace video narration
Use the model in Google Vids when you need AI-generated speech for Workspace media workflows.
Watermarked synthetic audio
Generate audio with built-in SynthID watermarking when you need detectable AI-generated speech for safer distribution.
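Most of these workflows finish by saving the generated audio. The sketch below assumes the model returns audio the same way earlier Gemini TTS models do: base64-encoded raw 16-bit PCM, mono, 24 kHz, under `inlineData.data`. Under that assumption, a short standard-library helper can wrap the audio in a WAV file. The helper names and the fake response in the usage lines are hypothetical.

```python
import base64
import io
import wave

# ASSUMPTION: raw PCM, 16-bit signed little-endian, mono, 24 kHz, as
# documented for earlier Gemini TTS models. Confirm this for Gemini 3.1 Flash TTS.
SAMPLE_RATE = 24_000
SAMPLE_WIDTH = 2  # bytes per sample (16-bit)
CHANNELS = 1


def extract_audio_b64(response: dict) -> str:
    """Pull the inline audio payload out of a generateContent JSON response."""
    return response["candidates"][0]["content"]["parts"][0]["inlineData"]["data"]


def pcm_b64_to_wav_bytes(b64_pcm: str) -> bytes:
    """Wrap base64-encoded raw PCM in a WAV container and return the bytes."""
    pcm = base64.b64decode(b64_pcm)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm)
    return buf.getvalue()


# Hypothetical response holding 0.1 s of silence (2,400 zero-valued samples).
silence = base64.b64encode(b"\x00\x00" * 2400).decode("ascii")
fake_response = {"candidates": [{"content": {"parts": [{"inlineData": {"data": silence}}]}}]}
wav_bytes = pcm_b64_to_wav_bytes(extract_audio_b64(fake_response))
print(len(wav_bytes))
```

In a real app, write `wav_bytes` to a `.wav` file or stream it to a player. Any SynthID watermark is part of the audio signal itself, so wrapping the PCM this way does not add or remove it.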

Pros and Cons

Pros

Offers fine-grained control over vocal style, pacing, tone, and accent using audio tags.
Supports 70+ languages and native multi-speaker dialogue.
Exports studio settings as Gemini API code for repeatable workflows.
Includes SynthID watermarking on all generated audio.
Available across multiple Google surfaces, including Gemini API, Google AI Studio, Vertex AI, and Google Vids.

Cons

Pricing, plan limits, and region-by-region availability are not published on the product page.
Most details about the advanced controls come from the launch announcement, so their behavior in specific workflows needs hands-on testing.

FAQ

Where is Gemini 3.1 Flash TTS available?

It is rolling out for developers in preview through the Gemini API and Google AI Studio, for enterprises in preview on Vertex AI, and for Workspace users through Google Vids.

How many languages does it support?

The announcement says it supports 70+ languages and includes native multi-speaker dialogue.

What controls does it give developers over speech output?

Developers can use audio tags, Audio Profiles, Director’s Notes, and inline tags in Google AI Studio to steer vocal style, pacing, tone, accent, and speaker delivery, then export the same parameters as Gemini API code.

Is the generated audio watermarked?

All audio generated by Gemini 3.1 Flash TTS is watermarked with SynthID, which is described as an imperceptible watermark for detecting AI-generated audio.

How much does it cost?

The product page does not list pricing, and the linked pricing page currently returns a 404 error.

Quick Facts

Category: AI speech / text-to-speech
Primary users: Developers, enterprises, and Workspace users
Availability: Preview rollout
Platforms: Gemini API, Google AI Studio, Vertex AI, Google Vids
Languages: 70+ languages
Watermarking: SynthID

Gemini 3.1 Flash TTS Alternatives

蓝藻AI

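蓝藻AI is an online AI voiceover and speech synthesis product that converts text to speech and supports self-service voice cloning. According to its page, it is aimed at content that needs voiceover, such as short videos and audiobooks.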

Ondoku

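Ondoku is browser-based text-to-speech software that converts text into downloadable .mp3 audio and offers both a free allowance and paid plans. It supports reading in multiple languages, reading text from images, and commercial use under its stated rules.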

Typecast

Typecast is an online AI voice generator that turns text into life-like speech with emotional delivery and a selection of hyper-realistic voices. It is a browser-based tool for creating spoken audio from written content.

Noiz AI

Noiz AI is an AI text-to-speech, voice cloning, and voice design tool for creating lifelike speech from text. It also lets users shape voice delivery, including emotion, within the same workflow.

魔音工坊 (Moying Gongfang)

魔音工坊 (Moying Gongfang) is an intelligent online text-to-speech (TTS) platform that turns text into high-quality speech using realistic human voices and a variety of intonations.

TADA

TADA is Hume AI’s open-source speech-language model for generating speech with one-to-one text-acoustic alignment. It is aimed at developers and researchers building faster, more reliable voice systems, including on-device and long-form speech applications.