
MiniCPM-o 4_5

MiniCPM-o 4_5 is a 9B omni-modal model for full-duplex live interaction across vision, speech, and text, with concurrent real-time streaming output.


What is MiniCPM-o 4_5?

MiniCPM-o 4_5 is an open model for end-to-end omni-modal live interaction that combines vision, speech, and text. It’s designed to work with real-time video and audio streams so the model can perceive what’s happening and respond with both concurrent text and speech output.

The model is built in an end-to-end fashion from components including SigLip2, Whisper-medium, CosyVoice2, and Qwen3-8B, with a stated total size of 9B parameters. Its core purpose is to enable full-duplex multimodal streaming: processing continuous inputs while generating outputs, with neither side blocking the other.

Key Features

  • Full-duplex multimodal live streaming (text + speech): Processes continuous video and audio input streams simultaneously while generating concurrent text and speech outputs, enabling “see, listen, and speak” in a fluid real-time interaction loop.
  • Proactive interaction at ~1 Hz decision frequency: Continuously monitors the input video/audio streams and decides roughly once per second (~1 Hz) whether to speak, supporting proactive behaviors such as initiating reminders or comments based on ongoing scene understanding.
  • Single-model instruct and thinking modes: Supports both “instruct” and “thinking” modes within the same model configuration to cover different efficiency/performance trade-offs across scenarios.
  • Bilingual real-time speech conversation with configurable voices: Supports real-time bilingual (English/Chinese) speech conversation and includes configurable voices for speech output.
  • Voice cloning and role play via reference audio: Enables voice cloning and role play from a simple reference audio clip at inference time; the page states its cloning performance surpasses tools such as CosyVoice2.
  • High-resolution and video throughput for multimodal inputs: Can process high-resolution images (up to 1.8 million pixels) and high-FPS videos (up to 10fps) in any aspect ratio efficiently.
  • OCR/document parsing for English documents: Provides end-to-end English document parsing, evaluated on OmniDocBench; the page states it outperforms the proprietary models it cites as well as specialized OCR tooling such as DeepSeek-OCR 2.
  • Multilingual capability (30+ languages): Stated support for more than 30 languages.
  • Configurable inference options for local use: Supports PyTorch inference on NVIDIA GPUs, plus end-side adaptation via llama.cpp and Ollama (CPU inference), quantized int4/GGUF models in multiple sizes, vLLM and SGLang for high-throughput/memory-efficient inference, and FlagOS for a unified multi-chip backend plugin.
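The "full-duplex" behavior described above can be illustrated with a small concurrency sketch. This is not the model's actual implementation, just a toy `asyncio` simulation of the idea: one task keeps feeding input chunks while another task consumes them and emits replies, and neither blocks the other. The frame strings and the `"question" in frame` check stand in for the model's ~1 Hz speak/stay-silent decision.

```python
import asyncio

async def input_stream(queue: asyncio.Queue, frames: list[str]) -> None:
    # Simulated continuous audio/video feed: chunks keep arriving
    # regardless of whether the responder is mid-reply.
    for frame in frames:
        await queue.put(frame)
        await asyncio.sleep(0)  # yield control, like a real-time feed
    await queue.put(None)  # end-of-stream sentinel

async def responder(queue: asyncio.Queue, outputs: list[str]) -> None:
    # Consumes input while emitting output, without pausing the feed.
    while True:
        frame = await queue.get()
        if frame is None:
            break
        if "question" in frame:  # toy stand-in for the speak/no-speak decision
            outputs.append(f"reply to {frame}")

async def main() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    outputs: list[str] = []
    frames = ["scene", "question 1", "scene", "question 2"]
    # Both tasks run concurrently: input keeps flowing while replies are produced.
    await asyncio.gather(input_stream(queue, frames), responder(queue, outputs))
    return outputs

print(asyncio.run(main()))  # → ['reply to question 1', 'reply to question 2']
```

In the real model the "responder" emits text and speech tokens concurrently with the incoming stream; the sketch only captures the non-blocking producer/consumer shape.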

How to Use MiniCPM-o 4_5

  1. Choose an inference path based on your hardware: PyTorch on an NVIDIA GPU for straightforward acceleration, or an end-side option such as llama.cpp/Ollama for CPU inference.
  2. Start from the provided demos: the page states there are open-sourced web demos that provide the full-duplex multimodal live streaming experience on local devices (e.g., a GPU workstation or a MacBook).
  3. Run inference using one of the supported backends (vLLM, SGLang, quantized GGUF/int4, or FlagOS plugin) depending on whether you prioritize throughput, memory efficiency, or compact deployment.
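The hardware-to-backend mapping in the steps above can be sketched as a small helper. This is purely illustrative (the function name and return strings are my own, not part of any official API); it just encodes the decision tree the page describes: NVIDIA GPU with a throughput focus goes to vLLM/SGLang, plain GPU use goes to PyTorch, CPU goes to llama.cpp/Ollama with quantized GGUF weights, and other accelerators go through the FlagOS plugin.

```python
def pick_backend(hardware: str, priority: str = "simplicity") -> str:
    """Toy mapping from hardware/priority to the inference paths listed above.
    Names are illustrative only, not an official configuration interface."""
    if hardware == "nvidia_gpu":
        if priority in ("throughput", "memory"):
            return "vLLM or SGLang"
        return "PyTorch"
    if hardware == "cpu":
        return "llama.cpp / Ollama with int4 GGUF weights"
    # Other accelerator chips: the page mentions FlagOS as a unified backend plugin.
    return "FlagOS multi-chip backend plugin"

print(pick_backend("cpu"))  # → llama.cpp / Ollama with int4 GGUF weights
```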

Use Cases

  • Full-duplex live tutoring or assistance on a phone/workstation: Use continuous audio/video input to support conversational, real-time responses that include both text and spoken output.
  • Live meeting or studio-style commentary: Monitor ongoing scenes and trigger proactive comments or reminders without waiting for purely reactive turn-taking.
  • Bilingual customer support with voice personalization: Enable real-time English/Chinese speech conversation and configure speech voices; optionally use voice cloning/role play when appropriate.
  • Document capture and parsing in real time: Feed high-resolution images to perform end-to-end English document parsing, aiming for structured outputs from documents rather than OCR-only workflows.
  • Multilingual scene understanding: Use the model’s stated >30-language capability to handle multilingual instructions or responses alongside visual inputs.

FAQ

  • What modalities does MiniCPM-o 4_5 support? The page describes support for vision (images/video), speech (bilingual real-time conversation), and text, with full-duplex live streaming where outputs can be generated concurrently with incoming streams.

  • Can it generate speech while it’s still receiving new audio/video? Yes. The model’s full-duplex streaming mechanism is described as processing input streams while simultaneously generating concurrent text and speech outputs, with neither side blocking the other.

  • Does MiniCPM-o 4_5 include voice customization? Yes. It supports configurable voices for English/Chinese and includes voice cloning and role play using a reference audio clip during inference.

  • What hardware options are supported for running the model locally? The page lists PyTorch inference on NVIDIA GPUs, CPU inference via llama.cpp and Ollama, quantized int4 GGUF variants, and serving/inference frameworks including vLLM and SGLang, plus FlagOS for multi-chip backends.

  • What kinds of visual inputs can it handle? It supports high-resolution images up to 1.8 million pixels and high-FPS videos up to 10fps in any aspect ratio, as stated on the page.
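The 1.8-megapixel cap with arbitrary aspect ratios implies a simple preprocessing step: scale an oversized image down until its pixel count fits the budget while preserving its aspect ratio. The sketch below shows one way to compute that (the function and constant names are mine; the model's actual preprocessing pipeline may differ).

```python
import math

MAX_PIXELS = 1_800_000  # 1.8-megapixel input cap stated on the page

def fit_resolution(width: int, height: int, max_pixels: int = MAX_PIXELS):
    """Scale (width, height) down so width * height <= max_pixels,
    preserving the aspect ratio. Illustrative only."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height  # already within budget, no resize needed
    scale = math.sqrt(max_pixels / pixels)  # uniform scale in both dimensions
    return int(width * scale), int(height * scale)

# A 12-megapixel 4:3 photo gets scaled to fit the cap:
print(fit_resolution(4000, 3000))
```

Because the scale factor is applied uniformly, any aspect ratio (portrait, landscape, or extreme panoramas) survives the resize.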

Alternatives

  • Other multimodal streaming/real-time LLM systems: Instead of a full-duplex omni-modal model, some solutions use separate pipelines (e.g., vision-to-text + ASR + TTS). These differ by workflow: they may not provide the same end-to-end, concurrent input/output streaming behavior described here.
  • Speech-focused assistants without unified vision streaming: Speech-first voice assistants can handle real-time conversations, but may not combine continuous vision input with concurrent speech/text outputs in the same end-to-end way.
  • Local OCR/document parsing toolchains: For document parsing tasks, dedicated OCR/document extraction tools may be more specialized; however, they typically focus on text extraction rather than the broader omni-modal live interaction (vision + speech + proactive behavior).