Voice-agent orchestration
The library provides a voice-agent pipeline in C++17 for on-device speech processing, including turn detection, interruption handling, audio utilities, and conversation tracking.
speech-core is a C++17 library for on-device speech orchestration, including VAD, streaming and batch STT, diarization, TTS, and a voice-agent pipeline. It runs locally and uses optional ONNX Runtime or LiteRT backends for model inference.
speech-core is a C++17 library for building on-device speech systems and voice-agent pipelines. It combines orchestration logic with abstract interfaces for speech-to-text, text-to-speech, voice activity detection, diarization, enhancement, echo cancellation, and optional language-model hooks.
The project is designed to run locally on CPU, with model inference added through optional ONNX Runtime or LiteRT backends. The docs position it as a portable core for Linux, Windows, and Android, with Apple support available through a separate Swift sibling library.
It supports on-device voice activity detection, batch transcription, real-time streaming transcription, diarization, and text-to-speech from the same core package.
Model inference is opt-in through ONNX Runtime or LiteRT, and the docs list specific models available through each backend.
The orchestration layer depends only on abstract interfaces, so callers can plug in custom implementations such as CPU, GPU, CoreML/MLX, or a remote API.
The repository includes platform and backend guidance for Linux, Windows, Android, and macOS, along with notes on hardware acceleration paths such as CUDA, TensorRT, NNAPI, and QNN.
Example code in the docs shows direct use of the C++ interfaces for transcription, streaming partials, diarization, and TTS synthesis.
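The plug-in design described above, where orchestration depends only on abstract interfaces, could look roughly like the following. This is a minimal sketch and not the library's actual API: `SpeechToText`, `FixedTextStt`, `Orchestrator`, `handle_utterance`, and the fixed 16 kHz sample rate are all hypothetical names and values chosen for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface: the real speech-core headers may name and shape this differently.
class SpeechToText {
public:
    virtual ~SpeechToText() = default;
    // Transcribe mono float PCM samples at the given sample rate.
    virtual std::string transcribe(const std::vector<float>& samples,
                                   int sample_rate_hz) = 0;
};

// Stand-in implementation that returns a fixed string for any non-empty input.
// A real one might wrap ONNX Runtime, LiteRT, a GPU runtime, or a remote API.
class FixedTextStt final : public SpeechToText {
public:
    explicit FixedTextStt(std::string text) : text_(std::move(text)) {}

    std::string transcribe(const std::vector<float>& samples,
                           int sample_rate_hz) override {
        assert(sample_rate_hz > 0);
        return samples.empty() ? std::string{} : text_;
    }

private:
    std::string text_;
};

// Orchestration code sees only the interface, never the concrete backend.
class Orchestrator {
public:
    explicit Orchestrator(std::unique_ptr<SpeechToText> stt) : stt_(std::move(stt)) {
        assert(stt_ != nullptr);
    }

    std::string handle_utterance(const std::vector<float>& samples) {
        return stt_->transcribe(samples, 16000);  // assumed 16 kHz input
    }

private:
    std::unique_ptr<SpeechToText> stt_;
};
```

Under these assumptions, swapping backends means constructing `Orchestrator` with a different `SpeechToText` subclass; the orchestration code itself does not change.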
Build a local assistant pipeline that detects speech, streams transcripts, passes events to an LLM, and generates responses without routing audio through a cloud service.
Add low-latency transcription to desktop or mobile apps, including partial results while audio is still arriving.
Segment audio by speaker for meeting notes, interview analysis, or diarization-aware post-processing.
Embed VAD and TTS directly into a product that needs local audio interaction, such as an offline assistant or embedded speech feature.
Integrate only the orchestration layer and replace the reference models with custom CPU, GPU, CoreML, MLX, or remote implementations.
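The local assistant use case above (detect speech, transcribe, hand the text to an LLM, speak the reply) can be sketched as a small turn loop. Everything here is an assumption made for illustration, not speech-core's API: the `AgentHooks` callbacks, the `TurnLoop` class, and the "hangover" rule that ends a turn after a fixed number of silent frames. For brevity it transcribes once per turn and leaves out streaming partials and interruption handling.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Frame = std::vector<float>;

// Hypothetical hooks; each would be backed by a local model or a custom implementation.
struct AgentHooks {
    std::function<bool(const Frame&)> is_speech;                       // VAD
    std::function<std::string(const std::vector<float>&)> transcribe;  // STT
    std::function<std::string(const std::string&)> respond;            // LLM
    std::function<void(const std::string&)> speak;                     // TTS
};

class TurnLoop {
public:
    TurnLoop(AgentHooks hooks, std::size_t hangover_frames)
        : hooks_(std::move(hooks)), hangover_frames_(hangover_frames) {
        assert(hooks_.is_speech && hooks_.transcribe && hooks_.respond && hooks_.speak);
    }

    // Feed one audio frame. Returns true when a user turn ended and was answered.
    bool push(const Frame& frame) {
        if (hooks_.is_speech(frame)) {
            // Accumulate speech and reset the silence counter.
            utterance_.insert(utterance_.end(), frame.begin(), frame.end());
            silent_run_ = 0;
            return false;
        }
        if (utterance_.empty()) return false;                // still waiting for speech
        if (++silent_run_ < hangover_frames_) return false;  // pause too short to end the turn

        // Enough trailing silence: transcribe, ask the LLM, and speak the reply.
        const std::string text = hooks_.transcribe(utterance_);
        hooks_.speak(hooks_.respond(text));
        utterance_.clear();
        silent_run_ = 0;
        return true;
    }

private:
    AgentHooks hooks_;
    std::size_t hangover_frames_;
    std::vector<float> utterance_;
    std::size_t silent_run_ = 0;
};
```

Because every stage is a callback, audio never has to leave the device unless one of the hooks is deliberately implemented as a remote call.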
speech-core is a C++17 library for local voice-agent workflows, with optional ONNX Runtime and LiteRT backends for the model implementations.
The repository documents Linux, Windows, and Android support, and also notes Apple support through a Swift sibling library.
The core orchestration layer can run without model backends: consumers can enable ONNX Runtime, LiteRT, both, or neither, and supply their own implementations.
The docs show batch transcription, streaming transcription with partials, voice activity detection, speaker diarization, text-to-speech, and a voice-agent pipeline that combines these pieces.
Speech to Text Converter is a browser-based transcription tool for live dictation and uploaded audio or video files. It offers a free tier for short tasks and a Pro plan for unlimited transcription, AI summaries, translation, speaker identification, and advanced exports.
Sanota is an app that turns spoken memories, reflections, and interviews into clear written stories. It supports personal storytelling, family history, and shared memories, with guided prompts and subscription pricing.
Carbon Voice is an asynchronous voice messaging app for teams and individuals, with transcripts, AI catch-up, and cross-device access. It helps people and agents communicate without needing a live call.
Talkpal is an AI-powered language learning web and mobile app for practicing speaking, listening, writing, and pronunciation. It offers guided courses, roleplays, and call-style conversation practice across 130+ languages.
An OpenAI API guide for choosing the right speech architecture for live audio, translation, transcription, speech generation, and audio-capable chat. It helps developers map each speech application to the appropriate session type, endpoint, and connection method.
Pewbeam is a church presentation app that listens to sermons, detects Bible verse references in real time, and displays the matching passage on screen. It is built for pastors, projection teams, and church media volunteers who want to reduce manual slide control during live services.