speech-core

speech-core is a C++17 library for on-device speech orchestration, with VAD, streaming and batch STT, diarization, TTS, and a voice-agent pipeline.

On-device speech orchestration in C++17

speech-core is a C++17 library for building on-device speech systems and voice-agent pipelines. It combines orchestration logic with abstract interfaces for speech-to-text, text-to-speech, voice activity detection, diarization, enhancement, echo cancellation, and optional language-model hooks.

The project is designed to run locally on CPU, with model inference added through optional ONNX Runtime or LiteRT backends. The docs position it as a portable core for Linux, Windows, and Android, with Apple support available through a separate Swift sibling library.

Core capabilities

Voice-agent orchestration

The library provides a voice-agent pipeline in C++17 for on-device speech processing, including turn detection, interruption handling, audio utilities, and conversation tracking.
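
As an illustration of what turn detection involves, the sketch below turns per-frame VAD decisions into speech-start and speech-end events using a simple run-length rule with a hangover. It is a minimal example under stated assumptions, not speech-core's implementation: `TurnDetector`, `TurnEvent`, `detect_turns`, and both thresholds are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is not speech-core's API.
enum class TurnEvent { None, SpeechStarted, SpeechEnded };

class TurnDetector {
public:
    // start_frames: consecutive speech frames needed to open a turn.
    // hangover_frames: consecutive silent frames needed to close it.
    TurnDetector(std::size_t start_frames, std::size_t hangover_frames)
        : start_frames_(start_frames), hangover_frames_(hangover_frames) {
        assert(start_frames > 0 && hangover_frames > 0);
    }

    // Feed one VAD decision per audio frame; returns the event it triggers, if any.
    TurnEvent push(bool is_speech) {
        if (!in_turn_) {
            // Count consecutive speech frames; any silent frame resets the run.
            run_ = is_speech ? run_ + 1 : 0;
            if (run_ >= start_frames_) {
                in_turn_ = true;
                run_ = 0;
                return TurnEvent::SpeechStarted;
            }
        } else {
            // Count consecutive silent frames; any speech frame resets the run.
            run_ = is_speech ? 0 : run_ + 1;
            if (run_ >= hangover_frames_) {
                in_turn_ = false;
                run_ = 0;
                return TurnEvent::SpeechEnded;
            }
        }
        return TurnEvent::None;
    }

    bool in_turn() const { return in_turn_; }

private:
    std::size_t start_frames_;
    std::size_t hangover_frames_;
    std::size_t run_ = 0;
    bool in_turn_ = false;
};

// Convenience helper: run a whole frame sequence and collect the non-None events.
std::vector<TurnEvent> detect_turns(const std::vector<bool>& frames,
                                    std::size_t start_frames,
                                    std::size_t hangover_frames) {
    TurnDetector detector(start_frames, hangover_frames);
    std::vector<TurnEvent> events;
    for (bool frame : frames) {
        TurnEvent e = detector.push(frame);
        if (e != TurnEvent::None) events.push_back(e);
    }
    return events;
}
```

The hangover keeps a short pause inside a sentence from closing the turn. An interruption handler could plausibly reuse the same `SpeechStarted` event to stop TTS playback when the user starts talking, though how speech-core handles that is not described here.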

Multiple speech tasks in one core

It supports on-device voice activity detection, batch transcription, real-time streaming transcription, diarization, and text-to-speech from the same core package.

Backend flexibility

Model inference is opt-in through ONNX Runtime or LiteRT, and the docs list specific models available through each backend.

Interface-driven design

The orchestration layer depends only on abstract interfaces, so callers can plug in custom implementations such as CPU, GPU, CoreML/MLX, or a remote API.
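
The pattern can be sketched as follows. This is a hypothetical illustration of interface-driven design in C++17, not the library's actual headers: `ISpeechToText`, `FixedTextStt`, `Transcriber`, and the `transcribe` signature are all assumptions made for this example.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface; speech-core's real interface names and signatures may differ.
class ISpeechToText {
public:
    virtual ~ISpeechToText() = default;
    // Transcribe mono float PCM at the given sample rate.
    virtual std::string transcribe(const std::vector<float>& pcm, int sample_rate_hz) = 0;
};

// Stand-in implementation that returns a fixed string for any non-empty input.
// A real implementation might wrap a CPU or GPU runtime, or call a remote API.
class FixedTextStt final : public ISpeechToText {
public:
    explicit FixedTextStt(std::string text) : text_(std::move(text)) {}

    std::string transcribe(const std::vector<float>& pcm, int /*sample_rate_hz*/) override {
        return pcm.empty() ? std::string() : text_;
    }

private:
    std::string text_;
};

// Orchestration code sees only the interface, never a concrete backend.
class Transcriber {
public:
    explicit Transcriber(std::unique_ptr<ISpeechToText> stt) : stt_(std::move(stt)) {
        assert(stt_ != nullptr);
    }

    std::string run(const std::vector<float>& pcm) {
        return stt_->transcribe(pcm, 16000);  // 16 kHz is an assumed default for the example.
    }

private:
    std::unique_ptr<ISpeechToText> stt_;
};
```

Because `Transcriber` owns only a pointer to the abstract type, swapping the stand-in for a different backend needs no changes to the orchestration code, for example `Transcriber t(std::make_unique<FixedTextStt>("hello"));`.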

Cross-platform deployment

The repository includes platform and backend guidance for Linux, Windows, Android, and macOS, along with notes on hardware acceleration paths such as CUDA, TensorRT, NNAPI, and QNN.

Documented integration patterns

Example code in the docs shows direct use of the C++ interfaces for transcription, streaming partials, diarization, and TTS synthesis.
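
To give a sense of what streaming partials look like from the caller's side, here is a hedged sketch of a callback-based session. It does not reproduce the examples in the project's docs: `StreamingSession`, `TranscriptUpdate`, `push_audio`, and `finish` are hypothetical names, and the "decoder" is a stand-in that emits one placeholder token per audio chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types; not taken from speech-core's headers.
struct TranscriptUpdate {
    std::string text;  // Transcript so far.
    bool is_final;     // False for partials, true for the closing result.
};

class StreamingSession {
public:
    using Callback = std::function<void(const TranscriptUpdate&)>;

    explicit StreamingSession(Callback on_update) : on_update_(std::move(on_update)) {
        assert(on_update_ != nullptr);
    }

    // Stand-in for a model: each non-empty chunk "decodes" to one token, "w<N>".
    // A partial update is delivered after every chunk.
    void push_audio(const std::vector<float>& chunk) {
        if (chunk.empty()) return;
        if (!text_.empty()) text_ += ' ';
        text_ += "w" + std::to_string(++chunks_);
        on_update_({text_, false});
    }

    // Signal end of audio and deliver the final transcript.
    void finish() { on_update_({text_, true}); }

private:
    Callback on_update_;
    std::string text_;
    std::size_t chunks_ = 0;
};
```

A caller would construct the session with a lambda that updates the UI, call `push_audio` as capture buffers arrive, and call `finish` when the turn ends. Partials may be revised by a real model; this stand-in only ever appends.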

Where it fits

Voice-agent applications

Build a local assistant pipeline that detects speech, streams transcripts, passes events to an LLM, and generates responses without routing audio through a cloud service.

Streaming speech-to-text

Add low-latency transcription to desktop or mobile apps, including partial results while audio is still arriving.

Speaker diarization workflows

Segment audio by speaker for meeting notes, interview analysis, or diarization-aware post-processing.

On-device audio experiences

Embed VAD and TTS directly into a product that needs local audio interaction, such as an offline assistant or embedded speech feature.

Custom backend integration

Integrate only the orchestration layer and replace the reference models with custom CPU, GPU, CoreML, MLX, or remote implementations.
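
The voice-agent flow described in this section (detect speech, transcribe, pass text to an LLM, synthesize a reply) can be sketched as a chain of pluggable stages. This is a simplified, single-turn illustration with hypothetical names (`AgentStages`, `AgentTurn`, `run_turn`); it makes no claim about how speech-core wires its own pipeline, and each stage could be backed by a local model or a custom implementation.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

using Audio = std::vector<float>;

// Hypothetical stage hooks; each one can be swapped independently.
struct AgentStages {
    std::function<bool(const Audio&)> has_speech;            // VAD
    std::function<std::string(const Audio&)> transcribe;     // STT
    std::function<std::string(const std::string&)> respond;  // LLM hook
    std::function<Audio(const std::string&)> synthesize;     // TTS
};

// Result of one user turn.
struct AgentTurn {
    bool handled = false;
    std::string user_text;
    std::string reply_text;
    Audio reply_audio;
};

// Run one turn: VAD gate, then STT, then the LLM hook, then TTS.
AgentTurn run_turn(const AgentStages& stages, const Audio& input) {
    assert(stages.has_speech && stages.transcribe && stages.respond && stages.synthesize);

    AgentTurn turn;
    if (!stages.has_speech(input)) {
        return turn;  // Silence: skip the STT, LLM, and TTS work entirely.
    }
    turn.user_text = stages.transcribe(input);
    turn.reply_text = stages.respond(turn.user_text);
    turn.reply_audio = stages.synthesize(turn.reply_text);
    turn.handled = true;
    return turn;
}
```

Gating on VAD first means the more expensive stages run only when there is speech to process. A production pipeline would also stream between stages and handle interruptions rather than processing one complete turn at a time.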

Pros and Cons

Pros

Runs locally without sending audio to a cloud service, which fits offline or privacy-sensitive workflows.
Separates orchestration from model implementations through pure C++ interfaces.
Supports several speech tasks, including VAD, streaming STT, diarization, TTS, and speech enhancement.
Provides two interchangeable model backends, making it easier to choose a deployment path that matches the target platform.

Cons

The model backends are optional, so callers still need to choose and wire a backend or provide their own implementations for end-to-end speech features.
LiteRT support is documented as CPU-only today, with Hexagon and GPU delegates noted as not yet exposed through the C API.

FAQ

What is speech-core designed to do?

It is a C++17 speech core for local voice-agent workflows, with optional ONNX Runtime and LiteRT backends for the model implementations.

Which platforms does it support?

The repository documents Linux, Windows, and Android support, and also notes Apple support through a Swift sibling library.

Do I have to use the built-in model backends?

No. The core orchestration layer can run without model backends. Consumers can enable ONNX Runtime, LiteRT, both, or neither, and supply their own implementations of the interfaces instead.

What workflows does it cover?

The docs show batch transcription, streaming transcription with partials, voice activity detection, speaker diarization, text-to-speech, and a voice-agent pipeline that combines these pieces.

Quick Facts

Category: Developer Tool
Language: C++17
Source: GitHub: soniqo/speech-core
Platforms: Linux, Windows, Android, macOS
Backends: ONNX Runtime, LiteRT
Deployment: On-device / local inference
