speech-core

speech-core is a C++17 library for on-device speech orchestration, including VAD, streaming and batch STT, diarization, TTS, and a voice-agent pipeline. It runs locally and uses optional ONNX Runtime or LiteRT backends for model inference.

On-device speech orchestration in C++17

speech-core is a C++17 library for building on-device speech systems and voice-agent pipelines. It combines orchestration logic with abstract interfaces for speech-to-text, text-to-speech, voice activity detection, diarization, enhancement, echo cancellation, and optional language-model hooks.

The project is designed to run locally on CPU, with model inference added through optional ONNX Runtime or LiteRT backends. The docs position it as a portable core for Linux, Windows, and Android, with Apple support available through a separate Swift sibling library.

Core capabilities

Voice-agent orchestration

The library provides a voice-agent pipeline in C++17 for on-device speech processing, including turn detection, interruption handling, audio utilities, and conversation tracking.
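Turn detection and interruption handling can be pictured as a small state machine driven by per-frame VAD decisions. The following is a minimal sketch, not speech-core's actual API: `TurnTracker`, `TurnEvent`, and the `hangover_frames` parameter are hypothetical names chosen for illustration, and the logic (end the user's turn after a run of silent frames, flag a barge-in when the user starts speaking over the agent) is one common approach rather than the library's documented behavior.

```cpp
#include <cstddef>

// Hypothetical event type; these names are illustrative, not from speech-core.
enum class TurnEvent { None, UserStarted, UserFinished, AgentInterrupted };

// Tracks who holds the floor, given one VAD decision per audio frame.
class TurnTracker {
public:
    // hangover_frames: consecutive silent frames required to end a user turn.
    explicit TurnTracker(std::size_t hangover_frames)
        : hangover_frames_(hangover_frames) {}

    void set_agent_speaking(bool speaking) { agent_speaking_ = speaking; }
    bool agent_speaking() const { return agent_speaking_; }

    // Feed the VAD result for the next frame and get the resulting event.
    TurnEvent on_frame(bool speech) {
        if (speech) {
            silence_run_ = 0;
            if (!user_speaking_) {
                user_speaking_ = true;
                if (agent_speaking_) {
                    // Barge-in: the user spoke over the agent, so playback
                    // should stop and the user takes the turn.
                    agent_speaking_ = false;
                    return TurnEvent::AgentInterrupted;
                }
                return TurnEvent::UserStarted;
            }
            return TurnEvent::None;
        }
        // Silent frame: end the turn only after enough consecutive silence.
        if (user_speaking_ && ++silence_run_ >= hangover_frames_) {
            user_speaking_ = false;
            silence_run_ = 0;
            return TurnEvent::UserFinished;
        }
        return TurnEvent::None;
    }

private:
    std::size_t hangover_frames_;
    std::size_t silence_run_ = 0;
    bool user_speaking_ = false;
    bool agent_speaking_ = false;
};
```

A caller would typically react to `UserFinished` by handing the transcript to the next stage, and to `AgentInterrupted` by cancelling any TTS playback in progress. The hangover count trades responsiveness against the risk of cutting a speaker off mid-pause.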

Multiple speech tasks in one core

It supports on-device voice activity detection, batch transcription, real-time streaming transcription, diarization, and text-to-speech from the same core package.

Backend flexibility

Model inference is opt-in through ONNX Runtime or LiteRT, and the docs list specific models available through each backend.

Interface-driven design

The orchestration layer depends only on abstract interfaces, so callers can plug in custom implementations such as CPU, GPU, CoreML/MLX, or a remote API.
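The separation described here can be illustrated with a small sketch. It assumes a hypothetical interface named `ISpeechToText` with a single `transcribe` method; speech-core's real interface names and signatures may differ, so treat every identifier below as an illustration of the pattern and not as the library's API.

```cpp
#include <string>
#include <vector>

// Hypothetical abstract interface (assumed for illustration only).
struct ISpeechToText {
    virtual ~ISpeechToText() = default;
    virtual std::string transcribe(const std::vector<float>& pcm,
                                   int sample_rate) = 0;
};

// Stand-in implementation. A real one might wrap a CPU or GPU model,
// or forward the audio to a remote API.
class FakeStt final : public ISpeechToText {
public:
    std::string transcribe(const std::vector<float>& pcm,
                           int sample_rate) override {
        return "samples=" + std::to_string(pcm.size()) +
               " rate=" + std::to_string(sample_rate);
    }
};

// Orchestration code sees only the interface, so any implementation
// can be swapped in without changing this function.
std::string run_transcription(ISpeechToText& stt,
                              const std::vector<float>& pcm,
                              int sample_rate) {
    if (pcm.empty()) {
        return "";  // nothing to transcribe
    }
    return stt.transcribe(pcm, sample_rate);
}
```

Because `run_transcription` takes the interface by reference, tests can pass a fake like `FakeStt` while production code passes a model-backed implementation.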

Cross-platform deployment

The repository includes platform and backend guidance for Linux, Windows, Android, and macOS, along with notes on hardware acceleration paths such as CUDA, TensorRT, NNAPI, and QNN.

Documented integration patterns

Example code in the docs shows direct use of the C++ interfaces for transcription, streaming partials, diarization, and TTS synthesis.
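As a rough picture of what streaming partials look like from the caller's side, here is a minimal sketch under stated assumptions. `IStreamingStt`, `push_audio`, `finish`, and the callback type are hypothetical names, and `ChunkCountingStt` is a toy stand-in that reports chunk counts instead of running a model; the project's own documented examples remain the reference for the real interfaces.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical streaming interface (assumed for illustration only).
struct IStreamingStt {
    virtual ~IStreamingStt() = default;
    virtual void push_audio(const std::vector<float>& chunk) = 0;
    virtual std::string finish() = 0;  // returns the final transcript
};

// Toy implementation: emits a partial result for every non-empty chunk.
class ChunkCountingStt final : public IStreamingStt {
public:
    using PartialCallback = std::function<void(const std::string&)>;

    explicit ChunkCountingStt(PartialCallback on_partial)
        : on_partial_(std::move(on_partial)) {}

    void push_audio(const std::vector<float>& chunk) override {
        if (chunk.empty()) {
            return;  // ignore empty chunks
        }
        ++chunks_;
        if (on_partial_) {
            on_partial_("partial after " + std::to_string(chunks_) +
                        " chunk(s)");
        }
    }

    std::string finish() override {
        std::string text =
            "final after " + std::to_string(chunks_) + " chunk(s)";
        chunks_ = 0;  // reset so the object can serve the next utterance
        return text;
    }

private:
    PartialCallback on_partial_;
    std::size_t chunks_ = 0;
};
```

The callback keeps the caller in control of what happens to each partial, for example updating a caption on screen, while `finish` marks the point where the transcript is stable enough to pass downstream.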

Where it fits

  • Voice-agent applications

    Build a local assistant pipeline that detects speech, streams transcripts, passes events to an LLM, and generates responses without routing audio through a cloud service.

  • Streaming speech-to-text

    Add low-latency transcription to desktop or mobile apps, including partial results while audio is still arriving.

  • Speaker diarization workflows

    Segment audio by speaker for meeting notes, interview analysis, or diarization-aware post-processing.

  • On-device audio experiences

    Embed VAD and TTS directly into a product that needs local audio interaction, such as an offline assistant or embedded speech feature.

  • Custom backend integration

    Integrate only the orchestration layer and replace the reference models with custom CPU, GPU, CoreML, MLX, or remote implementations.

Pros and Cons

Pros

  • Runs locally without sending audio to a cloud service, which fits offline or privacy-sensitive workflows.
  • Separates orchestration from model implementations through pure C++ interfaces.
  • Supports several speech tasks, including VAD, streaming STT, diarization, TTS, and speech enhancement.
  • Provides two interchangeable model backends, making it easier to choose a deployment path that matches the target platform.

Cons

  • The model backends are optional, so callers still need to choose and wire a backend or provide their own implementations for end-to-end speech features.
  • LiteRT support is documented as CPU-only today, with Hexagon and GPU delegates noted as not yet exposed through the C API.

FAQ

What is speech-core designed to do?

It is a C++17 speech core for local voice-agent workflows, with optional ONNX Runtime and LiteRT backends for the model implementations.

Which platforms does it support?

The repository documents Linux, Windows, and Android support, and also notes Apple support through a Swift sibling library.

Do I have to use the built-in model backends?

No. The core orchestration layer runs without model backends; consumers can enable ONNX Runtime, LiteRT, both, or neither, and supply their own implementations instead.

What workflows does it cover?

The docs show batch transcription, streaming transcription with partials, voice activity detection, speaker diarization, text-to-speech, and a voice-agent pipeline that combines these pieces.

Quick Facts

Category: Developer Tool
Language: C++17
Source: GitHub (soniqo/speech-core)
Platforms: Linux, Windows, Android, macOS
Backends: ONNX Runtime, LiteRT
Deployment: On-device / local inference

speech-core Alternatives

Speech to Text Converter

Speech to Text Converter is a browser-based transcription tool for live dictation and uploaded audio or video files. It offers a free tier for short tasks and a Pro plan for unlimited transcription, AI summaries, translation, speaker identification, and advanced exports.

Sanota

Sanota is an app that turns spoken memories, reflections, and interviews into clear written stories. It supports personal storytelling, family history, and shared memories, with guided prompts and subscription pricing.

Carbon Voice

Carbon Voice is an asynchronous voice messaging app for teams and individuals, with transcripts, AI catch-up, and cross-device access. It helps people and agents communicate without needing a live call.

Talkpal

Talkpal is an AI-powered language learning web and mobile app for practicing speaking, listening, writing, and pronunciation. It offers guided courses, roleplays, and call-style conversation practice across 130+ languages.

Realtime and audio

An OpenAI API guide for choosing the right speech architecture for live audio, translation, transcription, speech generation, and audio-capable chat. It helps developers map each speech application to the appropriate session type, endpoint, and connection method.

Pewbeam

Pewbeam is a church presentation app that listens to sermons, detects Bible verse references in real time, and displays the matching passage on screen. It is built for pastors, projection teams, and church media volunteers who want to reduce manual slide control during live services.