Whisper.cpp Speech-to-Text

Real-time local speech-to-text transcription using decibri and whisper.cpp in Node.js. Runs entirely offline with no API key, no cloud service, and no network dependency.

What this does

This integration captures live audio from your microphone using decibri, buffers it into segments, and feeds each segment to whisper.cpp for transcription. Text appears in your terminal as you speak. Everything runs locally on your machine using OpenAI's Whisper model.

Choose this when you need high-accuracy transcription with support for multiple languages and model sizes. It is ideal for offline environments, privacy-sensitive applications, or batch-style voice pipelines where you can tolerate a short processing delay between segments. For continuous real-time streaming with minimal latency, use sherpa-onnx instead.

Why a binding layer is needed

Note: whisper.cpp does not ship an official Node.js native binding. Its official JavaScript support targets browsers via WebAssembly. To call the C++ inference engine from Node.js, a community-maintained native addon is required. This guide uses @kutalia/whisper-node-addon for its prebuilt binaries, GPU acceleration, and direct PCM buffer support.

Prerequisites

Install packages

npm install decibri @kutalia/whisper-node-addon

Download a Whisper model

Download a GGML model file from the whisper.cpp model repository. For example, the tiny English model:

curl -L -o ggml-tiny.en.bin https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin

On Windows, curl is available in PowerShell and Command Prompt (Windows 10+). You can also download the file directly from the model repository in your browser.

Model sizes

Choose a model based on your speed and accuracy requirements. Larger models are more accurate but slower to process.

Model              Size     Speed     Accuracy  Language
ggml-tiny.en.bin   ~75 MB   Fastest   Lower     English only
ggml-base.en.bin   ~142 MB  Fast      Good      English only
ggml-small.en.bin  ~466 MB  Moderate  Higher    English only
ggml-base.bin      ~142 MB  Fast      Good      Multilingual

The model must fit in system memory. For the tiny model, this is negligible. For larger models, ensure your machine has sufficient RAM. Multilingual models support automatic language detection with language: 'auto' or a specific language code such as 'fr', 'de', or 'ja'.
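The language option plugs into the same transcribe() call shown later in this guide. A minimal sketch of the options for a multilingual model (the field names mirror the English-only example below; the silent placeholder buffer is only there to keep the sketch self-contained):

```javascript
// Sketch: transcribe() options for a multilingual model.
// 'auto' asks whisper.cpp to detect the spoken language;
// a specific code such as 'fr' forces that language instead.
const float32 = new Float32Array(16000); // placeholder: 1 s of silence at 16 kHz

const multilingualOptions = {
  pcmf32: float32,          // Float32Array, 16 kHz mono, range -1.0 to 1.0
  model: './ggml-base.bin', // multilingual model from the table above
  language: 'auto',         // or a specific code: 'fr', 'de', 'ja', ...
  no_timestamps: true,
};
```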

How it works

Unlike sherpa-onnx, which processes audio as a continuous real-time stream, whisper.cpp transcribes audio in discrete segments. The integration pattern is a loop:

  1. Capture audio continuously with decibri
  2. Buffer several seconds of audio
  3. Send the buffer to whisper.cpp for transcription
  4. Print the result, clear the buffer, repeat

This means there is a short delay between speaking and seeing text, determined by the buffer duration you choose.
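The latency and memory cost of that choice are easy to compute. A small sketch of the buffer arithmetic for the 5-second default used below (16 kHz mono, 16-bit samples):

```javascript
// Buffer sizing for the capture loop: samples and bytes per segment.
const SAMPLE_RATE = 16000;  // Hz, what standard Whisper models expect
const BUFFER_SECONDS = 5;   // latency vs. accuracy trade-off

const targetSamples = SAMPLE_RATE * BUFFER_SECONDS; // 80000 samples per segment
const targetBytes = targetSamples * 2;              // 160000 bytes of Int16 PCM

console.log(targetSamples, targetBytes); // 80000 160000
```

Halving BUFFER_SECONDS halves both the worst-case wait for text and the per-segment memory, at the cost of giving the model less context per segment.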

Code walkthrough

1. Configuration

Set the model path, buffer duration, and sample rate. The addon accepts Float32 PCM audio at the sample rate the model expects (16 kHz for all standard Whisper models). GPU acceleration via Vulkan is enabled by default. Set use_gpu: false if Vulkan is not available on your system.

const Decibri = require('decibri');
const { transcribe } = require('@kutalia/whisper-node-addon');

const MODEL_PATH = './ggml-tiny.en.bin';
const SAMPLE_RATE = 16000;
const BUFFER_SECONDS = 5; // shorter = faster feedback, longer = better accuracy

2. Open the microphone

Create a decibri instance at 16 kHz mono. The default format is 16-bit signed integer PCM.

const mic = new Decibri({ sampleRate: SAMPLE_RATE, channels: 1 });

3. Buffer audio

Accumulate incoming PCM chunks into a list. When the buffer reaches the target duration, trigger transcription.

const chunks = [];
let bufferedSamples = 0;
const targetSamples = SAMPLE_RATE * BUFFER_SECONDS;

mic.on('data', (chunk) => {
  chunks.push(chunk);
  bufferedSamples += chunk.length / 2; // 2 bytes per Int16 sample

  if (bufferedSamples >= targetSamples) {
    processBuffer();
  }
});

4. Convert and transcribe

Concatenate the buffered chunks, convert Int16 PCM to Float32 (the addon expects pcmf32 as a Float32Array with samples in the range -1.0 to 1.0), and call transcribe().

let processing = false;

async function processBuffer() {
  if (processing) return;
  processing = true;

  // Concatenate buffered chunks and reset the buffer
  const pcm16 = Buffer.concat(chunks);
  chunks.length = 0;
  bufferedSamples = 0;

  // Convert Int16 PCM to Float32
  const int16 = new Int16Array(pcm16.buffer, pcm16.byteOffset, pcm16.length / 2);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    float32[i] = int16[i] / 32768;
  }

  try {
    // Transcribe
    const result = await transcribe({
      pcmf32: float32,
      model: MODEL_PATH,
      language: 'en', // use 'auto' with multilingual models (e.g. ggml-base.bin)
      no_timestamps: true,
    });

    // transcription may be nested arrays, so flatten to a single string
    const text = result.transcription.flat().join(' ').trim();
    if (text) console.log(text);
  } finally {
    // Release the lock even if transcribe() throws,
    // so later buffers are not silently dropped
    processing = false;
  }
}

Note: The first transcription call may take a few extra seconds while the model loads into memory. Subsequent calls are faster. The addon suppresses whisper.cpp's native console output by default (no_prints: true). Set it to false if you need debug output.
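As a sanity check, the Int16-to-Float32 conversion above maps the full signed 16-bit range onto -1.0 to just under 1.0, which is the range the addon expects:

```javascript
// Verify the /32768 scaling at the extremes of the Int16 range.
const int16 = Int16Array.from([-32768, 0, 32767]);
const float32 = new Float32Array(int16.length);
for (let i = 0; i < int16.length; i++) {
  float32[i] = int16[i] / 32768;
}
console.log(Array.from(float32)); // [ -1, 0, 0.999969482421875 ]
```

Note the slight asymmetry: -32768 maps exactly to -1.0, while 32767 maps to 1 - 2^-15. This is the conventional scaling for Int16 audio and is what Whisper-style models are trained on.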

5. Clean shutdown

Stop the microphone and process any remaining audio when the user presses Ctrl+C.

process.on('SIGINT', async () => {
  mic.stop();
  if (chunks.length > 0) await processBuffer();
  process.exit(0);
});

console.log('Listening... (Ctrl+C to stop)');

Full example

View complete code
const Decibri = require('decibri');
const { transcribe } = require('@kutalia/whisper-node-addon');

const MODEL_PATH = './ggml-tiny.en.bin';
const SAMPLE_RATE = 16000;
const BUFFER_SECONDS = 5; // shorter = faster feedback, longer = better accuracy

const mic = new Decibri({ sampleRate: SAMPLE_RATE, channels: 1 });

const chunks = [];
let bufferedSamples = 0;
const targetSamples = SAMPLE_RATE * BUFFER_SECONDS;
let processing = false;

async function processBuffer() {
  if (processing) return;
  processing = true;

  const pcm16 = Buffer.concat(chunks);
  chunks.length = 0;
  bufferedSamples = 0;

  const int16 = new Int16Array(pcm16.buffer, pcm16.byteOffset, pcm16.length / 2);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    float32[i] = int16[i] / 32768;
  }

  try {
    const result = await transcribe({
      pcmf32: float32,
      model: MODEL_PATH,
      language: 'en', // use 'auto' with multilingual models (e.g. ggml-base.bin)
      no_timestamps: true,
    });

    // transcription may be nested arrays, so flatten to a single string
    const text = result.transcription.flat().join(' ').trim();
    if (text) console.log(text);
  } finally {
    // Release the lock even if transcribe() throws
    processing = false;
  }
}

mic.on('data', (chunk) => {
  chunks.push(chunk);
  bufferedSamples += chunk.length / 2;

  if (bufferedSamples >= targetSamples) {
    processBuffer();
  }
});

process.on('SIGINT', async () => {
  mic.stop();
  if (chunks.length > 0) await processBuffer();
  process.exit(0);
});

console.log('Listening... (Ctrl+C to stop)');

Alternative bindings

The pattern is the same regardless of which whisper.cpp binding you use: decibri provides PCM audio, and the binding provides inference. Other community bindings that work with decibri include: