Real-time local speech-to-text transcription using decibri and whisper.cpp in Node.js. Runs entirely offline with no API key, no cloud service, and no network dependency.
This integration captures live audio from your microphone using decibri, buffers it into segments, and feeds each segment to whisper.cpp for transcription. Text appears in your terminal as you speak. Everything runs locally on your machine using OpenAI's Whisper model.
Choose this integration when you need high-accuracy transcription with support for multiple languages and model sizes. It is ideal for offline environments, privacy-sensitive applications, or batch-style voice pipelines that can tolerate a short processing delay between segments. For continuous real-time streaming with minimal latency, use sherpa-onnx instead.
Download a GGML model file from the whisper.cpp model repository. For example, the tiny English model:
curl -L -o ggml-tiny.en.bin https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin
On Windows 10 and later, curl.exe ships with the OS. In PowerShell, invoke it as curl.exe, since plain curl is an alias for Invoke-WebRequest. You can also download the file directly from the model repository in your browser.
Choose a model based on your speed and accuracy requirements. Larger models are more accurate but slower to process.
| Model | Size | Speed | Accuracy | Language |
|---|---|---|---|---|
| ggml-tiny.en.bin | ~75 MB | Fastest | Lower | English only |
| ggml-base.en.bin | ~142 MB | Fast | Good | English only |
| ggml-small.en.bin | ~466 MB | Moderate | Higher | English only |
| ggml-base.bin | ~142 MB | Fast | Good | Multilingual |
The model must fit in system memory. For the tiny model, this is negligible. For larger models, ensure your machine has sufficient RAM. Multilingual models support automatic language detection with language: 'auto' or a specific language code such as 'fr', 'de', or 'ja'.
Unlike sherpa-onnx, which processes audio as a continuous real-time stream, whisper.cpp transcribes audio in discrete segments. The integration pattern is a loop: capture microphone audio, buffer it until the target duration is reached, transcribe the segment, print the text, and repeat.
This means there is a short delay between speaking and seeing text, determined by the buffer duration you choose.
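The buffer arithmetic behind that delay is straightforward; as a quick sketch, with the 16 kHz mono Int16 format used throughout this guide:

```javascript
// Buffer sizing for 16-bit mono PCM at 16 kHz (the values used in this guide).
const SAMPLE_RATE = 16000;   // samples per second, as Whisper expects
const BUFFER_SECONDS = 5;    // chosen segment length

const samplesPerSegment = SAMPLE_RATE * BUFFER_SECONDS; // samples per transcribed segment
const bytesPerSegment = samplesPerSegment * 2;          // Int16 = 2 bytes per sample

console.log(samplesPerSegment); // 80000
console.log(bytesPerSegment);   // 160000
```

The perceived latency is roughly BUFFER_SECONDS plus the model's processing time for that segment.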
Set the model path, buffer duration, and sample rate. The addon accepts Float32 PCM audio at the sample rate the model expects (16 kHz for all standard Whisper models). GPU acceleration via Vulkan is enabled by default. Set use_gpu: false if Vulkan is not available on your system.
const Decibri = require('decibri');
const { transcribe } = require('@kutalia/whisper-node-addon');
const MODEL_PATH = './ggml-tiny.en.bin';
const SAMPLE_RATE = 16000;
const BUFFER_SECONDS = 5; // shorter = faster feedback, longer = better accuracy
Create a decibri instance at 16 kHz mono. The default format is 16-bit signed integer PCM.
const mic = new Decibri({ sampleRate: SAMPLE_RATE, channels: 1 });
Accumulate incoming PCM chunks into a list. When the buffer reaches the target duration, trigger transcription.
const chunks = [];
let bufferedSamples = 0;
const targetSamples = SAMPLE_RATE * BUFFER_SECONDS;
mic.on('data', (chunk) => {
  chunks.push(chunk);
  bufferedSamples += chunk.length / 2; // 2 bytes per Int16 sample
  if (bufferedSamples >= targetSamples) {
    processBuffer();
  }
});
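You can sanity-check this accumulator logic without a live microphone by feeding it synthetic Int16 buffers; the sketch below stubs out transcription and just counts triggers (names are illustrative):

```javascript
const SAMPLE_RATE = 16000;
const BUFFER_SECONDS = 5;
const targetSamples = SAMPLE_RATE * BUFFER_SECONDS;

const chunks = [];
let bufferedSamples = 0;
let triggered = 0; // stands in for processBuffer()

function onData(chunk) {
  chunks.push(chunk);
  bufferedSamples += chunk.length / 2; // 2 bytes per Int16 sample
  if (bufferedSamples >= targetSamples) {
    triggered++;
    chunks.length = 0;
    bufferedSamples = 0;
  }
}

// Simulate 6 seconds of silence in 100 ms chunks (1600 samples = 3200 bytes each).
const chunk = Buffer.alloc(1600 * 2);
for (let i = 0; i < 60; i++) onData(chunk);

console.log(triggered);        // 1 (fires once at the 5-second mark)
console.log(bufferedSamples);  // 16000 (the leftover second of audio)
```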
Concatenate the buffered chunks, convert Int16 PCM to Float32 (the addon expects pcmf32 as a Float32Array with samples in the range -1.0 to 1.0), and call transcribe().
let processing = false;

async function processBuffer() {
  if (processing) return;
  processing = true;
  // Concatenate buffered chunks and reset the accumulator
  const pcm16 = Buffer.concat(chunks);
  chunks.length = 0;
  bufferedSamples = 0;
  // Convert Int16 PCM to Float32 in the range -1.0 to 1.0
  const int16 = new Int16Array(pcm16.buffer, pcm16.byteOffset, pcm16.length / 2);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    float32[i] = int16[i] / 32768;
  }
  try {
    // Transcribe
    const result = await transcribe({
      pcmf32: float32,
      model: MODEL_PATH,
      language: 'en', // use 'auto' with multilingual models (e.g. ggml-base.bin)
      no_timestamps: true,
    });
    // transcription may be nested arrays, so flatten to a single string
    const text = result.transcription.flat().join(' ').trim();
    if (text) console.log(text);
  } finally {
    processing = false; // reset even if transcription throws
  }
}
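The flatten-and-join step handles both return shapes mentioned above; a quick illustration with hypothetical results:

```javascript
// result.transcription may arrive as a flat array or nested arrays of
// segment strings; flat().join(' ').trim() normalizes both to one string.
const nested = [['hello'], ['world']];
console.log(nested.flat().join(' ').trim()); // "hello world"

const flatShape = [' hello world '];
console.log(flatShape.flat().join(' ').trim()); // "hello world"
```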
By default the addon suppresses whisper.cpp's internal log output (no_prints: true). Set it to false if you need debug output.
Stop the microphone and process any remaining audio when the user presses Ctrl+C.
process.on('SIGINT', async () => {
  mic.stop();
  if (chunks.length > 0) await processBuffer();
  process.exit(0);
});
console.log('Listening... (Ctrl+C to stop)');
const Decibri = require('decibri');
const { transcribe } = require('@kutalia/whisper-node-addon');

const MODEL_PATH = './ggml-tiny.en.bin';
const SAMPLE_RATE = 16000;
const BUFFER_SECONDS = 5; // shorter = faster feedback, longer = better accuracy

const mic = new Decibri({ sampleRate: SAMPLE_RATE, channels: 1 });

const chunks = [];
let bufferedSamples = 0;
const targetSamples = SAMPLE_RATE * BUFFER_SECONDS;
let processing = false;

async function processBuffer() {
  if (processing) return;
  processing = true;
  const pcm16 = Buffer.concat(chunks);
  chunks.length = 0;
  bufferedSamples = 0;
  const int16 = new Int16Array(pcm16.buffer, pcm16.byteOffset, pcm16.length / 2);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    float32[i] = int16[i] / 32768;
  }
  try {
    const result = await transcribe({
      pcmf32: float32,
      model: MODEL_PATH,
      language: 'en', // use 'auto' with multilingual models (e.g. ggml-base.bin)
      no_timestamps: true,
    });
    // transcription may be nested arrays, so flatten to a single string
    const text = result.transcription.flat().join(' ').trim();
    if (text) console.log(text);
  } finally {
    processing = false; // reset even if transcription throws
  }
}

mic.on('data', (chunk) => {
  chunks.push(chunk);
  bufferedSamples += chunk.length / 2; // 2 bytes per Int16 sample
  if (bufferedSamples >= targetSamples) {
    processBuffer();
  }
});

process.on('SIGINT', async () => {
  mic.stop();
  if (chunks.length > 0) await processBuffer();
  process.exit(0);
});

console.log('Listening... (Ctrl+C to stop)');
The pattern is the same regardless of which whisper.cpp binding you use: decibri provides PCM audio, and the binding provides inference. Other community bindings that work with decibri include: