Stream live microphone audio to OpenAI for real-time transcription using decibri and the OpenAI Realtime API.
This integration captures live audio from your microphone using decibri and streams it to OpenAI's Realtime API over a WebSocket. Transcription results arrive progressively as word-by-word deltas, followed by a completed transcript after the server-side voice activity detection (VAD) detects a pause. There is no model download and no local inference required.
Choose this when you want to use OpenAI's transcription models directly. For cloud STT with an official SDK wrapper and simpler setup, see Deepgram or AssemblyAI. For use cases where audio must stay on your device, see the sherpa-onnx or whisper.cpp local integrations.
The `openai` npm package does not provide a high-level wrapper for the Realtime transcription API, so this integration uses the `ws` WebSocket package directly to connect to OpenAI's streaming endpoint. This means you manage the connection, session configuration, and message parsing yourself; the tradeoff is full control over the streaming lifecycle.
Create a `.env` file in your project root:

```
OPENAI_API_KEY=your_key_here
```
The dotenv package loads your API key from the .env file. The ws package provides the WebSocket connection. If you set environment variables another way, you can skip dotenv.
No model download is required. All processing happens in OpenAI's cloud.
The OpenAI Realtime API expects PCM16 audio at 24000 Hz by default, so decibri's `sampleRate` is set to 24000 to match.
Import decibri, the ws WebSocket package, and dotenv. Set the Realtime API endpoint and your API key.
```javascript
require('dotenv').config();
const Decibri = require('decibri');
const WebSocket = require('ws');

const API_KEY = process.env.OPENAI_API_KEY;
const WS_URL = 'wss://api.openai.com/v1/realtime?intent=transcription';
```
Create a WebSocket connection with your API key in the Authorization header.
```javascript
const ws = new WebSocket(WS_URL, {
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
  },
});
```
After the WebSocket opens, send a session.update message to configure the transcription model and voice activity detection. The audio format defaults to PCM16 at 24000 Hz, which matches decibri's output when configured at that sample rate.
```javascript
ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      type: 'transcription',
      audio: {
        input: {
          transcription: {
            model: 'gpt-4o-mini-transcribe',
          },
          turn_detection: {
            type: 'server_vad',
            threshold: 0.5,
            silence_duration_ms: 500,
            prefix_padding_ms: 300,
          },
        },
      },
    },
  }));
});
```
Create a decibri instance at 24000 Hz mono. Each audio chunk must be base64-encoded before sending. This is the one extra step compared to Deepgram and AssemblyAI, which accept raw buffers directly.
```javascript
const mic = new Decibri({ sampleRate: 24000, channels: 1 });

mic.on('data', (chunk) => {
  // Base64-encode PCM16 data before sending
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: chunk.toString('base64'),
  }));
});
```
OpenAI sends a sequence of events for each speech segment. The full flow is:
- `input_audio_buffer.speech_started` when VAD detects speech
- `input_audio_buffer.speech_stopped` when VAD detects silence
- `input_audio_buffer.committed` when the audio chunk is committed
- `conversation.item.input_audio_transcription.delta` for each word as it is recognized
- `conversation.item.input_audio_transcription.completed` with the final transcript

To print only final transcripts:
```javascript
ws.on('message', (data) => {
  const event = JSON.parse(data.toString());

  if (event.type === 'conversation.item.input_audio_transcription.completed') {
    console.log(event.transcript);
  }

  if (event.type === 'error') {
    console.error('Error:', event.error);
  }
});
```
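To show partial results as they stream in, you can also handle the `delta` events. The sketch below uses a small handler factory so the output sink can be swapped out; `makeEventHandler` and its `write` parameter are illustrative helpers, not part of any API:

```javascript
// Sketch: print word-by-word deltas as they arrive, then end the line
// when the final transcript for that speech segment is completed.
function makeEventHandler(write = process.stdout.write.bind(process.stdout)) {
  let partial = '';
  return (event) => {
    if (event.type === 'conversation.item.input_audio_transcription.delta') {
      partial += event.delta;
      write(event.delta); // progressive, word-by-word output
    }
    if (event.type === 'conversation.item.input_audio_transcription.completed') {
      write('\n'); // segment finished; reset the accumulator
      partial = '';
    }
    return partial;
  };
}

// Wire it into the socket in place of the handler above:
// const handle = makeEventHandler();
// ws.on('message', (data) => handle(JSON.parse(data.toString())));
```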
Each `completed` event also includes a `usage` object with token counts (`input_tokens`, `output_tokens`, `total_tokens`), which can be used to track costs.
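A minimal sketch of accumulating those counts across a session (the `trackUsage` helper is illustrative; the field names follow the `usage` object described above, and no pricing is assumed):

```javascript
// Running totals of token usage reported by completed transcription events.
const totals = { input_tokens: 0, output_tokens: 0, total_tokens: 0 };

function trackUsage(event) {
  if (event.type === 'conversation.item.input_audio_transcription.completed' && event.usage) {
    totals.input_tokens += event.usage.input_tokens ?? 0;
    totals.output_tokens += event.usage.output_tokens ?? 0;
    totals.total_tokens += event.usage.total_tokens ?? 0;
  }
  return totals;
}

// Call trackUsage(event) inside the 'message' handler, and log `totals`
// on shutdown to see the session's total token consumption.
```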
Stop the microphone and close the WebSocket when the user presses Ctrl+C.
```javascript
ws.on('close', (code, reason) => {
  console.log('Connection closed:', code, reason.toString());
});

ws.on('error', (err) => {
  console.error('WebSocket error:', err.message);
});

mic.on('error', (err) => {
  console.error('Mic error:', err.message);
});

process.on('SIGINT', () => {
  console.log('\nStopping...');
  mic.stop();
  ws.close();
  process.exit(0);
});
```
```javascript
'use strict';

require('dotenv').config();
const Decibri = require('decibri');
const WebSocket = require('ws');

const API_KEY = process.env.OPENAI_API_KEY;
const WS_URL = 'wss://api.openai.com/v1/realtime?intent=transcription';

const run = async () => {
  const ws = new WebSocket(WS_URL, {
    headers: {
      'Authorization': `Bearer ${API_KEY}`,
    },
  });

  ws.on('open', () => {
    console.log('Connected to OpenAI Realtime API');

    // Configure transcription session
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        type: 'transcription',
        audio: {
          input: {
            transcription: {
              model: 'gpt-4o-mini-transcribe',
            },
            turn_detection: {
              type: 'server_vad',
              threshold: 0.5,
              silence_duration_ms: 500,
              prefix_padding_ms: 300,
            },
          },
        },
      },
    }));

    console.log('Session configured. Opening microphone...\n');

    // Important: 24000 Hz, not 16000 Hz. OpenAI Realtime API defaults to 24 kHz.
    const mic = new Decibri({ sampleRate: 24000, channels: 1 });

    mic.on('data', (chunk) => {
      ws.send(JSON.stringify({
        type: 'input_audio_buffer.append',
        audio: chunk.toString('base64'),
      }));
    });

    mic.on('error', (err) => {
      console.error('Mic error:', err.message);
    });

    process.on('SIGINT', () => {
      console.log('\nStopping...');
      mic.stop();
      ws.close();
      process.exit(0);
    });
  });

  ws.on('message', (data) => {
    const event = JSON.parse(data.toString());

    if (event.type === 'conversation.item.input_audio_transcription.completed') {
      console.log(event.transcript);
    }

    if (event.type === 'error') {
      console.error('Error:', event.error);
    }
  });

  ws.on('error', (err) => {
    console.error('WebSocket error:', err.message);
  });

  ws.on('close', (code, reason) => {
    console.log('Connection closed:', code, reason.toString());
  });
};

run().catch(console.error);
```
The session configuration controls how OpenAI processes your audio. These options are sent in the session.update message after connecting.
| Option | Value | Description |
|---|---|---|
| `transcription.model` | `'gpt-4o-mini-transcribe'` | Transcription model. Use `'gpt-4o-transcribe'` for higher accuracy. |
| `turn_detection.type` | `'server_vad'` | Server-side voice activity detection for automatic turn boundaries. |
| `turn_detection.threshold` | `0.5` | VAD sensitivity from 0 to 1. Lower values detect quieter speech. |
| `turn_detection.silence_duration_ms` | `500` | Milliseconds of silence before a turn is considered complete. |
| `turn_detection.prefix_padding_ms` | `300` | Audio to include before the start of detected speech. |
| `noise_reduction.type` | `'near_field'` | Noise reduction mode. Use `'far_field'` for distant microphones. |
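For example, enabling noise reduction would extend the `session.update` payload shown earlier. This sketch assumes `noise_reduction` sits under `audio.input` alongside `transcription` and `turn_detection`; check the Realtime API reference for the exact placement:

```javascript
// Sketch: session configuration with noise reduction added.
// Placement of noise_reduction follows the structure of the earlier
// session.update message and is an assumption, not a verified schema.
const session = {
  type: 'transcription',
  audio: {
    input: {
      transcription: { model: 'gpt-4o-mini-transcribe' },
      turn_detection: {
        type: 'server_vad',
        threshold: 0.5,
        silence_duration_ms: 500,
        prefix_padding_ms: 300,
      },
      noise_reduction: { type: 'near_field' }, // or 'far_field' for distant mics
    },
  },
};

// Sent the same way as before:
// ws.send(JSON.stringify({ type: 'session.update', session }));
```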
See the OpenAI Realtime API documentation for the complete list of options.