OpenAI Real-Time Transcription

Stream live microphone audio to OpenAI for real-time transcription using decibri and the OpenAI Realtime API.

What this does

This integration captures live audio from your microphone using decibri and streams it to OpenAI's Realtime API over a WebSocket. Transcription results arrive progressively as word-by-word deltas, followed by a completed transcript once server-side voice activity detection (VAD) detects a pause. No model download or local inference is required.

Choose this when you want to use OpenAI's transcription models directly. For cloud STT with an official SDK wrapper and simpler setup, see Deepgram or AssemblyAI. For use cases where audio must stay on your device, see the sherpa-onnx or whisper.cpp local integrations.

Why raw WebSocket

Note: The openai npm package does not provide a high-level wrapper for the Realtime transcription API. This integration uses the ws WebSocket package directly to connect to OpenAI's streaming endpoint. This means you manage the connection, session configuration, and message parsing yourself. The tradeoff is full control over the streaming lifecycle.

Cloud vs local

Note: Audio is sent to OpenAI's servers for processing. Review OpenAI's data usage policy for details on how audio data is handled. If your use case requires audio to stay entirely on-device, use the local integrations: sherpa-onnx (real-time streaming) or whisper.cpp (batch transcription).

Prerequisites

Get an API key

  1. Sign up at platform.openai.com
  2. Create an API key from the dashboard
  3. Store it in a .env file in your project root:
     OPENAI_API_KEY=your_key_here

Install packages

$ npm install decibri ws dotenv

The dotenv package loads your API key from the .env file. The ws package provides the WebSocket connection. If you set environment variables another way, you can skip dotenv.

No model download is required. All processing happens in OpenAI's cloud.

Sample rate

Important: OpenAI's Realtime API defaults to 24000 Hz, not 16000 Hz. This is different from every other integration in these docs. Make sure decibri's sampleRate is set to 24000 to match.
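At 24 kHz mono PCM16, one second of audio is 48,000 bytes (2 bytes per sample). A quick sanity check for the chunk sizes decibri emits (pcm16DurationMs is a hypothetical helper for illustration, not part of any library):

```javascript
// Duration in milliseconds of a PCM16 mono buffer at a given sample rate.
// PCM16 uses 2 bytes per sample, so byteLength / 2 = sample count.
function pcm16DurationMs(byteLength, sampleRate = 24000) {
  return (byteLength / 2 / sampleRate) * 1000;
}

console.log(pcm16DurationMs(48000)); // one second of 24 kHz audio → 1000
```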

Code walkthrough

1. Configuration

Import decibri, the ws WebSocket package, and dotenv. Set the Realtime API endpoint and your API key.

require('dotenv').config();

const Decibri = require('decibri');
const WebSocket = require('ws');

const API_KEY = process.env.OPENAI_API_KEY;
const WS_URL = 'wss://api.openai.com/v1/realtime?intent=transcription';
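It's worth failing fast when the key is missing, since the WebSocket would otherwise fail later with a less obvious authorization error. A small guard (requireEnv is a hypothetical helper, not part of any package used here):

```javascript
// Read an environment variable, throwing a clear error if it is unset.
function requireEnv(name) {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing environment variable: ${name}. Add it to your .env file.`);
  }
  return value;
}

// Usage: const API_KEY = requireEnv('OPENAI_API_KEY');
```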

2. Connect to OpenAI

Create a WebSocket connection with your API key in the Authorization header.

const ws = new WebSocket(WS_URL, {
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
  },
});

3. Configure the transcription session

After the WebSocket opens, send a session.update message to configure the transcription model and voice activity detection. The audio format defaults to PCM16 at 24000 Hz, which matches decibri's output when configured at that sample rate.

ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      type: 'transcription',
      audio: {
        input: {
          transcription: {
            model: 'gpt-4o-mini-transcribe',
          },
          turn_detection: {
            type: 'server_vad',
            threshold: 0.5,
            silence_duration_ms: 500,
            prefix_padding_ms: 300,
          },
        },
      },
    },
  }));
});

4. Open the microphone and stream audio

Create a decibri instance at 24000 Hz mono. Each audio chunk must be base64-encoded before sending. This is the one extra step compared to Deepgram and AssemblyAI, which accept raw buffers directly.

const mic = new Decibri({ sampleRate: 24000, channels: 1 });

mic.on('data', (chunk) => {
  // Base64-encode PCM16 data before sending
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: chunk.toString('base64'),
  }));
});

5. Handle transcription results

OpenAI sends a sequence of events for each speech segment. The full flow is:

  1. input_audio_buffer.speech_started when VAD detects speech
  2. input_audio_buffer.speech_stopped when VAD detects silence
  3. input_audio_buffer.committed when the audio chunk is committed
  4. conversation.item.input_audio_transcription.delta for each word as it is recognized
  5. conversation.item.input_audio_transcription.completed with the final transcript

To print only final transcripts:

ws.on('message', (data) => {
  const event = JSON.parse(data.toString());

  if (event.type === 'conversation.item.input_audio_transcription.completed') {
    console.log(event.transcript);
  }

  if (event.type === 'error') {
    console.error('Error:', event.error);
  }
});
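To also surface the word-by-word partials from the event flow above, you can route delta and completed events separately. A sketch (handleTranscriptionEvent and the injected callbacks are assumptions for illustration; the event type names are the ones listed above):

```javascript
// Route a parsed Realtime event to partial or final handlers.
// Callbacks are injected so the routing logic stays easy to test.
function handleTranscriptionEvent(event, { onDelta, onCompleted }) {
  if (event.type === 'conversation.item.input_audio_transcription.delta') {
    onDelta(event.delta);            // incremental text fragment
  } else if (event.type === 'conversation.item.input_audio_transcription.completed') {
    onCompleted(event.transcript);   // final transcript for the segment
  }
}

// Wiring it into the message handler:
// ws.on('message', (data) => {
//   handleTranscriptionEvent(JSON.parse(data.toString()), {
//     onDelta: (text) => process.stdout.write(text),
//     onCompleted: (text) => console.log('\nFinal:', text),
//   });
// });
```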

Each completed event also includes a usage object with token counts (input_tokens, output_tokens, total_tokens), which can be used to track costs.
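One way to use that usage object is to keep a running total across segments. A minimal sketch (makeUsageTracker is a hypothetical helper; the field names are the ones described above):

```javascript
// Accumulate token counts from each completed event's usage object.
function makeUsageTracker() {
  const totals = { input_tokens: 0, output_tokens: 0, total_tokens: 0 };
  return {
    add(usage = {}) {
      totals.input_tokens += usage.input_tokens || 0;
      totals.output_tokens += usage.output_tokens || 0;
      totals.total_tokens += usage.total_tokens || 0;
    },
    snapshot: () => ({ ...totals }),
  };
}

// In the message handler:
// if (event.type === 'conversation.item.input_audio_transcription.completed') {
//   tracker.add(event.usage);
// }
```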

6. Clean shutdown

Stop the microphone and close the WebSocket when the user presses Ctrl+C.

ws.on('close', (code, reason) => {
  console.log('Connection closed:', code, reason.toString());
});

ws.on('error', (err) => {
  console.error('WebSocket error:', err.message);
});

mic.on('error', (err) => {
  console.error('Mic error:', err.message);
});

process.on('SIGINT', () => {
  console.log('\nStopping...');
  mic.stop();
  ws.close();
  process.exit(0);
});
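The walkthrough above does not reconnect if the connection drops. If your app should survive network interruptions, a common pattern is to reopen the socket with exponential backoff; a minimal delay helper (the helper and its defaults are assumptions, not part of any SDK):

```javascript
// Delay in ms before reconnect attempt `attempt` (0-based),
// doubling each time and capped at maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// e.g. schedule a reconnect on close:
// ws.on('close', () => setTimeout(connect, backoffDelay(attempt++)));
```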

Full example

View complete code
'use strict';
require('dotenv').config();

const Decibri = require('decibri');
const WebSocket = require('ws');

const API_KEY = process.env.OPENAI_API_KEY;
const WS_URL = 'wss://api.openai.com/v1/realtime?intent=transcription';

const run = async () => {
  const ws = new WebSocket(WS_URL, {
    headers: {
      'Authorization': `Bearer ${API_KEY}`,
    },
  });

  ws.on('open', () => {
    console.log('Connected to OpenAI Realtime API');

    // Configure transcription session
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        type: 'transcription',
        audio: {
          input: {
            transcription: {
              model: 'gpt-4o-mini-transcribe',
            },
            turn_detection: {
              type: 'server_vad',
              threshold: 0.5,
              silence_duration_ms: 500,
              prefix_padding_ms: 300,
            },
          },
        },
      },
    }));

    console.log('Session configured. Opening microphone...\n');

    // Important: 24000 Hz, not 16000 Hz. OpenAI Realtime API defaults to 24 kHz.
    const mic = new Decibri({ sampleRate: 24000, channels: 1 });

    mic.on('data', (chunk) => {
      ws.send(JSON.stringify({
        type: 'input_audio_buffer.append',
        audio: chunk.toString('base64'),
      }));
    });

    mic.on('error', (err) => {
      console.error('Mic error:', err.message);
    });

    process.on('SIGINT', () => {
      console.log('\nStopping...');
      mic.stop();
      ws.close();
      process.exit(0);
    });
  });

  ws.on('message', (data) => {
    const event = JSON.parse(data.toString());

    if (event.type === 'conversation.item.input_audio_transcription.completed') {
      console.log(event.transcript);
    }

    if (event.type === 'error') {
      console.error('Error:', event.error);
    }
  });

  ws.on('error', (err) => {
    console.error('WebSocket error:', err.message);
  });

  ws.on('close', (code, reason) => {
    console.log('Connection closed:', code, reason.toString());
  });
};

run().catch(console.error);

Configuration options

The session configuration controls how OpenAI processes your audio. These options are sent in the session.update message after connecting.

| Option | Value | Description |
| --- | --- | --- |
| transcription.model | 'gpt-4o-mini-transcribe' | Transcription model. Use 'gpt-4o-transcribe' for higher accuracy. |
| turn_detection.type | 'server_vad' | Server-side voice activity detection for automatic turn boundaries. |
| turn_detection.threshold | 0.5 | VAD sensitivity from 0 to 1. Lower values detect quieter speech. |
| turn_detection.silence_duration_ms | 500 | Milliseconds of silence before a turn is considered complete. |
| turn_detection.prefix_padding_ms | 300 | Audio to include before the start of detected speech. |
| noise_reduction.type | 'near_field' | Noise reduction mode. Use 'far_field' for distant microphones. |
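The noise_reduction option goes in the same audio.input block as transcription and turn_detection. For example, to enable far-field mode, the session payload would look like this (a fragment mirroring the session.update message from the walkthrough):

```javascript
// Session config with noise reduction enabled for a distant microphone.
const session = {
  type: 'transcription',
  audio: {
    input: {
      transcription: { model: 'gpt-4o-mini-transcribe' },
      turn_detection: {
        type: 'server_vad',
        threshold: 0.5,
        silence_duration_ms: 500,
        prefix_padding_ms: 300,
      },
      // 'near_field' for a close mic, 'far_field' for distant mics
      noise_reduction: { type: 'far_field' },
    },
  },
};

console.log(JSON.stringify({ type: 'session.update', session }));
```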

See the OpenAI Realtime API documentation for the complete list of options.