Mistral Voxtral Real-Time Transcription

Stream live microphone audio to Mistral's Voxtral model for real-time cloud transcription using decibri and the official Mistral AI SDK.

What this does

This integration captures live audio from your microphone using decibri and streams it to Mistral's Voxtral Realtime API via the official SDK. Transcription text arrives progressively as you speak. No model download, no local inference, and no audio format conversion. decibri's default output matches Voxtral's expected input exactly.

Voxtral is an open-weights model (Apache 2.0) built by Mistral, a European AI company. It supports 13 languages with automatic detection. Choose this when you want a cloud STT with open-weights transparency, or when you plan to self-host via vLLM. For use cases where audio must stay on your device, see the sherpa-onnx or whisper.cpp local integrations instead.

ES Modules required

Important: The Mistral AI SDK is ESM-only. You must use import syntax, not require(). Add "type": "module" to your package.json, or use a .mjs file extension. This is different from the other decibri integration examples which use CommonJS. For environment variables, use import 'dotenv/config' instead of require('dotenv').config().

Cloud vs local

Note: Mistral Voxtral is a cloud service. Audio is sent to Mistral's servers for processing. Mistral is a European company. See Mistral's privacy documentation for data handling details. If your use case requires audio to stay entirely on-device, use the local integrations: sherpa-onnx (real-time streaming) or whisper.cpp (batch transcription).

Prerequisites

Get an API key

  1. Sign up at console.mistral.ai
  2. Create an API key from the dashboard
  3. Store it in a .env file in your project root:
MISTRAL_API_KEY=your_key_here

Install packages

$ npm install decibri @mistralai/mistralai dotenv

The dotenv package loads your API key from the .env file. If you set environment variables another way, you can omit dotenv from the install command.

No model download is required. All processing happens in Mistral's cloud.

Code walkthrough

1. Configuration

Import decibri, the Mistral Realtime SDK, and dotenv. decibri is a CommonJS package; the default import works in ESM via Node.js interop.

import 'dotenv/config';
import Decibri from 'decibri';
import { RealtimeTranscription, AudioEncoding } from '@mistralai/mistralai/extra/realtime';

const API_KEY = process.env.MISTRAL_API_KEY;
const MODEL = 'voxtral-mini-transcribe-realtime-2602';

const audioFormat = {
  encoding: AudioEncoding.PcmS16le,
  sampleRate: 16000,
};

The audio format matches decibri's default output exactly: 16-bit signed integer PCM, little-endian, 16 kHz mono. No configuration needed on the decibri side.
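To see why no conversion is needed, it helps to do the byte math: 16-bit samples are 2 bytes each, so mono audio at 16 kHz is a fixed 32,000 bytes per second. A quick sketch (plain arithmetic, no SDK required):

```javascript
// 16-bit signed PCM (s16le), mono, 16 kHz: each sample is 2 bytes.
const sampleRate = 16000;   // samples per second
const bytesPerSample = 2;   // 16 bits
const channels = 1;

const bytesPerSecond = sampleRate * bytesPerSample * channels;
console.log(bytesPerSecond); // 32000
```

Every chunk decibri emits is a slice of this raw byte stream, which is exactly what the PcmS16le encoding declares to the API.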

2. Create audio stream

The Mistral SDK expects an AsyncGenerator<Uint8Array> as its audio input. Wrap decibri's Readable stream as an async generator that yields each chunk as a Uint8Array.

async function* createAudioStream() {
  const mic = new Decibri({ sampleRate: 16000, channels: 1 });
  console.log('Listening... Speak into your microphone. (Ctrl+C to stop)');

  try {
    for await (const chunk of mic) {
      yield new Uint8Array(chunk);
    }
  } finally {
    mic.stop();
  }
}

The for await...of loop consumes decibri as an async iterable (all Node.js Readable streams support this). The finally block ensures the microphone is stopped when the generator is closed.
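The same wrapping pattern can be exercised without a microphone. This sketch substitutes a stdlib Readable built with Readable.from for the decibri stream (the generator body is the same shape as the one above; the real integration would put mic.stop() in the finally block):

```javascript
import { Readable } from 'node:stream';

// Wrap any Readable of Buffer chunks as an AsyncGenerator<Uint8Array>.
async function* wrapAsUint8(stream) {
  try {
    for await (const chunk of stream) {
      yield new Uint8Array(chunk); // copy the Buffer into a plain Uint8Array
    }
  } finally {
    // In the real integration, mic.stop() goes here.
  }
}

// Simulated microphone: two small PCM chunks.
const fakeMic = Readable.from([Buffer.from([1, 2]), Buffer.from([3, 4])]);

const chunks = [];
for await (const chunk of wrapAsUint8(fakeMic)) {
  chunks.push(chunk);
}
console.log(chunks.length);                   // 2
console.log(chunks[0] instanceof Uint8Array); // true
```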

3. Create client and transcribe

Create a RealtimeTranscription client and call transcribeStream() with the audio generator, model, and format. The SDK handles the WebSocket connection internally.

const client = new RealtimeTranscription({ apiKey: API_KEY });
const audioStream = createAudioStream();

for await (const event of client.transcribeStream(
  audioStream,
  MODEL,
  { audioFormat }
)) {
  // handle events
}

4. Handle events

The SDK emits three event types. Handle progressive text deltas, completion, and errors.

if (event.type === 'transcription.text.delta') {
  process.stdout.write(event.text);
} else if (event.type === 'transcription.done') {
  process.stdout.write('\n');
  break;
} else if (event.type === 'error') {
  const msg = typeof event.error.message === 'string'
    ? event.error.message
    : JSON.stringify(event.error.message);
  console.error('\nError:', msg);
  break;
}

transcription.text.delta events contain partial transcription text that arrives word-by-word as you speak. transcription.done signals the end of a transcription segment.
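If you need the full transcript rather than printing deltas as they arrive, accumulate the delta text into a string. A minimal sketch, using a simulated event sequence with the same shapes as the handler above:

```javascript
// Simulated event stream; real events come from client.transcribeStream().
const events = [
  { type: 'transcription.text.delta', text: 'Hello ' },
  { type: 'transcription.text.delta', text: 'world.' },
  { type: 'transcription.done' },
];

let transcript = '';
for (const event of events) {
  if (event.type === 'transcription.text.delta') {
    transcript += event.text; // append each partial as it arrives
  } else if (event.type === 'transcription.done') {
    break;                    // segment complete
  }
}
console.log(transcript); // "Hello world."
```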

5. Clean shutdown

Wrap the transcription loop in a try/finally block. Calling audioStream.return() closes the generator, which runs its finally block and stops the microphone. Note that a bare Ctrl+C (SIGINT) terminates the Node.js process immediately by default, without running finally blocks; if you need guaranteed cleanup on interrupt, install a SIGINT handler that closes the stream before exiting.

try {
  for await (const event of client.transcribeStream(audioStream, MODEL, { audioFormat })) {
    // ... handle events
  }
} finally {
  await audioStream.return?.();
}
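Why is audioStream.return() enough? Closing an async generator resumes it with a return, which runs its finally block. A self-contained sketch (the flag stands in for mic.stop()):

```javascript
let cleanedUp = false;

async function* fakeAudioStream() {
  try {
    while (true) {
      yield new Uint8Array([0, 0]); // endless silence
    }
  } finally {
    cleanedUp = true; // stands in for mic.stop()
  }
}

const stream = fakeAudioStream();
await stream.next();     // start the generator; it suspends at yield
await stream.return?.(); // close it; the finally block runs
console.log(cleanedUp);  // true
```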

Full example

import 'dotenv/config';
import Decibri from 'decibri';
import { RealtimeTranscription, AudioEncoding } from '@mistralai/mistralai/extra/realtime';

const API_KEY = process.env.MISTRAL_API_KEY;
const MODEL = 'voxtral-mini-transcribe-realtime-2602';

const audioFormat = {
  encoding: AudioEncoding.PcmS16le,
  sampleRate: 16000,
};

async function* createAudioStream() {
  const mic = new Decibri({ sampleRate: 16000, channels: 1 });
  console.log('Listening... Speak into your microphone. (Ctrl+C to stop)');

  try {
    for await (const chunk of mic) {
      yield new Uint8Array(chunk);
    }
  } finally {
    mic.stop();
  }
}

const client = new RealtimeTranscription({ apiKey: API_KEY });
const audioStream = createAudioStream();

try {
  for await (const event of client.transcribeStream(
    audioStream,
    MODEL,
    { audioFormat }
  )) {
    if (event.type === 'transcription.text.delta') {
      process.stdout.write(event.text);
    } else if (event.type === 'transcription.done') {
      process.stdout.write('\n');
      break;
    } else if (event.type === 'error') {
      const msg = typeof event.error.message === 'string'
        ? event.error.message
        : JSON.stringify(event.error.message);
      console.error('\nError:', msg);
      break;
    }
  }
} finally {
  await audioStream.return?.();
}

Save this as a .mjs file (e.g. transcribe.mjs) or add "type": "module" to your package.json, then run with node transcribe.mjs.

Configuration options

Options passed to transcribeStream() and the RealtimeTranscription client.

model
  Default: 'voxtral-mini-transcribe-realtime-2602'
  Realtime transcription model. Currently the only realtime-capable model.

encoding
  Default: AudioEncoding.PcmS16le
  Audio encoding. PCM 16-bit signed little-endian matches decibri's default.

sampleRate
  Default: 16000
  Sample rate in Hz. 16 kHz matches decibri's default.

targetStreamingDelayMs
  Default: none
  Optional. Milliseconds to wait before starting transcription to gather context. 480 ms is a good balance between latency and accuracy. Range: 240–2400 ms.

serverURL
  Default: 'wss://api.mistral.ai'
  Optional. WebSocket endpoint. Override for self-hosted deployments via vLLM.
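Putting the table together, the options can be assembled as plain objects before being passed in. A sketch only: the option names are taken from the table above and have not been verified against the SDK's type definitions, and the string encoding value stands in for the AudioEncoding.PcmS16le enum member:

```javascript
const audioFormat = {
  encoding: 'pcm_s16le', // stand-in for AudioEncoding.PcmS16le
  sampleRate: 16000,     // decibri's default rate
};

const streamOptions = {
  audioFormat,
  targetStreamingDelayMs: 480, // gather ~480 ms of context before transcribing
};

// Usage (not executed here):
// const client = new RealtimeTranscription({ apiKey: process.env.MISTRAL_API_KEY });
// for await (const event of client.transcribeStream(audioStream, MODEL, streamOptions)) { ... }
console.log(streamOptions.targetStreamingDelayMs); // 480
```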

Check Mistral's model documentation for the latest model version and available options.