Azure Speech-to-Text Streaming

Stream live microphone audio to Azure Speech-to-Text using decibri and PushAudioInputStream. This fills in the microphone support that is missing from Azure's Node.js SDK.

What this does

This integration captures live audio from your microphone using decibri and pushes it into Azure's PushAudioInputStream for real-time cloud transcription. Results come back via event callbacks as you speak, with both partial and final results.

Important: Azure's Speech SDK does not support microphone input in Node.js. The fromDefaultMicrophoneInput() method is browser-only. decibri provides the missing audio capture layer, feeding PCM audio into Azure's PushAudioInputStream.

Choose this when you need Azure's speech models, enterprise Azure integration, or the most generous free tier (5 hours/month). For use cases where audio must stay on your device, see the sherpa-onnx or whisper.cpp local integrations instead.

Cloud vs local

Note: Azure Speech-to-Text is a cloud service. Audio is sent to Microsoft's servers for processing. If your use case requires audio to stay entirely on-device, use the local integrations: sherpa-onnx (real-time streaming) or whisper.cpp (batch transcription).

Prerequisites

Set up Azure credentials

Azure Speech authenticates with a subscription key and region, which is simpler than AWS (IAM roles) or Google Cloud (service account JSON).

  1. Go to the Azure Portal
  2. Create a resource: search for "Speech" and select "Speech service"
  3. Select a pricing tier (Free F0 gives 5 hours/month at no cost)
  4. After creation, go to the resource and click "Keys and Endpoint"
  5. Copy Key 1 and the Region (e.g. australiaeast)

Configure credentials using one of these methods:

Option 1: .env file with dotenv

AZURE_SPEECH_KEY=your_subscription_key
AZURE_SPEECH_REGION=australiaeast

Option 2: Environment variables

export AZURE_SPEECH_KEY=your_subscription_key
export AZURE_SPEECH_REGION=australiaeast
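Because a missing key or region produces a confusing connection error only after recognition starts, it can help to fail fast at startup. A minimal sketch (the `checkAzureCredentials` helper is illustrative, not part of any SDK; variable names follow the .env example above):

```javascript
// Fail fast if the Azure credentials from the steps above are not set.
// Illustrative helper - not part of decibri or the Azure Speech SDK.
function checkAzureCredentials(env) {
  const missing = ['AZURE_SPEECH_KEY', 'AZURE_SPEECH_REGION']
    .filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing Azure credentials: ${missing.join(', ')}`);
  }
}

// Call once at startup, before creating the SpeechConfig:
// checkAzureCredentials(process.env);
```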

Install packages

$ npm install decibri microsoft-cognitiveservices-speech-sdk dotenv

The dotenv package loads your credentials from a .env file. If you set environment variables another way, you can skip it.

No model download is required. All processing happens in Azure's cloud.

Code walkthrough

1. Configuration

Import decibri, the Azure Speech SDK, and dotenv. Create a SpeechConfig with your subscription key and region.

'use strict';
require('dotenv').config();

const Decibri = require('decibri');
const sdk = require('microsoft-cognitiveservices-speech-sdk');

const speechConfig = sdk.SpeechConfig.fromSubscription(
  process.env.AZURE_SPEECH_KEY,
  process.env.AZURE_SPEECH_REGION
);
speechConfig.speechRecognitionLanguage = 'en-US';

2. Create the push stream

Azure's Node.js SDK cannot access the microphone directly. Instead, create a PushAudioInputStream and wire it into a SpeechRecognizer. The default push stream format is 16 kHz, 16-bit, mono PCM, which matches decibri's default output exactly.

const pushStream = sdk.AudioInputStream.createPushStream();
const audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);
const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);
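If you ever capture at a different rate, the push stream's format can also be declared explicitly rather than relying on the default. A sketch using the SDK's AudioStreamFormat.getWaveFormatPCM, here simply mirroring the 16 kHz, 16-bit, mono default:

```javascript
const sdk = require('microsoft-cognitiveservices-speech-sdk');

// Explicitly declare the PCM format the push stream will receive
// (sample rate in Hz, bits per sample, channel count).
const format = sdk.AudioStreamFormat.getWaveFormatPCM(16000, 16, 1);
const pushStream = sdk.AudioInputStream.createPushStream(format);
const audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);
```

Whatever format you declare here must match what decibri is actually capturing, or transcription quality will degrade silently.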

3. Open the microphone and push audio

Create a decibri instance at 16 kHz mono and push each audio chunk into the push stream.

const mic = new Decibri({ sampleRate: 16000, channels: 1 });
mic.on('data', (chunk) => {
  pushStream.write(chunk.buffer.slice(
    chunk.byteOffset,
    chunk.byteOffset + chunk.byteLength
  ));
});

Buffer handling: pushStream.write() expects an ArrayBuffer, not a Node.js Buffer. Use chunk.buffer.slice(chunk.byteOffset, chunk.byteOffset + chunk.byteLength) to safely extract the underlying ArrayBuffer. This handles cases where Node.js pools Buffers with a shared backing store.
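The slice above can be wrapped in a small reusable helper (`toArrayBuffer` is an illustrative name, not part of decibri or the Azure SDK):

```javascript
// Extract the exact ArrayBuffer region backing a Node.js Buffer.
// Needed because Node.js may pool small Buffers in a shared backing
// store, so chunk.buffer alone can contain unrelated bytes.
function toArrayBuffer(chunk) {
  return chunk.buffer.slice(
    chunk.byteOffset,
    chunk.byteOffset + chunk.byteLength
  );
}

// Usage: mic.on('data', (chunk) => pushStream.write(toArrayBuffer(chunk)));
```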

4. Handle results

Azure Speech uses event callbacks for results. Unlike other integrations that use async iteration or stream events, Azure fires recognizing and recognized events on the recognizer:

recognizer.recognizing = (s, e) => {
  process.stdout.write(`\r  [partial] ${e.result.text}                    `);
};

recognizer.recognized = (s, e) => {
  if (e.result.reason === sdk.ResultReason.RecognizedSpeech) {
    console.log(`\n  [final]   ${e.result.text}`);
  }
};

recognizer.canceled = (s, e) => {
  if (e.reason === sdk.CancellationReason.Error) {
    console.error(`Error: ${e.errorDetails}`);
  }
};

5. Start continuous recognition

Use startContinuousRecognitionAsync() for ongoing microphone input. Do not use recognizeOnceAsync(), which stops after a single utterance.

recognizer.startContinuousRecognitionAsync(
  () => console.log('Recognition started.\n'),
  (err) => console.error('Error starting recognition:', err)
);

6. Clean shutdown

Stop recognition, close the recognizer, microphone, and push stream when the user presses Ctrl+C.

process.on('SIGINT', () => {
  console.log('\nStopping...');
  recognizer.stopContinuousRecognitionAsync(() => {
    recognizer.close();
    mic.stop();
    pushStream.close();
    process.exit(0);
  });
});

Recognition may process buffered audio after Ctrl+C before the session fully stops. This is normal.

Full example

'use strict';
require('dotenv').config();

const Decibri = require('decibri');
const sdk = require('microsoft-cognitiveservices-speech-sdk');

async function main() {
  const speechConfig = sdk.SpeechConfig.fromSubscription(
    process.env.AZURE_SPEECH_KEY,
    process.env.AZURE_SPEECH_REGION
  );
  speechConfig.speechRecognitionLanguage = 'en-US';

  // Create push stream (default format: 16kHz, 16-bit, mono PCM)
  const pushStream = sdk.AudioInputStream.createPushStream();
  const audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);
  const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

  // Open microphone and push audio to Azure
  const mic = new Decibri({ sampleRate: 16000, channels: 1 });
  mic.on('data', (chunk) => {
    pushStream.write(chunk.buffer.slice(
      chunk.byteOffset,
      chunk.byteOffset + chunk.byteLength
    ));
  });

  console.log('Azure Speech-to-Text streaming test');
  console.log('Speak into your microphone. Press Ctrl+C to stop.\n');

  // Handle results
  recognizer.recognizing = (s, e) => {
    process.stdout.write(`\r  [partial] ${e.result.text}                    `);
  };

  recognizer.recognized = (s, e) => {
    if (e.result.reason === sdk.ResultReason.RecognizedSpeech) {
      console.log(`\n  [final]   ${e.result.text}`);
    }
  };

  recognizer.canceled = (s, e) => {
    if (e.reason === sdk.CancellationReason.Error) {
      console.error(`Error: ${e.errorDetails}`);
    }
  };

  recognizer.sessionStopped = (s, e) => {
    console.log('\nSession stopped.');
  };

  // Start continuous recognition
  recognizer.startContinuousRecognitionAsync(
    () => console.log('Recognition started.\n'),
    (err) => console.error('Error starting recognition:', err)
  );

  // Clean shutdown
  process.on('SIGINT', () => {
    console.log('\nStopping...');
    recognizer.stopContinuousRecognitionAsync(() => {
      recognizer.close();
      mic.stop();
      pushStream.close();
      process.exit(0);
    });
  });
}

main().catch(console.error);

Configuration options

The SpeechConfig controls how Azure processes your audio. Here are the most useful options:

| Option | Default | Description |
|---|---|---|
| speechRecognitionLanguage | 'en-US' | BCP-47 language code (e.g. 'fr-FR', 'de-DE', 'ja-JP'). Supports 100+ languages. |
| outputFormat | Simple | Set to sdk.OutputFormat.Detailed for confidence scores and alternatives. |
| enableDictation() | off | Call this method to enable dictation mode for longer speech with automatic punctuation. |

See the Azure Speech-to-Text documentation for the complete list of options.
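Applying the options above is a matter of a few lines on the SpeechConfig created earlier. A sketch (assumes the same `sdk` and `speechConfig` variables as in the walkthrough; the French language code is just an example):

```javascript
// Configure recognition on the SpeechConfig from step 1.
speechConfig.speechRecognitionLanguage = 'fr-FR';       // any BCP-47 code
speechConfig.outputFormat = sdk.OutputFormat.Detailed;  // confidence scores + alternatives
speechConfig.enableDictation();                         // note: a method call, not a property
```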

Free tier: Azure Speech Free F0 tier gives 5 hours of audio per month at no cost. No credit card required for the free tier.