Google Cloud Speech-to-Text Streaming

Stream live microphone audio to Google Cloud Speech-to-Text using decibri and a single pipe() call.

What this does

This is the simplest cloud integration in the decibri docs. Google's streamingRecognize() returns a standard Node.js duplex stream: you write audio into it, and it emits transcription results as data events. decibri is a Readable stream, so the entire integration is one line:

mic.pipe(recognizeStream);

No async generator (like AWS Transcribe), no WebSocket management (like Deepgram or OpenAI). Just standard Node.js streams piped together over gRPC.

Choose this when you need managed cloud transcription with Google's speech models, support for 125+ languages, or the simplest possible integration code. For use cases where audio must stay on your device, see the sherpa-onnx or whisper.cpp local integrations instead.

Cloud vs local

Note: Google Cloud Speech-to-Text is a cloud service. Audio is sent to Google's servers for processing. If your use case requires audio to stay entirely on-device, use the local integrations: sherpa-onnx (real-time streaming) or whisper.cpp (batch transcription).

Prerequisites

Set up Google Cloud credentials

Google Cloud uses a service account JSON key file for authentication.

  1. Go to the Google Cloud Console
  2. Create a project (or select an existing one)
  3. Enable the Cloud Speech-to-Text API: go to APIs & Services > Library, search "Speech-to-Text", and click Enable
  4. Create a service account: go to IAM & Admin > Service Accounts > Create Service Account
  5. Create a JSON key: click on the service account > Keys tab > Add Key > Create new key > JSON
  6. A JSON file downloads automatically. Store it securely.

Configure credentials using one of these methods:

Option 1: Environment variable (recommended)

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-key-file.json"

Option 2: .env file with dotenv

GOOGLE_APPLICATION_CREDENTIALS=./your-key-file.json

Option 3: Direct in code (testing only)

const client = new speech.SpeechClient({
  keyFilename: './your-key-file.json'
});

The SDK checks for the GOOGLE_APPLICATION_CREDENTIALS environment variable automatically.

Install packages

$ npm install decibri @google-cloud/speech dotenv

The dotenv package loads your credentials path from a .env file. If you set the environment variable another way, you can skip it.

No model download is required. All processing happens in Google's cloud.

Streaming limit: Google Cloud Speech-to-Text has a 5-minute limit per streaming session. After 5 minutes, the stream closes and you must create a new one. This is fine for voice commands and short interactions. For continuous transcription, see Google's endless streaming tutorial.
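A naive way to live with the 5-minute limit is to tear down and recreate the stream on a timer. The sketch below is not Google's endless-streaming technique (it drops any words spoken across a restart boundary): keepStreaming is a hypothetical helper, and createStream is assumed to be a factory such as () => client.streamingRecognize(request) with your event handlers attached. It relies only on the standard Readable pipe()/unpipe() API.

```javascript
// Sketch: restart the recognize stream on a timer, with margin before
// Google's 5-minute cap. Words spanning a restart boundary are lost.
const STREAMING_LIMIT_MS = 4.5 * 60 * 1000;

function keepStreaming(mic, createStream, limitMs = STREAMING_LIMIT_MS) {
  let current = createStream();
  mic.pipe(current);

  const timer = setInterval(() => {
    mic.unpipe(current);        // stop feeding the old stream
    current.end();              // close it cleanly
    current = createStream();   // open a fresh session
    mic.pipe(current);
  }, limitMs);

  return () => {                // call to stop (e.g. on SIGINT)
    clearInterval(timer);
    mic.unpipe(current);
    current.end();
  };
}
```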

Code walkthrough

1. Configuration

Import decibri, the Google Cloud Speech SDK, and dotenv. Create a client.

'use strict';
require('dotenv').config();

const Decibri = require('decibri');
const speech = require('@google-cloud/speech');

const client = new speech.SpeechClient();

2. Configure the recognition request

Set the audio encoding, sample rate, language, and whether you want interim (partial) results.

const request = {
  config: {
    encoding: 'LINEAR16',
    sampleRateHertz: 16000,
    languageCode: 'en-US',
  },
  interimResults: true,
};

Note: LINEAR16 is Google's name for PCM signed 16-bit little-endian, the exact format decibri outputs by default. No conversion needed.
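The byte rate follows directly from the format, which makes decibri's output easy to sanity-check. The helper below is illustrative, not part of either library:

```javascript
// LINEAR16 = raw PCM, signed 16-bit (2 bytes) per sample, little-endian.
// At 16 kHz mono: 16,000 samples/s x 2 bytes = 32,000 bytes per second.
function bytesPerSecond(sampleRateHertz, channels, bitsPerSample = 16) {
  return sampleRateHertz * channels * (bitsPerSample / 8);
}

bytesPerSecond(16000, 1); // 32000, so 100 ms of audio is 3,200 bytes
```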

3. Create the recognize stream

Call streamingRecognize() to get a duplex stream: write audio into it, and it emits transcription results as data events.

const recognizeStream = client
  .streamingRecognize(request)
  .on('error', console.error)
  .on('data', (data) => {
    const result = data.results[0];
    if (result && result.alternatives[0]) {
      const transcript = result.alternatives[0].transcript;
      if (result.isFinal) {
        console.log(`\n  [final]   ${transcript}`);
      } else {
        process.stdout.write(`\r  [partial] ${transcript}                    `);
      }
    }
  });

4. Open the microphone and pipe

Create a decibri instance at 16 kHz mono and pipe it directly into the recognize stream. This is the entire integration. No bridge code, no async generator, no chunk wrapping.

const mic = new Decibri({ sampleRate: 16000, channels: 1 });
mic.pipe(recognizeStream);

5. Handle results

Google Cloud Speech-to-Text emits both interim and final results on the same stream. They are handled in the data event listener on the recognize stream (shown in step 3). Each data event contains a results array with one or more alternatives, each including a transcript string and a confidence score. Interim results may be revised as more audio arrives; a result with isFinal set to true will not change.
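If you want to unit test your result handling without a live gRPC stream, the extraction logic can be pulled into a pure function. bestTranscript below is a hypothetical helper mirroring the listener in step 3:

```javascript
// Hypothetical helper: pull the best transcript out of a `data` event.
// Returns null when the event carries no usable result.
function bestTranscript(data) {
  const result = data.results && data.results[0];
  if (!result || !result.alternatives || !result.alternatives[0]) return null;
  return {
    transcript: result.alternatives[0].transcript,
    isFinal: Boolean(result.isFinal),
  };
}
```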

6. Clean shutdown

Exit when the user presses Ctrl+C. Ending the process releases the microphone.

process.on('SIGINT', () => {
  console.log('\nStopping...');
  process.exit(0);
});
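If you prefer Google to receive a clean end-of-audio signal before exit, a slightly more graceful variant is possible. This is a sketch: shutdown is a hypothetical helper, and the handler assumes mic and recognizeStream are in scope; it uses only the standard unpipe()/end() stream API.

```javascript
// Sketch: detach the microphone and end the recognize stream before exiting,
// so the gRPC stream closes cleanly instead of being cut off mid-write.
function shutdown(mic, recognizeStream) {
  mic.unpipe(recognizeStream);   // stop feeding audio
  recognizeStream.end();         // signal end-of-audio to Google
}

process.on('SIGINT', () => {
  console.log('\nStopping...');
  shutdown(mic, recognizeStream); // assumes both are in scope here
  process.exit(0);
});
```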

Full example

'use strict';
require('dotenv').config();

const Decibri = require('decibri');
const speech = require('@google-cloud/speech');

async function main() {
  // Create client (uses GOOGLE_APPLICATION_CREDENTIALS env var)
  const client = new speech.SpeechClient();

  // Configure streaming recognition
  const request = {
    config: {
      encoding: 'LINEAR16',
      sampleRateHertz: 16000,
      languageCode: 'en-US',
    },
    interimResults: true,
  };

  // Create the recognize stream
  const recognizeStream = client
    .streamingRecognize(request)
    .on('error', console.error)
    .on('data', (data) => {
      const result = data.results[0];
      if (result && result.alternatives[0]) {
        const transcript = result.alternatives[0].transcript;
        if (result.isFinal) {
          console.log(`\n  [final]   ${transcript}`);
        } else {
          process.stdout.write(`\r  [partial] ${transcript}                    `);
        }
      }
    });

  // Open microphone and pipe to Google
  const mic = new Decibri({ sampleRate: 16000, channels: 1 });
  mic.pipe(recognizeStream);

  console.log('Google Speech-to-Text streaming test');
  console.log('Speak into your microphone. Press Ctrl+C to stop.\n');
}

// ── Cleanup on Ctrl+C ──────────────────────────────────────
process.on('SIGINT', () => {
  console.log('\nStopping...');
  process.exit(0);
});

main().catch(console.error);

Configuration options

The request config controls how Google processes your audio. Here are the most useful options:

| Option | Default | Description |
| --- | --- | --- |
| encoding | (required) | Audio encoding. Use 'LINEAR16' for decibri's Int16 output. |
| sampleRateHertz | (required) | Sample rate in Hz. Use 16000 to match decibri's default. Supports 8000–48000 Hz. |
| languageCode | (required) | BCP-47 language code (e.g. 'en-US', 'fr-FR', 'de-DE'). Supports 125+ languages. |
| interimResults | false | Set true for partial results as speech is recognized. |
| model | 'default' | Recognition model. Options: 'default', 'latest_long', 'latest_short', 'phone_call', 'video'. |
| singleUtterance | false | Set true to stop after detecting the end of a single spoken phrase. Useful for voice commands. |

See the Google Cloud Speech-to-Text streaming API reference for the complete list of options.
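Putting a few of these options together, here is an example request tuned for short voice commands. One assumption to verify: the 'latest_short' model is not available for every languageCode, so check Google's model documentation for your language.

```javascript
// Example: recognition request for short voice commands.
const commandRequest = {
  config: {
    encoding: 'LINEAR16',
    sampleRateHertz: 16000,
    languageCode: 'en-US',
    model: 'latest_short',   // optimized for short utterances
  },
  interimResults: false,     // only final results, no partials
  singleUtterance: true,     // stop after one spoken phrase
};
```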

Free tier: New Google Cloud accounts get $300 in credit. After that, the first 60 minutes of audio per month are free.