ElevenLabs ASR

ElevenLabs Automatic Speech Recognition (ASR) API Reference

The ElevenLabs ASR API provides real-time speech-to-text transcription using ElevenLabs' Scribe v2 model. This WebSocket-based endpoint enables low-latency streaming recognition for live audio processing with voice activity detection and partial transcript delivery.

Base URL: wss://api.openmind.com

Authentication: Requires an OpenMind API key passed as a query parameter.

Endpoint Overview

| Protocol | Endpoint | Description |
| --- | --- | --- |
| WebSocket | /api/core/elevenlabs/asr | Real-time speech recognition with ElevenLabs Scribe v2 |

Note: The endpoint also supports the /api/core/v1/ prefix for API versioning (e.g., /api/core/v1/elevenlabs/asr).

WebSocket Connection

Establish a persistent WebSocket connection for streaming audio data and receiving real-time transcription results.

Endpoint: wss://api.openmind.com/api/core/elevenlabs/asr?api_key=YOUR_API_KEY

Connection Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| api_key | string | Yes | Your OpenMind API key for authentication |
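
As a minimal sketch of how a client might assemble the connection URL from the base URL, endpoint path, and api_key query parameter described above: the helper name build_asr_url and its versioned flag are hypothetical and not part of the API.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.openmind.com"


def build_asr_url(api_key: str, versioned: bool = False) -> str:
    """Return the ASR WebSocket URL with the API key as a query parameter.

    `versioned=True` selects the /api/core/v1/ prefix mentioned in the
    endpoint overview. This helper is illustrative only.
    """
    if not api_key:
        raise ValueError("api_key is required")
    prefix = "/api/core/v1" if versioned else "/api/core"
    # urlencode percent-escapes any characters that are unsafe in a query string
    query = urlencode({"api_key": api_key})
    return f"{BASE_URL}{prefix}/elevenlabs/asr?{query}"
```

Escaping the key with urlencode keeps the URL valid even if a key ever contains reserved characters.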

Connection Example

# Using wscat (install with: npm install -g wscat)
wscat -c "wss://api.openmind.com/api/core/elevenlabs/asr?api_key=om1_live_your_api_key"

Connection Response

Upon successful connection, the server sends a confirmation message with type "connection".

Connection Errors

401 Unauthorized - Missing API Key:

401 Unauthorized - Invalid API Key:

Sending Audio Data

Message Format

Send audio data as JSON messages over the WebSocket connection:

Message Fields

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| audio | string | Yes | - | Base64-encoded raw PCM audio data |
| rate | integer | No | 16000 | Audio sample rate in Hz. Supported values: 8000, 16000, 22050, 44100 |
| language_code | string | No | "auto" | BCP-47 language code (e.g., "en", "es", "fr"). Use "auto" or omit for automatic language detection |

Note: The rate and language_code parameters only need to be sent with the first message. Subsequent messages can contain only the audio field.
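
The message fields above can be sketched as a small builder. The function name build_audio_message is hypothetical; only the audio, rate, and language_code keys come from the field table.

```python
import base64
import json
from typing import Optional


def build_audio_message(
    pcm_chunk: bytes,
    rate: Optional[int] = None,
    language_code: Optional[str] = None,
) -> str:
    """Serialize one chunk of raw PCM audio as a JSON message.

    Pass `rate` and `language_code` for the first message only; later
    messages can carry just the `audio` field.
    """
    message = {"audio": base64.b64encode(pcm_chunk).decode("ascii")}
    if rate is not None:
        message["rate"] = rate
    if language_code is not None:
        message["language_code"] = language_code
    return json.dumps(message)
```

A first message might be built with build_audio_message(chunk, rate=16000, language_code="en"), and every following one with build_audio_message(chunk).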

Audio Format Mapping

The sample rate determines the PCM format sent to ElevenLabs:

| Sample Rate (Hz) | PCM Format |
| --- | --- |
| 8000 | pcm_8000 |
| 16000 | pcm_16000 (default) |
| 22050 | pcm_22050 |
| 44100 | pcm_44100 |

Recommendation: Use 16000 Hz for the best balance of quality and bandwidth.
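
The mapping itself is applied by the service, but a client can mirror it to reject unsupported sample rates before sending any audio. The sketch below assumes nothing beyond the table above; PCM_FORMATS and pcm_format_for_rate are hypothetical names.

```python
# Client-side mirror of the sample-rate table; the real mapping happens server-side.
PCM_FORMATS = {
    8000: "pcm_8000",
    16000: "pcm_16000",
    22050: "pcm_22050",
    44100: "pcm_44100",
}


def pcm_format_for_rate(rate: int = 16000) -> str:
    """Return the PCM format name for a supported sample rate."""
    try:
        return PCM_FORMATS[rate]
    except KeyError:
        supported = ", ".join(str(r) for r in sorted(PCM_FORMATS))
        raise ValueError(
            f"Unsupported sample rate {rate}; expected one of: {supported}"
        ) from None
```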

Receiving Transcription Results

The service delivers two types of transcription events as results become available.

Partial Transcript

Intermediate, in-progress transcription result emitted as the user speaks:

Committed Transcript

Final, committed transcription result for a completed utterance:

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| asr_reply | string | Transcription text (partial or final) |
| clientId | string | Unique identifier for the WebSocket session |
| type | string | Message type: "connection", "partial", "error", "info" (absent on committed transcripts) |
| message | string | Human-readable message for connection, info, or error events |
| time | integer | Unix timestamp in milliseconds when the result was produced |
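
Because committed transcripts carry no type field, a client has to distinguish them by the presence of asr_reply. The following is a minimal sketch of that dispatch, based only on the response fields listed above; classify_message is a hypothetical helper, and exact payload shapes may differ from what it assumes.

```python
import json
from typing import Tuple


def classify_message(raw: str) -> Tuple[str, str]:
    """Return (kind, text) for one server message.

    kind is "committed", "partial", "connection", "info", "error",
    or "unknown". text is the transcript or the human-readable message.
    """
    payload = json.loads(raw)
    msg_type = payload.get("type")

    # Committed transcripts have asr_reply but no type field.
    if msg_type is None and "asr_reply" in payload:
        return ("committed", payload["asr_reply"])
    if msg_type == "partial":
        return ("partial", payload.get("asr_reply", ""))
    if msg_type in ("connection", "info", "error"):
        return (msg_type, payload.get("message", ""))
    return ("unknown", "")
```

Partial results are typically used to update a live caption, while committed results are the ones to persist or act on.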

Info Messages

The server may send informational messages during operation, such as when a recognition session is automatically restarted:

Error Messages

Session Limits

| Limit | Value | Description |
| --- | --- | --- |
| Max streaming duration | 5 minutes | Each internal ElevenLabs session is capped at 5 minutes; the session automatically restarts |
| Silence timeout | Configurable | The connection closes after a period of silence with no detected speech |

When the 5-minute streaming limit is reached, the session is seamlessly restarted and an "info" message is sent to the client. Non-recoverable errors will close the WebSocket.

Audio Specifications

Supported Audio Format

  • Encoding: Raw PCM (signed 16-bit little-endian)

  • Sample Rate: 8000, 16000 (recommended), 22050, or 44100 Hz

  • Channels: Mono (1 channel)

Calculating Audio Length

Audio duration is calculated as:

$$\text{duration (s)} = \frac{\text{audio bytes}}{\text{sample rate} \times 2 \times 1}$$

For 16000 Hz mono 16-bit PCM:

$$\text{duration (s)} = \frac{\text{audio bytes}}{32000}$$
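
The same calculation as a short sketch, where the factor 2 is the bytes per 16-bit sample and the factor 1 is the mono channel count. The function name pcm_duration_seconds is hypothetical.

```python
def pcm_duration_seconds(
    num_bytes: int,
    sample_rate: int = 16000,
    bytes_per_sample: int = 2,  # signed 16-bit PCM
    channels: int = 1,          # mono
) -> float:
    """Return the duration in seconds of a raw PCM buffer."""
    return num_bytes / (sample_rate * bytes_per_sample * channels)
```

For example, a 32000-byte buffer at the default 16000 Hz corresponds to one second of audio.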

Usage Examples

Python Example
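
A minimal sketch of a streaming client, assuming the third-party websockets package (pip install websockets) and a file of raw 16 kHz, 16-bit, mono PCM. The 100 ms chunk size, the pacing delay, and the names iter_chunks and stream_pcm_file are illustrative assumptions, not requirements of the API.

```python
import asyncio
import base64
import json
from typing import Iterator

ASR_URL = "wss://api.openmind.com/api/core/elevenlabs/asr"
SAMPLE_RATE = 16000
CHUNK_BYTES = SAMPLE_RATE * 2 // 10  # assumed chunk size: 100 ms of 16-bit mono audio


def iter_chunks(pcm: bytes, size: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


async def stream_pcm_file(path: str, api_key: str) -> None:
    """Stream a raw PCM file and print transcripts as they arrive."""
    import websockets  # third-party dependency, imported lazily

    with open(path, "rb") as f:
        pcm = f.read()

    async with websockets.connect(f"{ASR_URL}?api_key={api_key}") as ws:

        async def send_audio() -> None:
            for i, chunk in enumerate(iter_chunks(pcm)):
                msg = {"audio": base64.b64encode(chunk).decode("ascii")}
                if i == 0:
                    # rate and language_code are only needed on the first message
                    msg["rate"] = SAMPLE_RATE
                    msg["language_code"] = "en"
                await ws.send(json.dumps(msg))
                await asyncio.sleep(0.1)  # pace the upload at roughly real time

        sender = asyncio.create_task(send_audio())
        try:
            # Runs until the server closes the connection (e.g. silence timeout).
            async for raw in ws:
                reply = json.loads(raw)
                if "asr_reply" in reply:
                    label = "partial" if reply.get("type") == "partial" else "final"
                    print(f"[{label}] {reply['asr_reply']}")
                elif reply.get("type") == "error":
                    print("error:", reply.get("message"))
                    break
        finally:
            sender.cancel()
```

To try it, call asyncio.run(stream_pcm_file("speech.raw", "om1_live_your_api_key")) with a real key and a recording in the supported format. Sending and receiving run concurrently so that partial transcripts can be printed while audio is still being uploaded.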

JavaScript/Node.js Example

Using wscat (Command Line)

Recording Audio for Testing

Using SoX:

Using FFmpeg:

Language Support

Pass a BCP-47 language code in the first message to pin recognition to a specific language. Omit the field or use "auto" to let ElevenLabs detect the language automatically.

| Language | Code |
| --- | --- |
| English | en |
| Spanish | es |
| French | fr |
| German | de |
| Italian | it |
| Portuguese | pt |
| Japanese | ja |
| Korean | ko |
| Chinese (Mandarin) | zh |
| Dutch | nl |
| Polish | pl |
| Russian | ru |

Note: For a complete list of supported languages, refer to the ElevenLabs Speech-to-Text documentation.
