Overview

The Realtime API enables real-time, low-latency conversations with AI models through WebSocket connections. It supports:
  • Voice input and output
  • Text interactions
  • Function/tool calling
  • Real-time transcription
  • Multi-modal conversations

Endpoints

POST /v1/realtime/sessions
POST /v1/realtime/transcription_sessions

Authentication

Authorization: Bearer <API_KEY>

Create Realtime Session

Creates a new realtime session and returns connection details.

Request Format

Endpoint: POST /v1/realtime/sessions

Headers:
  • Authorization: Bearer YOUR_API_KEY (required)
  • Content-Type: application/json (required)
Request Body:
{
  "model": "gpt-4o-realtime-2024-10-01",
  "voice": "alloy",
  "instructions": "You are a helpful assistant.",
  "input_audio_format": "pcm16",
  "output_audio_format": "pcm16",
  "input_audio_transcription": {
    "enabled": true,
    "model": "whisper-1"
  },
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500
  },
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string"}
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto",
  "temperature": 0.8,
  "max_output_tokens": 4096
}

Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model identifier (e.g., gpt-4o-realtime-2024-10-01) |
| voice | string | No | Voice: alloy, echo, fable, onyx, nova, shimmer |
| instructions | string | No | System instructions for the assistant |
| input_audio_format | string | No | Format: pcm16, g711_ulaw, g711_alaw |
| output_audio_format | string | No | Format: pcm16, g711_ulaw, g711_alaw |
| input_audio_transcription | object | No | Transcription configuration |
| turn_detection | object | No | Voice activity detection settings |
| tools | array | No | Available functions/tools |
| tool_choice | string | No | Tool selection strategy: auto, none, required |
| temperature | float | No | Sampling temperature (0.6 to 1.2) |
| max_output_tokens | integer | No | Maximum tokens to generate |

Response Format

{
  "id": "realtime_session_abc123",
  "object": "realtime.session",
  "model": "gpt-4o-realtime-2024-10-01",
  "created_at": 1699123456,
  "expires_at": 1699127056,
  "client_secret": {
    "value": "rs_secret_abc123...",
    "expires_at": 1699127056
  }
}

Create Transcription Session

Creates a session focused on audio transcription.

Request Format

Endpoint: POST /v1/realtime/transcription_sessions

Request Body:
{
  "model": "whisper-1",
  "language": "en",
  "temperature": 0.0,
  "prompt": "Previous context for better transcription"
}

Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Transcription model (e.g., whisper-1) |
| language | string | No | ISO-639-1 language code |
| temperature | float | No | Sampling temperature (0.0 to 1.0) |
| prompt | string | No | Context for improved accuracy |
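
The usage examples later in this page only cover realtime sessions, so here is a minimal sketch of creating a transcription session with fetch, assuming the same base URL those examples use (http://localhost:3000):

// Create a transcription-only session (sketch; base URL and response
// handling assumed to match the realtime session examples below)
async function createTranscriptionSession() {
  const response = await fetch('http://localhost:3000/v1/realtime/transcription_sessions', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'whisper-1',
      language: 'en',
      temperature: 0.0
    })
  });
  if (!response.ok) throw new Error(`Session creation failed: ${response.status}`);
  return await response.json();
}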

WebSocket Connection

After creating a session, establish a WebSocket connection using the client secret.

Connection URL

wss://api.example.com/v1/realtime/ws

Connection Headers

{
  "Authorization": "Bearer rs_secret_abc123...",
  "Sec-WebSocket-Protocol": "realtime"
}

WebSocket Protocol

Client Events

Session Update:
{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "voice": "nova",
    "temperature": 0.7
  }
}
Audio Input:
{
  "type": "input_audio_buffer.append",
  "audio": "base64_encoded_audio_data"
}
Text Input:
{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "user",
    "content": [
      {
        "type": "input_text",
        "text": "Hello, how are you?"
      }
    ]
  }
}
Generate Response:
{
  "type": "response.create",
  "response": {
    "modalities": ["text", "audio"]
  }
}

Server Events

Session Created:
{
  "type": "session.created",
  "session": {
    "id": "sess_abc123",
    "model": "gpt-4o-realtime-2024-10-01",
    "modalities": ["text", "audio"],
    "voice": "alloy"
  }
}
Audio Transcript:
{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_abc123",
  "content_index": 0,
  "transcript": "Hello, how are you?"
}
Audio Delta:
{
  "type": "response.audio.delta",
  "response_id": "resp_abc123",
  "item_id": "item_def456",
  "output_index": 0,
  "content_index": 0,
  "delta": "base64_encoded_audio_chunk"
}
Text Delta:
{
  "type": "response.text.delta",
  "response_id": "resp_abc123",
  "item_id": "item_def456",
  "output_index": 0,
  "content_index": 0,
  "delta": "I'm doing well, thank you!"
}
Function Call:
{
  "type": "response.function_call_arguments.done",
  "response_id": "resp_abc123",
  "item_id": "item_def456",
  "output_index": 0,
  "call_id": "call_ghi789",
  "name": "get_weather",
  "arguments": "{\"location\": \"San Francisco\"}"
}
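The event above only tells the client which function to run. To return the result, the usual pattern is to create an output item via the conversation.item.create event documented earlier and then request a fresh response (the function_call_output item type is an assumption borrowed from OpenAI's Realtime protocol; verify it against your deployment):
{
  "type": "conversation.item.create",
  "item": {
    "type": "function_call_output",
    "call_id": "call_ghi789",
    "output": "{\"temperature\": \"18C\", \"conditions\": \"sunny\"}"
  }
}
followed by:
{
  "type": "response.create"
}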
Error Event:
{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_audio_format",
    "message": "Audio format not supported"
  }
}

Usage Examples

Python Example

import asyncio
import websockets
import json
import base64
import requests

# Create session
response = requests.post(
    "http://localhost:3000/v1/realtime/sessions",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "model": "gpt-4o-realtime-2024-10-01",
        "voice": "nova",
        "instructions": "You are a helpful voice assistant.",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 500
        }
    }
)

session = response.json()
client_secret = session["client_secret"]["value"]

# WebSocket connection
async def realtime_conversation():
    uri = "wss://localhost:3000/v1/realtime/ws"
    headers = {
        "Authorization": f"Bearer {client_secret}",
        "Sec-WebSocket-Protocol": "realtime"
    }

    async with websockets.connect(uri, extra_headers=headers) as websocket:
        # Send audio input
        with open("audio_input.raw", "rb") as f:
            audio_data = f.read()
            await websocket.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_data).decode()
            }))

        # Commit audio buffer
        await websocket.send(json.dumps({
            "type": "input_audio_buffer.commit"
        }))

        # Create response
        await websocket.send(json.dumps({
            "type": "response.create"
        }))

        # Handle server events
        while True:
            message = await websocket.recv()
            event = json.loads(message)

            if event["type"] == "response.audio.delta":
                # Process audio chunk
                audio_chunk = base64.b64decode(event["delta"])
                # Play or save audio chunk

            elif event["type"] == "response.text.delta":
                # Process text chunk
                print(event["delta"], end="", flush=True)

            elif event["type"] == "response.done":
                break

asyncio.run(realtime_conversation())

JavaScript Example

const WebSocket = require('ws');

// Create session
async function createSession() {
  const response = await fetch('http://localhost:3000/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'gpt-4o-realtime-2024-10-01',
      voice: 'echo',
      instructions: 'You are a helpful assistant.',
      input_audio_format: 'pcm16',
      output_audio_format: 'pcm16'
    })
  });

  return await response.json();
}

// WebSocket conversation
async function startConversation() {
  const session = await createSession();
  const clientSecret = session.client_secret.value;

  // ws:// for local development; use wss:// for deployed endpoints
  const ws = new WebSocket('ws://localhost:3000/v1/realtime/ws', {
    headers: {
      'Authorization': `Bearer ${clientSecret}`,
      'Sec-WebSocket-Protocol': 'realtime'
    }
  });

  ws.on('open', () => {
    // Send text message
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{
          type: 'input_text',
          text: 'Tell me a joke!'
        }]
      }
    }));

    // Generate response
    ws.send(JSON.stringify({
      type: 'response.create',
      response: {
        modalities: ['text', 'audio']
      }
    }));
  });

  ws.on('message', (data) => {
    const event = JSON.parse(data);

    switch(event.type) {
      case 'response.text.delta':
        process.stdout.write(event.delta);
        break;

      case 'response.audio.delta':
        // Handle audio data
        const audioChunk = Buffer.from(event.delta, 'base64');
        // Play or save audio
        break;

      case 'response.done':
        console.log('\nResponse complete');
        ws.close();
        break;

      case 'error':
        console.error('Error:', event.error.message);
        break;
    }
  });
}

startConversation().catch(console.error);

Browser Example

// Create session from backend
async function getSessionCredentials() {
  const response = await fetch('/api/create-realtime-session', {
    headers: {
      'Authorization': 'Bearer YOUR_USER_TOKEN'
    }
  });
  return await response.json();
}

// Browser WebSocket connection
async function startVoiceChat() {
  const { clientSecret, wsUrl } = await getSessionCredentials();

  // Browsers cannot set an Authorization header on WebSocket upgrades;
  // how the client secret is delivered (a query parameter below) is an
  // assumption, so check what your server expects
  const ws = new WebSocket(`${wsUrl}?client_secret=${clientSecret}`, ['realtime']);
  ws.binaryType = 'arraybuffer';

  // Get user media. Note: MediaRecorder emits compressed audio (e.g.
  // webm/opus), not raw PCM16; if the session expects pcm16, capture raw
  // samples with an AudioWorklet instead, or configure a matching format.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mediaRecorder = new MediaRecorder(stream);

  mediaRecorder.ondataavailable = async (event) => {
    if (event.data.size > 0) {
      const arrayBuffer = await event.data.arrayBuffer();
      // Base64-encode in slices; spreading a large Uint8Array into
      // String.fromCharCode can overflow the call stack
      const bytes = new Uint8Array(arrayBuffer);
      let binary = '';
      for (let i = 0; i < bytes.length; i += 8192) {
        binary += String.fromCharCode(...bytes.subarray(i, i + 8192));
      }
      const base64Audio = btoa(binary);

      ws.send(JSON.stringify({
        type: 'input_audio_buffer.append',
        audio: base64Audio
      }));
    }
  };

  ws.onopen = () => {
    // Configure session
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        input_audio_transcription: { enabled: true }
      }
    }));

    // Start recording
    mediaRecorder.start(100); // Send chunks every 100ms
  };

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);

    if (data.type === 'response.audio.delta') {
      // Play audio chunk
      playAudioChunk(data.delta);
    } else if (data.type === 'conversation.item.input_audio_transcription.completed') {
      console.log('You said:', data.transcript);
    }
  };

  // Stop recording on button click
  document.getElementById('stop-btn').onclick = () => {
    mediaRecorder.stop();
    ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
    ws.send(JSON.stringify({ type: 'response.create' }));
  };
}
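
The playAudioChunk helper is referenced above but not defined. One possible sketch uses the Web Audio API, assuming 24 kHz mono PCM16 output (the sample rate is an assumption; match it to your session's output_audio_format):

// Decode base64 PCM16 and schedule it for gapless playback.
// The 24000 Hz mono format is an assumption; adjust as needed.
const audioCtx = new AudioContext({ sampleRate: 24000 });
let playbackTime = 0;

function playAudioChunk(base64Audio) {
  const bytes = Uint8Array.from(atob(base64Audio), c => c.charCodeAt(0));
  const samples = new Int16Array(bytes.buffer);

  // Convert 16-bit signed integers to float samples in [-1, 1)
  const floats = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    floats[i] = samples[i] / 32768;
  }

  const buffer = audioCtx.createBuffer(1, floats.length, audioCtx.sampleRate);
  buffer.copyToChannel(floats, 0);

  // Schedule each chunk to start exactly where the previous one ends
  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(audioCtx.destination);
  playbackTime = Math.max(playbackTime, audioCtx.currentTime);
  source.start(playbackTime);
  playbackTime += buffer.duration;
}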

Turn Detection

Server VAD (Voice Activity Detection)

{
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500
  }
}

Manual Turn Detection

{
  "turn_detection": {
    "type": "none"
  }
}
With manual detection, explicitly commit audio and create responses:
{
  "type": "input_audio_buffer.commit"
}
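Putting the manual flow together over an open WebSocket connection (same ws object as the JavaScript example above):
// Manual turn detection: the client decides when the turn ends.
// 1. Stream audio chunks
ws.send(JSON.stringify({
  type: 'input_audio_buffer.append',
  audio: base64AudioChunk  // your base64-encoded audio
}));

// 2. End the turn by committing the buffer
ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));

// 3. Explicitly ask the model to respond
ws.send(JSON.stringify({ type: 'response.create' }));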

Error Handling

Connection Errors

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
  // Implement reconnection logic
};

ws.onclose = (event) => {
  if (event.code === 1006) {
    console.error('Abnormal closure (network drop, timeout, or auth failure)');
  }
};
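
The reconnection logic mentioned above might look like this sketch with exponential backoff, where connectAndResume is a hypothetical helper that creates a fresh session (client secrets expire) and re-registers the handlers shown earlier:

// Exponential-backoff reconnect sketch. connectAndResume() is a
// hypothetical helper: it should create a new session and reattach
// the onmessage/onerror/onclose handlers.
let retries = 0;
const MAX_RETRIES = 5;

function scheduleReconnect() {
  if (retries >= MAX_RETRIES) {
    console.error('Giving up after', retries, 'reconnect attempts');
    return;
  }
  const delayMs = Math.min(1000 * 2 ** retries, 30000);
  retries += 1;
  setTimeout(() => {
    connectAndResume()
      .then(() => { retries = 0; })   // reset backoff on success
      .catch(scheduleReconnect);      // failed again, back off further
  }, delayMs);
}

// Call scheduleReconnect() from the onclose handler above for any
// close code other than 1000 (normal closure).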

API Errors

Common error types:
  • invalid_request_error: Malformed request
  • authentication_error: Invalid credentials
  • rate_limit_error: Too many requests
  • server_error: Internal server error
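
A sketch of surfacing these when creating a session, assuming the HTTP error body mirrors the WebSocket error event shape ({ "error": { "type", "message" } }):

// Surface API errors from session creation; the error body shape is
// an assumption, mirroring the WebSocket error event above.
async function createSessionSafely(body, retriesLeft = 3) {
  const response = await fetch('http://localhost:3000/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(body)
  });

  if (!response.ok) {
    const { error } = await response.json();
    if (error.type === 'rate_limit_error' && retriesLeft > 0) {
      // Back off before retrying rate-limited requests
      await new Promise(resolve => setTimeout(resolve, 2000));
      return createSessionSafely(body, retriesLeft - 1);
    }
    throw new Error(`${error.type}: ${error.message}`);
  }
  return await response.json();
}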

Best Practices

  • Audio Format: Use PCM16 for best quality and compatibility
  • Chunking: Send audio in small chunks (100-200ms) for smooth streaming; see the sketch after this list
  • Error Recovery: Implement reconnection logic for network failures
  • Resource Management: Close sessions when done to free resources
  • Turn Detection: Use server VAD for natural conversations
  • Buffering: Buffer audio output for smooth playback
  • Monitoring: Track latency and audio quality metrics
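
For the chunking point above, a sketch of slicing a PCM16 buffer into roughly 100 ms pieces before sending (24 kHz mono is assumed; scale the chunk size to your actual sample rate):

// Slice a PCM16 buffer into ~100 ms chunks and stream them.
// 24000 samples/s * 2 bytes/sample * 0.1 s = 4800 bytes per chunk
// (the 24 kHz mono rate is an assumption; match your input format).
const BYTES_PER_CHUNK = 4800;

function streamAudio(ws, pcmBuffer) {
  for (let offset = 0; offset < pcmBuffer.length; offset += BYTES_PER_CHUNK) {
    const chunk = pcmBuffer.slice(offset, offset + BYTES_PER_CHUNK);
    ws.send(JSON.stringify({
      type: 'input_audio_buffer.append',
      audio: chunk.toString('base64')  // Node Buffer; use btoa-style encoding in browsers
    }));
  }
}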

Limitations

  • Session Duration: Maximum 60 minutes per session
  • Concurrent Sessions: 10 sessions per API key
  • Rate Limits: 100 session creations per minute
  • Audio Formats: Limited to PCM16, G711 μ-law, G711 A-law
  • Model Support: Only specific models support realtime features

Supported providers

Realtime Sessions

OpenAI

Full support for realtime audio/text sessions with GPT-4o Realtime models, voice capabilities, and function calling.

Realtime Transcription

OpenAI

Realtime transcription capabilities with low-latency speech-to-text processing.