Overview

The Realtime API enables real-time, low-latency conversations with AI models through WebSocket connections. It supports:
  • Voice input and output
  • Text interactions
  • Function/tool calling
  • Real-time transcription
  • Multi-modal conversations

Endpoints

POST /v1/realtime/sessions
POST /v1/realtime/transcription_sessions

Authentication

Authorization: Bearer <API_KEY>

Create Realtime Session

Creates a new realtime session and returns connection details.

Request Format

Endpoint: POST /v1/realtime/sessions

Headers:
  • Authorization: Bearer YOUR_API_KEY (required)
  • Content-Type: application/json (required)
Request Body:
{
  "model": "gpt-4o-realtime-2024-10-01",
  "voice": "alloy",
  "instructions": "You are a helpful assistant.",
  "input_audio_format": "pcm16",
  "output_audio_format": "pcm16",
  "input_audio_transcription": {
    "enabled": true,
    "model": "whisper-1"
  },
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500
  },
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string"}
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto",
  "temperature": 0.8,
  "max_output_tokens": 4096
}

Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model identifier (e.g., gpt-4o-realtime-2024-10-01) |
| voice | string | No | Voice: alloy, echo, fable, onyx, nova, shimmer |
| instructions | string | No | System instructions for the assistant |
| input_audio_format | string | No | Format: pcm16, g711_ulaw, g711_alaw |
| output_audio_format | string | No | Format: pcm16, g711_ulaw, g711_alaw |
| input_audio_transcription | object | No | Transcription configuration |
| turn_detection | object | No | Voice activity detection settings |
| tools | array | No | Available functions/tools |
| tool_choice | string | No | Tool selection strategy: auto, none, required |
| temperature | float | No | Sampling temperature (0.6 to 1.2) |
| max_output_tokens | integer | No | Maximum tokens to generate |

Response Format

{
  "id": "realtime_session_abc123",
  "object": "realtime.session",
  "model": "gpt-4o-realtime-2024-10-01",
  "created_at": 1699123456,
  "expires_at": 1699127056,
  "client_secret": {
    "value": "rs_secret_abc123...",
    "expires_at": 1699127056
  }
}

Create Transcription Session

Creates a session focused on audio transcription.

Request Format

Endpoint: POST /v1/realtime/transcription_sessions

Request Body:
{
  "model": "whisper-1",
  "language": "en",
  "temperature": 0.0,
  "prompt": "Previous context for better transcription"
}

Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Transcription model (e.g., whisper-1) |
| language | string | No | ISO-639-1 language code |
| temperature | float | No | Sampling temperature (0.0 to 1.0) |
| prompt | string | No | Context for improved accuracy |
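
The usage examples later in this page only cover realtime sessions, so here is a minimal sketch of creating a transcription session with fetch, assuming the same base URL those examples use (http://localhost:3000):

// Create a transcription-only session (sketch; base URL and response
// handling assumed to match the realtime session examples below)
async function createTranscriptionSession() {
  const response = await fetch('http://localhost:3000/v1/realtime/transcription_sessions', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'whisper-1',
      language: 'en',
      temperature: 0.0
    })
  });
  if (!response.ok) throw new Error(`Session creation failed: ${response.status}`);
  return await response.json();
}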

WebSocket Connection

After creating a session, establish a WebSocket connection using the client secret.

Connection URL

wss://api.example.com/v1/realtime/ws

Connection Headers

{
  "Authorization": "Bearer rs_secret_abc123...",
  "Sec-WebSocket-Protocol": "realtime"
}

WebSocket Protocol

Client Events

Session Update:
{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "voice": "nova",
    "temperature": 0.7
  }
}
Audio Input:
{
  "type": "input_audio_buffer.append",
  "audio": "base64_encoded_audio_data"
}
Text Input:
{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "user",
    "content": [
      {
        "type": "input_text",
        "text": "Hello, how are you?"
      }
    ]
  }
}
Generate Response:
{
  "type": "response.create",
  "response": {
    "modalities": ["text", "audio"]
  }
}

Server Events

Session Created:
{
  "type": "session.created",
  "session": {
    "id": "sess_abc123",
    "model": "gpt-4o-realtime-2024-10-01",
    "modalities": ["text", "audio"],
    "voice": "alloy"
  }
}
Audio Transcript:
{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_abc123",
  "content_index": 0,
  "transcript": "Hello, how are you?"
}
Audio Delta:
{
  "type": "response.audio.delta",
  "response_id": "resp_abc123",
  "item_id": "item_def456",
  "output_index": 0,
  "content_index": 0,
  "delta": "base64_encoded_audio_chunk"
}
Text Delta:
{
  "type": "response.text.delta",
  "response_id": "resp_abc123",
  "item_id": "item_def456",
  "output_index": 0,
  "content_index": 0,
  "delta": "I'm doing well, thank you!"
}
Function Call:
{
  "type": "response.function_call_arguments.done",
  "response_id": "resp_abc123",
  "item_id": "item_def456",
  "output_index": 0,
  "call_id": "call_ghi789",
  "name": "get_weather",
  "arguments": "{\"location\": \"San Francisco\"}"
}
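The event above only tells the client which function to run. To return the result, the usual pattern is to create an output item via the conversation.item.create event documented earlier and then request a fresh response (the function_call_output item type is an assumption borrowed from OpenAI's Realtime protocol; verify it against your deployment):
{
  "type": "conversation.item.create",
  "item": {
    "type": "function_call_output",
    "call_id": "call_ghi789",
    "output": "{\"temperature\": \"18C\", \"conditions\": \"sunny\"}"
  }
}
followed by:
{
  "type": "response.create"
}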
Error Event:
{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_audio_format",
    "message": "Audio format not supported"
  }
}

Usage Examples

Python Example

import asyncio
import websockets
import json
import base64
import requests

# Create session
response = requests.post(
    "http://localhost:3000/v1/realtime/sessions",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "model": "gpt-4o-realtime-2024-10-01",
        "voice": "nova",
        "instructions": "You are a helpful voice assistant.",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 500
        }
    }
)

session = response.json()
client_secret = session["client_secret"]["value"]

# WebSocket connection
async def realtime_conversation():
    uri = "wss://localhost:3000/v1/realtime/ws"
    headers = {
        "Authorization": f"Bearer {client_secret}",
        "Sec-WebSocket-Protocol": "realtime"
    }

    async with websockets.connect(uri, extra_headers=headers) as websocket:
        # Send audio input
        with open("audio_input.raw", "rb") as f:
            audio_data = f.read()
            await websocket.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_data).decode()
            }))

        # Commit audio buffer
        await websocket.send(json.dumps({
            "type": "input_audio_buffer.commit"
        }))

        # Create response
        await websocket.send(json.dumps({
            "type": "response.create"
        }))

        # Handle server events
        while True:
            message = await websocket.recv()
            event = json.loads(message)

            if event["type"] == "response.audio.delta":
                # Process audio chunk
                audio_chunk = base64.b64decode(event["delta"])
                # Play or save audio chunk

            elif event["type"] == "response.text.delta":
                # Process text chunk
                print(event["delta"], end="", flush=True)

            elif event["type"] == "response.done":
                break

asyncio.run(realtime_conversation())

JavaScript Example

const WebSocket = require('ws');

// Create session
async function createSession() {
  const response = await fetch('http://localhost:3000/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'gpt-4o-realtime-2024-10-01',
      voice: 'echo',
      instructions: 'You are a helpful assistant.',
      input_audio_format: 'pcm16',
      output_audio_format: 'pcm16'
    })
  });

  return await response.json();
}

// WebSocket conversation
async function startConversation() {
  const session = await createSession();
  const clientSecret = session.client_secret.value;

  // ws:// for local development; use wss:// for deployed endpoints
  const ws = new WebSocket('ws://localhost:3000/v1/realtime/ws', {
    headers: {
      'Authorization': `Bearer ${clientSecret}`,
      'Sec-WebSocket-Protocol': 'realtime'
    }
  });

  ws.on('open', () => {
    // Send text message
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{
          type: 'input_text',
          text: 'Tell me a joke!'
        }]
      }
    }));

    // Generate response
    ws.send(JSON.stringify({
      type: 'response.create',
      response: {
        modalities: ['text', 'audio']
      }
    }));
  });

  ws.on('message', (data) => {
    const event = JSON.parse(data);

    switch(event.type) {
      case 'response.text.delta':
        process.stdout.write(event.delta);
        break;

      case 'response.audio.delta':
        // Handle audio data
        const audioChunk = Buffer.from(event.delta, 'base64');
        // Play or save audio
        break;

      case 'response.done':
        console.log('\nResponse complete');
        ws.close();
        break;

      case 'error':
        console.error('Error:', event.error.message);
        break;
    }
  });
}

startConversation().catch(console.error);

Browser Example

// Create session from backend
async function getSessionCredentials() {
  const response = await fetch('/api/create-realtime-session', {
    headers: {
      'Authorization': 'Bearer YOUR_USER_TOKEN'
    }
  });
  return await response.json();
}

// Browser WebSocket connection
async function startVoiceChat() {
  const { clientSecret, wsUrl } = await getSessionCredentials();

  // Browsers cannot set an Authorization header on WebSocket upgrades;
  // how the client secret is delivered (a query parameter below) is an
  // assumption, so check what your server expects
  const ws = new WebSocket(`${wsUrl}?client_secret=${clientSecret}`, ['realtime']);
  ws.binaryType = 'arraybuffer';

  // Get user media. Note: MediaRecorder emits compressed audio (e.g.
  // webm/opus), not raw PCM16; if the session expects pcm16, capture raw
  // samples with an AudioWorklet instead, or configure a matching format.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mediaRecorder = new MediaRecorder(stream);

  mediaRecorder.ondataavailable = async (event) => {
    if (event.data.size > 0) {
      const arrayBuffer = await event.data.arrayBuffer();
      // Base64-encode in slices; spreading a large Uint8Array into
      // String.fromCharCode can overflow the call stack
      const bytes = new Uint8Array(arrayBuffer);
      let binary = '';
      for (let i = 0; i < bytes.length; i += 8192) {
        binary += String.fromCharCode(...bytes.subarray(i, i + 8192));
      }
      const base64Audio = btoa(binary);

      ws.send(JSON.stringify({
        type: 'input_audio_buffer.append',
        audio: base64Audio
      }));
    }
  };

  ws.onopen = () => {
    // Configure session
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        input_audio_transcription: { enabled: true }
      }
    }));

    // Start recording
    mediaRecorder.start(100); // Send chunks every 100ms
  };

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);

    if (data.type === 'response.audio.delta') {
      // Play audio chunk
      playAudioChunk(data.delta);
    } else if (data.type === 'conversation.item.input_audio_transcription.completed') {
      console.log('You said:', data.transcript);
    }
  };

  // Stop recording on button click
  document.getElementById('stop-btn').onclick = () => {
    mediaRecorder.stop();
    ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
    ws.send(JSON.stringify({ type: 'response.create' }));
  };
}
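
The playAudioChunk helper is referenced above but not defined. One possible sketch uses the Web Audio API, assuming 24 kHz mono PCM16 output (the sample rate is an assumption; match it to your session's output_audio_format):

// Decode base64 PCM16 and schedule it for gapless playback.
// The 24000 Hz mono format is an assumption; adjust as needed.
const audioCtx = new AudioContext({ sampleRate: 24000 });
let playbackTime = 0;

function playAudioChunk(base64Audio) {
  const bytes = Uint8Array.from(atob(base64Audio), c => c.charCodeAt(0));
  const samples = new Int16Array(bytes.buffer);

  // Convert 16-bit signed integers to float samples in [-1, 1)
  const floats = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    floats[i] = samples[i] / 32768;
  }

  const buffer = audioCtx.createBuffer(1, floats.length, audioCtx.sampleRate);
  buffer.copyToChannel(floats, 0);

  // Schedule each chunk to start exactly where the previous one ends
  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(audioCtx.destination);
  playbackTime = Math.max(playbackTime, audioCtx.currentTime);
  source.start(playbackTime);
  playbackTime += buffer.duration;
}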

Turn Detection

Server VAD (Voice Activity Detection)

{
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500
  }
}

Manual Turn Detection

{
  "turn_detection": {
    "type": "none"
  }
}
With manual detection, explicitly commit audio and create responses:
{
  "type": "input_audio_buffer.commit"
}
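Putting the manual flow together over an open WebSocket connection (same ws object as the JavaScript example above):
// Manual turn detection: the client decides when the turn ends.
// 1. Stream audio chunks
ws.send(JSON.stringify({
  type: 'input_audio_buffer.append',
  audio: base64AudioChunk  // your base64-encoded audio
}));

// 2. End the turn by committing the buffer
ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));

// 3. Explicitly ask the model to respond
ws.send(JSON.stringify({ type: 'response.create' }));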

Error Handling

Connection Errors

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
  // Implement reconnection logic
};

ws.onclose = (event) => {
  if (event.code === 1006) {
    console.error('Abnormal closure (network drop, timeout, or auth failure)');
  }
};
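
The reconnection logic mentioned above might look like this sketch with exponential backoff, where connectAndResume is a hypothetical helper that creates a fresh session (client secrets expire) and re-registers the handlers shown earlier:

// Exponential-backoff reconnect sketch. connectAndResume() is a
// hypothetical helper: it should create a new session and reattach
// the onmessage/onerror/onclose handlers.
let retries = 0;
const MAX_RETRIES = 5;

function scheduleReconnect() {
  if (retries >= MAX_RETRIES) {
    console.error('Giving up after', retries, 'reconnect attempts');
    return;
  }
  const delayMs = Math.min(1000 * 2 ** retries, 30000);
  retries += 1;
  setTimeout(() => {
    connectAndResume()
      .then(() => { retries = 0; })   // reset backoff on success
      .catch(scheduleReconnect);      // failed again, back off further
  }, delayMs);
}

// Call scheduleReconnect() from the onclose handler above for any
// close code other than 1000 (normal closure).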

API Errors

Common error types:
  • invalid_request_error: Malformed request
  • authentication_error: Invalid credentials
  • rate_limit_error: Too many requests
  • server_error: Internal server error
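
A sketch of surfacing these when creating a session, assuming the HTTP error body mirrors the WebSocket error event shape ({ "error": { "type", "message" } }):

// Surface API errors from session creation; the error body shape is
// an assumption, mirroring the WebSocket error event above.
async function createSessionSafely(body, retriesLeft = 3) {
  const response = await fetch('http://localhost:3000/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(body)
  });

  if (!response.ok) {
    const { error } = await response.json();
    if (error.type === 'rate_limit_error' && retriesLeft > 0) {
      // Back off before retrying rate-limited requests
      await new Promise(resolve => setTimeout(resolve, 2000));
      return createSessionSafely(body, retriesLeft - 1);
    }
    throw new Error(`${error.type}: ${error.message}`);
  }
  return await response.json();
}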

Best Practices

  • Audio Format: Use PCM16 for best quality and compatibility
  • Chunking: Send audio in small chunks (100-200ms) for smooth streaming; see the sketch after this list
  • Error Recovery: Implement reconnection logic for network failures
  • Resource Management: Close sessions when done to free resources
  • Turn Detection: Use server VAD for natural conversations
  • Buffering: Buffer audio output for smooth playback
  • Monitoring: Track latency and audio quality metrics
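
For the chunking point above, a sketch of slicing a PCM16 buffer into roughly 100 ms pieces before sending (24 kHz mono is assumed; scale the chunk size to your actual sample rate):

// Slice a PCM16 buffer into ~100 ms chunks and stream them.
// 24000 samples/s * 2 bytes/sample * 0.1 s = 4800 bytes per chunk
// (the 24 kHz mono rate is an assumption; match your input format).
const BYTES_PER_CHUNK = 4800;

function streamAudio(ws, pcmBuffer) {
  for (let offset = 0; offset < pcmBuffer.length; offset += BYTES_PER_CHUNK) {
    const chunk = pcmBuffer.slice(offset, offset + BYTES_PER_CHUNK);
    ws.send(JSON.stringify({
      type: 'input_audio_buffer.append',
      audio: chunk.toString('base64')  // Node Buffer; use btoa-style encoding in browsers
    }));
  }
}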

Limitations

  • Session Duration: Maximum 60 minutes per session
  • Concurrent Sessions: 10 sessions per API key
  • Rate Limits: 100 session creations per minute
  • Audio Formats: Limited to PCM16, G711 μ-law, G711 A-law
  • Model Support: Only specific models support realtime features

Supported providers

Realtime Sessions

OpenAI

Full support for realtime audio/text sessions with GPT-4o Realtime models, voice capabilities, and function calling.

Realtime Transcription

OpenAI

Realtime transcription capabilities with low-latency speech-to-text processing.