Overview
The Realtime API enables real-time, low-latency conversations with AI models through WebSocket connections. It supports:
- Voice input and output
- Text interactions
- Function/tool calling
- Real-time transcription
- Multi-modal conversations
Endpoints
Authentication
Create Realtime Session
Creates a new realtime session and returns connection details.
Request Format
Endpoint: POST /v1/realtime/sessions
Headers:
Authorization: Bearer YOUR_API_KEY (required)
Content-Type: application/json (required)
Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model identifier (e.g., gpt-4o-realtime-2024-10-01) |
| voice | string | No | Voice: alloy, echo, fable, onyx, nova, shimmer |
| instructions | string | No | System instructions for the assistant |
| input_audio_format | string | No | Format: pcm16, g711_ulaw, g711_alaw |
| output_audio_format | string | No | Format: pcm16, g711_ulaw, g711_alaw |
| input_audio_transcription | object | No | Transcription configuration |
| turn_detection | object | No | Voice activity detection settings |
| tools | array | No | Available functions/tools |
| tool_choice | string | No | Tool selection strategy: auto, none, required |
| temperature | float | No | Sampling temperature (0.6 to 1.2) |
| max_output_tokens | integer | No | Maximum tokens to generate |
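Putting the parameters above together, a minimal sketch of the session-creation call using only the Python standard library might look like the following (the payload fields mirror the table; the helper names are illustrative, not part of the API):

```python
# Hedged sketch: create a realtime session with stdlib urllib.
# The endpoint and headers match the Request Format above; the
# helper function names are this example's own invention.
import json
import urllib.request

SESSIONS_URL = "https://api.openai.com/v1/realtime/sessions"

def build_session_payload(model, voice="alloy", instructions=None):
    """Assemble a request body from the documented parameters."""
    payload = {"model": model, "voice": voice}
    if instructions is not None:
        payload["instructions"] = instructions
    return payload

def create_session(api_key, **kwargs):
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        SESSIONS_URL,
        data=json.dumps(build_session_payload(**kwargs)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A production client would add timeout and error handling around `urlopen`; this shows only the request shape.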
Response Format
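A successful response includes the session configuration and an ephemeral client secret used to open the WebSocket connection. The exact fields and values below are illustrative, not authoritative:

```json
{
  "id": "sess_001",
  "object": "realtime.session",
  "model": "gpt-4o-realtime-2024-10-01",
  "voice": "alloy",
  "client_secret": {
    "value": "ek_abc123",
    "expires_at": 1712345678
  }
}
```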
Create Transcription Session
Creates a session focused on audio transcription.
Request Format
Endpoint: POST /v1/realtime/transcription_sessions
Request Body:
Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Transcription model (e.g., whisper-1) |
| language | string | No | ISO-639-1 language code |
| temperature | float | No | Sampling temperature (0.0 to 1.0) |
| prompt | string | No | Context for improved accuracy |
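Combining the parameters above, a request body might look like this (values are illustrative):

```json
{
  "model": "whisper-1",
  "language": "en",
  "temperature": 0.2,
  "prompt": "Technical vocabulary: WebSocket, PCM16, VAD"
}
```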
WebSocket Connection
After creating a session, establish a WebSocket connection using the client secret.
Connection URL
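The connection URL carries the model as a query parameter, for example:

```
wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-2024-10-01
```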
Connection Headers
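For server-side clients, authentication travels in the WebSocket handshake headers. A typical header set (the beta header is an assumption to verify against current documentation) is:

```
Authorization: Bearer YOUR_API_KEY
OpenAI-Beta: realtime=v1
```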
WebSocket Protocol
Client Events
Session Update:
Server Events
Session Created:
Usage Examples
Python Example
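A minimal end-to-end sketch: build the connection URL, send a session.update client event, and wait for the server's session.created event. It assumes the third-party `websockets` package (`pip install websockets`); note that the header keyword argument is `extra_headers` in older versions and `additional_headers` in newer ones, so check your installed version:

```python
# Hedged sketch using the third-party `websockets` library.
import asyncio
import json
import os

REALTIME_URL = "wss://api.openai.com/v1/realtime"

def build_url(model):
    """WebSocket URL with the model as a query parameter."""
    return f"{REALTIME_URL}?model={model}"

def session_update(instructions):
    """Client event that reconfigures the live session."""
    return {"type": "session.update", "session": {"instructions": instructions}}

async def run(model):
    import websockets  # third-party; imported lazily

    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",  # assumption: verify against current docs
    }
    # Newer websockets versions use additional_headers= instead.
    async with websockets.connect(build_url(model), extra_headers=headers) as ws:
        await ws.send(json.dumps(session_update("You are a helpful assistant.")))
        async for raw in ws:
            event = json.loads(raw)
            print("server event:", event["type"])
            if event["type"] == "session.created":
                break

if __name__ == "__main__":
    asyncio.run(run("gpt-4o-realtime-2024-10-01"))
```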
JavaScript Example
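The same flow in Node.js, assuming the third-party "ws" package (`npm install ws`). Function names here are this sketch's own; verify headers against the current API reference:

```javascript
// Hypothetical Node.js sketch using the third-party "ws" package.
const REALTIME_URL = "wss://api.openai.com/v1/realtime";

function buildRealtimeUrl(model) {
  // Model travels as a query parameter on the WebSocket URL.
  return `${REALTIME_URL}?model=${encodeURIComponent(model)}`;
}

function sessionUpdate(instructions) {
  // Client event that reconfigures the live session.
  return { type: "session.update", session: { instructions } };
}

function connect(model, apiKey) {
  const WebSocket = require("ws"); // lazy require: third-party dependency
  const ws = new WebSocket(buildRealtimeUrl(model), {
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "OpenAI-Beta": "realtime=v1", // assumption: verify against current docs
    },
  });
  ws.on("open", () => {
    ws.send(JSON.stringify(sessionUpdate("You are a helpful assistant.")));
  });
  ws.on("message", (raw) => {
    const event = JSON.parse(raw);
    console.log("server event:", event.type);
  });
  return ws;
}
```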
Browser Example
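Browsers cannot attach custom headers to a WebSocket handshake, so browser clients typically authenticate another way, such as via WebSocket subprotocols. The subprotocol names below are an assumption to check against current documentation, and a short-lived client secret from the session endpoint should always be preferred over a raw API key in the browser:

```javascript
// Sketch for a browser context. The subprotocol names are an
// assumption; verify against the current API documentation.
function connectRealtime(model, clientSecret) {
  const url = `wss://api.openai.com/v1/realtime?model=${encodeURIComponent(model)}`;
  const ws = new WebSocket(url, [
    "realtime",
    `openai-insecure-api-key.${clientSecret}`,
    "openai-beta.realtime-v1",
  ]);
  ws.addEventListener("open", () => {
    ws.send(JSON.stringify({ type: "session.update", session: { voice: "alloy" } }));
  });
  ws.addEventListener("message", (event) => {
    console.log("server event:", JSON.parse(event.data).type);
  });
  return ws;
}
```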
Turn Detection
Server VAD (Voice Activity Detection)
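With server VAD, the service detects when the user stops speaking and ends the turn automatically. A representative turn_detection configuration (the numeric values are illustrative defaults, not required settings) is:

```json
{
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500
  }
}
```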
Manual Turn Detection
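To control turns manually, disable VAD by setting turn_detection to null in the session configuration; the client then decides when a turn ends by committing the audio buffer and requesting a response. Sketched as client events, one JSON object per WebSocket message:

```json
{ "type": "session.update", "session": { "turn_detection": null } }
{ "type": "input_audio_buffer.commit" }
{ "type": "response.create" }
```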
Error Handling
Connection Errors
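Network drops are expected in long-lived WebSocket sessions, so wrap connection setup in retry logic with exponential backoff. A minimal sketch (the helper names are this example's own):

```python
# Reconnection sketch: exponential backoff with jitter.
import random
import time

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Delay before retry `attempt`, capped and jittered to avoid thundering herds."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)

def connect_with_retry(connect, max_attempts=5):
    """Call `connect` until it succeeds or attempts run out."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError:
            time.sleep(backoff_delay(attempt))
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

On reconnect, the client should also re-send its session configuration, since server-side session state does not survive the old connection.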
API Errors
Common error types:
- invalid_request_error: Malformed request
- authentication_error: Invalid credentials
- rate_limit_error: Too many requests
- server_error: Internal server error
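API errors arrive as server events over the WebSocket. An illustrative shape (field values are examples, not authoritative):

```json
{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_event",
    "message": "Unknown event type.",
    "event_id": "event_123"
  }
}
```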
Best Practices
- Audio Format: Use PCM16 for best quality and compatibility
- Chunking: Send audio in small chunks (100-200ms) for smooth streaming
- Error Recovery: Implement reconnection logic for network failures
- Resource Management: Close sessions when done to free resources
- Turn Detection: Use server VAD for natural conversations
- Buffering: Buffer audio output for smooth playback
- Monitoring: Track latency and audio quality metrics
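The chunking guidance above is easy to compute: at 24 kHz PCM16 mono (an assumed sample rate; adjust for your format), 100 ms of audio is 24000 × 2 × 0.1 = 4800 bytes. A sketch that slices raw PCM into append events of that size (event name taken from typical realtime audio protocols; verify against the current event reference):

```python
# Chunking sketch: slice PCM16 audio into ~100 ms append events.
import base64
import json

SAMPLE_RATE = 24000   # assumption: 24 kHz mono PCM16
BYTES_PER_SAMPLE = 2  # 16-bit samples

def chunk_size_bytes(chunk_ms):
    """Bytes of PCM16 audio covering `chunk_ms` milliseconds."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000

def audio_append_events(pcm_bytes, chunk_ms=100):
    """Yield one JSON-encoded input_audio_buffer.append event per chunk."""
    size = chunk_size_bytes(chunk_ms)
    for i in range(0, len(pcm_bytes), size):
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm_bytes[i:i + size]).decode("ascii"),
        })
```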
Limitations
- Session Duration: Maximum 60 minutes per session
- Concurrent Sessions: 10 sessions per API key
- Rate Limits: 100 session creations per minute
- Audio Formats: Limited to PCM16, G711 μ-law, G711 A-law
- Model Support: Only specific models support realtime features
Supported providers
Realtime Sessions
OpenAI
Full support for realtime audio/text sessions with GPT-4 Realtime models, voice capabilities, and function calling.
Realtime Transcription
OpenAI
Realtime transcription capabilities with low-latency speech-to-text processing.