Endpoints
Speech-to-Text
Text-to-Speech
Authentication
Audio Transcription
Converts audio files to text in the original language.
Request Format
Headers:
- Authorization: Bearer YOUR_API_KEY (required)
- Content-Type: multipart/form-data (required)
| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file (max 25MB) |
| model | string | Yes | Model identifier (e.g., whisper-1) |
| language | string | No | ISO-639-1 language code |
| prompt | string | No | Optional context/previous transcript |
| response_format | string | No | Format: json, text, srt, verbose_json, vtt (default: json) |
| temperature | float | No | Sampling temperature (0.0 to 1.0) |
| timestamp_granularities | array | No | Timestamp levels: word, segment |
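The constraints in the table can be checked on the client before anything is uploaded. The following is a minimal sketch, not part of the API itself: the helper name `build_transcription_fields` is hypothetical, and treating "25MB" as 25 * 1024 * 1024 bytes is an assumption.

```python
from typing import Optional

# Limits taken from the table above. Reading "25MB" as binary megabytes is an assumption.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
GRANULARITIES = {"word", "segment"}


def build_transcription_fields(
    file_size: int,
    model: str,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    response_format: str = "json",
    temperature: Optional[float] = None,
    timestamp_granularities: Optional[list] = None,
) -> dict:
    """Validate the non-file form fields and return only the ones that were set."""
    if file_size > MAX_UPLOAD_BYTES:
        raise ValueError("audio file exceeds the 25MB limit")
    if not model:
        raise ValueError("model is required")
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    unknown = set(timestamp_granularities or []) - GRANULARITIES
    if unknown:
        raise ValueError(f"unsupported timestamp granularities: {sorted(unknown)}")

    # Required field plus the documented default for response_format.
    fields = {"model": model, "response_format": response_format}
    # Optional fields are included only when the caller supplied them.
    if language is not None:
        fields["language"] = language
    if prompt is not None:
        fields["prompt"] = prompt
    if temperature is not None:
        fields["temperature"] = temperature
    if timestamp_granularities:
        fields["timestamp_granularities"] = list(timestamp_granularities)
    return fields
```

Calling `build_transcription_fields(1024, "whisper-1", language="en")` returns a dictionary that can be passed to whichever HTTP client builds the multipart body.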
Response Format
JSON Response:
Supported Audio Formats
- flac
- mp3
- mp4
- mpeg
- mpga
- m4a
- ogg
- wav
- webm
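A quick pre-upload check against the supported formats might look like the sketch below. It assumes the format can be inferred from the file extension alone, which may not match how the server detects the format; `is_supported_audio` is a hypothetical helper name.

```python
from pathlib import Path

# Extensions taken from the supported formats listed above.
SUPPORTED_EXTENSIONS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension (case-insensitive) is in the supported list."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS
```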
Audio Translation
Converts audio files to English text.
Request Format
Same as the transcription endpoint, but the output is always English text regardless of the input language.
Response Format
Same as the transcription endpoint.
Text-to-Speech
Generates audio from text input.
Request Format
Headers:
- Authorization: Bearer YOUR_API_KEY (required)
- Content-Type: application/json (required)
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model: tts-1 or tts-1-hd |
| input | string | Yes | Text to generate audio for (max 4096 chars) |
| voice | string | Yes | Voice: alloy, echo, fable, onyx, nova, shimmer |
| response_format | string | No | Format: mp3, opus, aac, flac, wav, pcm (default: mp3) |
| speed | float | No | Speed: 0.25 to 4.0 (default: 1.0) |
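As an illustration of the request body, the sketch below assembles the JSON payload and enforces the limits from the table. The helper name `build_speech_payload` is hypothetical, and the model is only checked for presence here, on the assumption that the set of accepted model identifiers may vary.

```python
import json

# Values taken from the table above.
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
AUDIO_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
MAX_INPUT_CHARS = 4096


def build_speech_payload(
    model: str,
    text: str,
    voice: str,
    response_format: str = "mp3",
    speed: float = 1.0,
) -> str:
    """Validate the text-to-speech parameters and return the JSON request body."""
    if not model:
        raise ValueError("model is required")
    if not text:
        raise ValueError("input is required")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input exceeds 4096 characters")
    if voice not in VOICES:
        raise ValueError(f"unsupported voice: {voice}")
    if response_format not in AUDIO_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return json.dumps(
        {
            "model": model,
            "input": text,
            "voice": voice,
            "response_format": response_format,
            "speed": speed,
        }
    )
```

The returned string is sent as the request body with `Content-Type: application/json`.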
Response Format
Returns the audio file in the requested format as binary data.
Usage Examples
Transcription Example
Translation Example
Text-to-Speech Example
Python Example
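A minimal sketch of a transcription request using only the Python standard library. The URL below is a hypothetical placeholder, since this section does not state the endpoint address, and the helper names `encode_multipart` and `build_transcription_request` are illustrative. The code builds the request but does not send it.

```python
import urllib.request
import uuid

# Hypothetical placeholder: substitute the real transcription endpoint URL.
TRANSCRIPTION_URL = "https://api.example.com/v1/audio/transcriptions"


def encode_multipart(fields: dict, filename: str, audio: bytes) -> tuple:
    """Encode text fields plus one file part as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # The audio goes in the "file" field, as listed in the request table.
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + audio
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(
    api_key: str, filename: str, audio: bytes, model: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request with the two required headers."""
    body, content_type = encode_multipart({"model": model}, filename, audio)
    return urllib.request.Request(
        TRANSCRIPTION_URL,
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )
```

To send the request, pass the result to `urllib.request.urlopen` and read the response body, which is JSON by default according to the `response_format` field.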
JavaScript Example
Error Responses
400 Bad Request
401 Unauthorized
413 Request Entity Too Large
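One way a client might surface these status codes is sketched below. The exception class `AudioAPIError` and the hint texts are hypothetical; the hints are inferred from the required headers and the 25MB upload limit described in this section, not from the actual error bodies.

```python
class AudioAPIError(Exception):
    """Hypothetical client-side exception carrying the HTTP status code."""

    def __init__(self, status: int, hint: str):
        super().__init__(f"HTTP {status}: {hint}")
        self.status = status


# Hints are assumptions based on the request requirements documented above.
HINTS = {
    400: "Bad Request: check required fields and parameter values",
    401: "Unauthorized: check the Authorization: Bearer header",
    413: "Request Entity Too Large: audio uploads are limited to 25MB",
}


def raise_for_status(status: int) -> None:
    """Do nothing for 2xx responses; raise AudioAPIError for anything else."""
    if 200 <= status < 300:
        return
    raise AudioAPIError(status, HINTS.get(status, "unexpected error"))
```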
Notes
- Maximum file size for audio uploads is 25MB
- The whisper-1 model supports multiple languages for transcription
- The translation endpoint always outputs English text
- Text-to-speech supports multiple voices and output formats
- Response formats like SRT and VTT are useful for subtitle generation
- Timestamp granularities provide word-level and segment-level timing information
- Speed parameter in TTS allows for faster or slower speech generation
Supported providers
Speech-to-Text (Transcription & Translation)
OpenAI
Industry-leading Whisper models for high-accuracy transcription and translation.
Azure
Microsoft Azure OpenAI Service with enterprise-grade Whisper model support.
Fireworks
Fast and efficient Whisper v3 models for scalable audio processing.
Text-to-Speech
OpenAI
High-quality TTS models with multiple voice options and natural speech synthesis.
Azure
Microsoft Azure OpenAI Service providing enterprise-ready text-to-speech capabilities.
Together
Cartesia Sonic model for efficient and customizable speech generation.