Audio API

Endpoints

Speech-to-Text

POST /v1/audio/transcriptions
POST /v1/audio/translations

Text-to-Speech

POST /v1/audio/speech

Authentication

Authorization: Bearer <API_KEY>

Audio Transcription

Converts audio files to text in the original language.

Request Format

Headers:

Authorization: Bearer YOUR_API_KEY (required)
Content-Type: multipart/form-data (required)

Form Data:

Field	Type	Required	Description
`file`	file	Yes	Audio file (max 25MB)
`model`	string	Yes	Model identifier (e.g., `whisper-1`)
`language`	string	No	ISO-639-1 language code
`prompt`	string	No	Optional context/previous transcript
`response_format`	string	No	Format: `json`, `text`, `srt`, `verbose_json`, `vtt` (default: `json`)
`temperature`	float	No	Sampling temperature (0.0 to 1.0)
`timestamp_granularities`	array	No	Timestamp levels: `word`, `segment`

Response Format

JSON Response:

{
  "text": "Transcribed text here"
}

Verbose JSON Response:

{
  "task": "transcribe",
  "language": "en",
  "duration": 5.32,
  "text": "Transcribed text here",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Segment text",
      "tokens": [50364, 6550, ...],
      "temperature": 0.0,
      "avg_logprob": -0.23,
      "compression_ratio": 1.2,
      "no_speech_prob": 0.01
    }
  ]
}

Supported Audio Formats

flac
mp3
mp4
mpeg
mpga
m4a
ogg
wav
webm

Audio Translation

Converts audio files to English text.

Request Format

Same as transcription endpoint, but always outputs English text regardless of input language.

Response Format

Same as transcription endpoint.

Text-to-Speech

Generates audio from text input.

Request Format

Headers:

Authorization: Bearer YOUR_API_KEY (required)
Content-Type: application/json (required)

Request Body:

{
  "model": "tts-1",
  "input": "Text to convert to speech",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}

Field	Type	Required	Description
`model`	string	Yes	Model: `tts-1` or `tts-1-hd`
`input`	string	Yes	Text to generate audio for (max 4096 chars)
`voice`	string	Yes	Voice: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
`response_format`	string	No	Format: `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm` (default: `mp3`)
`speed`	float	No	Speed: 0.25 to 4.0 (default: 1.0)

Response Format

Returns audio file in the requested format as binary data.

Usage Examples

Transcription Example

curl -X POST http://localhost:3000/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F file="@audio.mp3" \
  -F model="whisper-1" \
  -F language="en" \
  -F response_format="json"

Translation Example

curl -X POST http://localhost:3000/v1/audio/translations \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F file="@french_audio.mp3" \
  -F model="whisper-1"

Text-to-Speech Example

curl -X POST http://localhost:3000/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, this is a test of text-to-speech.",
    "voice": "alloy"
  }' \
  --output speech.mp3

Python Example

import requests

# Transcription
with open("audio.mp3", "rb") as f:
    response = requests.post(
        "http://localhost:3000/v1/audio/transcriptions",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={
            "model": "whisper-1",
            "response_format": "json"
        }
    )
    print(response.json()["text"])

# Text-to-Speech
response = requests.post(
    "http://localhost:3000/v1/audio/speech",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "model": "tts-1",
        "input": "Hello world!",
        "voice": "nova"
    }
)

with open("output.mp3", "wb") as f:
    f.write(response.content)

JavaScript Example

// Transcription
const formData = new FormData();
formData.append('file', audioFile);
formData.append('model', 'whisper-1');

const transcriptionResponse = await fetch('http://localhost:3000/v1/audio/transcriptions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: formData
});

const { text } = await transcriptionResponse.json();

// Text-to-Speech
const ttsResponse = await fetch('http://localhost:3000/v1/audio/speech', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'tts-1',
    input: 'Hello from JavaScript!',
    voice: 'echo'
  })
});

const audioBlob = await ttsResponse.blob();
const audioUrl = URL.createObjectURL(audioBlob);

Error Responses

400 Bad Request

{
  "error": {
    "message": "Invalid file format",
    "type": "invalid_request_error",
    "code": "invalid_file_format"
  }
}

401 Unauthorized

{
  "error": {
    "message": "Invalid API key",
    "type": "authentication_error",
    "code": "invalid_api_key"
  }
}

413 Request Entity Too Large

{
  "error": {
    "message": "File size exceeds 25MB limit",
    "type": "invalid_request_error",
    "code": "file_too_large"
  }
}

Notes

Maximum file size for audio uploads is 25MB
The whisper-1 model supports multiple languages for transcription
Translation endpoint always outputs English text
Text-to-speech supports multiple voices and output formats
Response formats like SRT and VTT are useful for subtitle generation
Timestamp granularities provide word-level timing information
Speed parameter in TTS allows for faster or slower speech generation

Supported providers

Speech-to-Text (Transcription & Translation)

OpenAI

Industry-leading Whisper models for high-accuracy transcription and translation.

Azure

Microsoft Azure OpenAI Service with enterprise-grade Whisper model support.

Fireworks

Fast and efficient Whisper v3 models for scalable audio processing.

Text-to-Speech

OpenAI

High-quality TTS models with multiple voice options and natural speech synthesis.

Azure

Microsoft Azure OpenAI Service providing enterprise-ready text-to-speech capabilities.

Together

Cartesia Sonic model for efficient and customizable speech generation.

Self-Hosting

Development

OpenAI-Compatible API

Features

Cluster Setup

Endpoints

Speech-to-Text

Text-to-Speech

Authentication

Audio Transcription

Request Format

Response Format

Supported Audio Formats

Audio Translation

Request Format

Response Format

Text-to-Speech

Request Format

Response Format

Usage Examples

Transcription Example

Translation Example

Text-to-Speech Example

Python Example

JavaScript Example

Error Responses

400 Bad Request

401 Unauthorized

413 Request Entity Too Large

Notes

Supported providers

Speech-to-Text (Transcription & Translation)

OpenAI

Azure

Fireworks

Text-to-Speech

OpenAI

Azure

Together

Self-Hosting

Development

OpenAI-Compatible API

Features

Cluster Setup

​Endpoints

​Speech-to-Text

​Text-to-Speech

​Authentication

​Audio Transcription

​Request Format

​Response Format

​Supported Audio Formats

​Audio Translation

​Request Format

​Response Format

​Text-to-Speech

​Request Format

​Response Format

​Usage Examples

​Transcription Example

​Translation Example

​Text-to-Speech Example

​Python Example

​JavaScript Example

​Error Responses

​400 Bad Request

​401 Unauthorized

​413 Request Entity Too Large

​Notes

​Supported providers

​Speech-to-Text (Transcription & Translation)

OpenAI

Azure

Fireworks

​Text-to-Speech

OpenAI

Azure

Together

Endpoints

Speech-to-Text

Text-to-Speech

Authentication

Audio Transcription

Request Format

Response Format

Supported Audio Formats

Audio Translation

Request Format

Response Format

Text-to-Speech

Request Format

Response Format

Usage Examples

Transcription Example

Translation Example

Text-to-Speech Example

Python Example

JavaScript Example

Error Responses

400 Bad Request

401 Unauthorized

413 Request Entity Too Large

Notes

Supported providers

Speech-to-Text (Transcription & Translation)

Text-to-Speech