Endpoints

Speech-to-Text

POST /v1/audio/transcriptions
POST /v1/audio/translations

Text-to-Speech

POST /v1/audio/speech

Authentication

Authorization: Bearer <API_KEY>

Audio Transcription

Converts audio files to text in the original language.

Request Format

Headers:
  • Authorization: Bearer YOUR_API_KEY (required)
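  • Content-Type: multipart/form-data (required; curl -F, Python requests, and fetch with FormData set this header, including the boundary, automatically)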
Form Data:
Field                    Type    Required  Description
file                     file    Yes       Audio file (max 25MB)
model                    string  Yes       Model identifier (e.g., whisper-1)
language                 string  No        ISO-639-1 language code
prompt                   string  No        Optional context/previous transcript
response_format          string  No        Format: json, text, srt, verbose_json, vtt (default: json)
temperature              float   No        Sampling temperature (0.0 to 1.0)
timestamp_granularities  array   No        Timestamp levels: word, segment
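
The optional fields above can be assembled and checked on the client before sending. The following is a minimal sketch, not part of the API itself: the helper name build_transcription_fields is hypothetical, and the checks mirror only the constraints listed in the table. timestamp_granularities is left out because this page does not specify how the array is encoded in multipart form data.

```python
from typing import Dict, Optional

# Values taken from the response_format row of the table above.
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}


def build_transcription_fields(
    model: str,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    response_format: str = "json",
    temperature: Optional[float] = None,
) -> Dict[str, str]:
    """Return the non-file form fields for a transcription request.

    Hypothetical helper: it only enforces the limits documented in the table.
    """
    if not model:
        raise ValueError("model is required")
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")

    fields = {"model": model, "response_format": response_format}
    # Optional fields are omitted entirely when unset.
    if language is not None:
        fields["language"] = language
    if prompt is not None:
        fields["prompt"] = prompt
    if temperature is not None:
        fields["temperature"] = str(temperature)
    return fields
```

The returned dictionary could be passed as the data argument of requests.post, alongside files={"file": f}.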

Response Format

JSON Response:
{
  "text": "Transcribed text here"
}
Verbose JSON Response:
{
  "task": "transcribe",
  "language": "en",
  "duration": 5.32,
  "text": "Transcribed text here",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Segment text",
      "tokens": [50364, 6550, ...],
      "temperature": 0.0,
      "avg_logprob": -0.23,
      "compression_ratio": 1.2,
      "no_speech_prob": 0.01
    }
  ]
}
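
As an illustration of how the verbose response might be consumed, the sketch below keeps only the timing and text of each segment and drops segments that are probably silence. The function name speech_segments and the 0.5 threshold on no_speech_prob are assumptions chosen for the example, not values defined by the API.

```python
from typing import Any, Dict, List


def speech_segments(
    verbose: Dict[str, Any], max_no_speech_prob: float = 0.5
) -> List[Dict[str, Any]]:
    """Return start/end/text for segments that likely contain speech.

    The 0.5 cutoff is an arbitrary illustrative default.
    """
    kept = []
    for seg in verbose.get("segments", []):
        # Treat a missing no_speech_prob as "contains speech".
        if seg.get("no_speech_prob", 0.0) > max_no_speech_prob:
            continue
        kept.append(
            {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        )
    return kept


# Example input shaped like the verbose_json response above (values made up).
verbose = {
    "task": "transcribe",
    "language": "en",
    "duration": 5.32,
    "text": "Segment text",
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.5, "text": " Segment text", "no_speech_prob": 0.01},
        {"id": 1, "start": 2.5, "end": 5.32, "text": " (hum)", "no_speech_prob": 0.93},
    ],
}
print(speech_segments(verbose))
# [{'start': 0.0, 'end': 2.5, 'text': 'Segment text'}]
```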

Supported Audio Formats

  • flac
  • mp3
  • mp4
  • mpeg
  • mpga
  • m4a
  • ogg
  • wav
  • webm
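
A client can reject unsupported files before uploading them, which avoids a round trip that would end in a 400 or 413. This is a minimal sketch under two assumptions: the format is inferred from the file extension alone, and "25MB" is read as 25 * 1024 * 1024 bytes (the server may count differently). The name check_audio_upload is hypothetical.

```python
import os

# Extensions from the list above.
SUPPORTED_EXTENSIONS = {
    "flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm",
}
# Assumption: "25MB" means 25 MiB; the server's exact limit may differ.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def check_audio_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file looks unsupported or too large."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"file is {size_bytes} bytes; limit is {MAX_UPLOAD_BYTES}")
```

For a file on disk, the call could look like check_audio_upload(path, os.path.getsize(path)).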

Audio Translation

Converts audio files to English text.

Request Format

Same as the transcription endpoint. The output is always English text, regardless of the input language.

Response Format

Same as transcription endpoint.

Text-to-Speech

Generates audio from text input.

Request Format

Headers:
  • Authorization: Bearer YOUR_API_KEY (required)
  • Content-Type: application/json (required)
Request Body:
{
  "model": "tts-1",
  "input": "Text to convert to speech",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}
Field            Type    Required  Description
model            string  Yes       Model: tts-1 or tts-1-hd
input            string  Yes       Text to generate audio for (max 4096 chars)
voice            string  Yes       Voice: alloy, echo, fable, onyx, nova, shimmer
response_format  string  No        Format: mp3, opus, aac, flac, wav, pcm (default: mp3)
speed            float   No        Speed: 0.25 to 4.0 (default: 1.0)
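
The numeric and length limits in the table can be enforced before the request is sent. The sketch below is illustrative only: build_speech_payload is a hypothetical helper, and it deliberately does not restrict model or voice to the values listed, on the assumption that other providers routed through the same endpoint may accept different identifiers.

```python
from typing import Any, Dict

# Limits copied from the table above.
TTS_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
MAX_INPUT_CHARS = 4096
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def build_speech_payload(
    text: str,
    voice: str,
    model: str = "tts-1",
    response_format: str = "mp3",
    speed: float = 1.0,
) -> Dict[str, Any]:
    """Return a JSON-serializable body for POST /v1/audio/speech."""
    if not text:
        raise ValueError("input text is required")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input is {len(text)} chars; limit is {MAX_INPUT_CHARS}")
    if not voice:
        raise ValueError("voice is required")
    if response_format not in TTS_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
```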

Response Format

Returns the generated audio as binary data in the requested format.

Usage Examples

Transcription Example

curl -X POST http://localhost:3000/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F file="@audio.mp3" \
  -F model="whisper-1" \
  -F language="en" \
  -F response_format="json"

Translation Example

curl -X POST http://localhost:3000/v1/audio/translations \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F file="@french_audio.mp3" \
  -F model="whisper-1"

Text-to-Speech Example

curl -X POST http://localhost:3000/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, this is a test of text-to-speech.",
    "voice": "alloy"
  }' \
  --output speech.mp3

Python Example

import requests

# Transcription
with open("audio.mp3", "rb") as f:
    response = requests.post(
        "http://localhost:3000/v1/audio/transcriptions",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={
            "model": "whisper-1",
            "response_format": "json"
        }
    )
    response.raise_for_status()  # fail early on 4xx/5xx instead of on a missing "text" key
    print(response.json()["text"])

# Text-to-Speech
response = requests.post(
    "http://localhost:3000/v1/audio/speech",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "model": "tts-1",
        "input": "Hello world!",
        "voice": "nova"
    }
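)
response.raise_for_status()  # avoid saving a JSON error body as output.mp3

# The response body is the raw audio in the requested format (mp3 by default)
with open("output.mp3", "wb") as f:
    f.write(response.content)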

JavaScript Example

// Transcription
// audioFile is assumed to be a File or Blob, e.g. from an <input type="file"> element
const formData = new FormData();
formData.append('file', audioFile);
formData.append('model', 'whisper-1');

const transcriptionResponse = await fetch('http://localhost:3000/v1/audio/transcriptions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: formData
});

const { text } = await transcriptionResponse.json();

// Text-to-Speech
const ttsResponse = await fetch('http://localhost:3000/v1/audio/speech', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'tts-1',
    input: 'Hello from JavaScript!',
    voice: 'echo'
  })
});

const audioBlob = await ttsResponse.blob();
const audioUrl = URL.createObjectURL(audioBlob);

Error Responses

400 Bad Request

{
  "error": {
    "message": "Invalid file format",
    "type": "invalid_request_error",
    "code": "invalid_file_format"
  }
}

401 Unauthorized

{
  "error": {
    "message": "Invalid API key",
    "type": "authentication_error",
    "code": "invalid_api_key"
  }
}

413 Request Entity Too Large

{
  "error": {
    "message": "File size exceeds 25MB limit",
    "type": "invalid_request_error",
    "code": "file_too_large"
  }
}
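
All three error bodies share the same envelope, so a client can map them to a single exception type. The following is a sketch under that assumption; the class AudioAPIError and the function parse_error are hypothetical names, and the fallback for non-JSON bodies is a design choice rather than documented behavior.

```python
import json
from typing import Optional


class AudioAPIError(Exception):
    """Carries the fields of the error envelope shown above."""

    def __init__(self, status: int, message: str,
                 type_: Optional[str] = None, code: Optional[str] = None):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message
        self.type = type_
        self.code = code


def parse_error(status: int, body: str) -> AudioAPIError:
    """Build an AudioAPIError from an HTTP status and a response body."""
    try:
        err = json.loads(body).get("error", {})
    except (ValueError, AttributeError):
        # Body was not JSON, or was JSON but not an object.
        err = {}
    if not isinstance(err, dict):
        err = {}
    # Fall back to the raw body, then to a generic message.
    message = err.get("message") or body or f"HTTP {status}"
    return AudioAPIError(status, message, err.get("type"), err.get("code"))
```

With requests, this could be used as: if not response.ok: raise parse_error(response.status_code, response.text).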

Notes

  • Maximum file size for audio uploads is 25MB
  • The whisper-1 model supports multiple languages for transcription
  • Translation endpoint always outputs English text
  • Text-to-speech supports multiple voices and output formats
  • Response formats like SRT and VTT are useful for subtitle generation
  • Timestamp granularities provide word-level or segment-level timing information
  • The TTS speed parameter accepts values from 0.25 to 4.0; values above 1.0 speed speech up and values below 1.0 slow it down
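
If a transcript was already fetched as verbose_json, subtitles can also be derived on the client instead of issuing a second request with response_format set to srt. The sketch below assumes segments shaped like those in the verbose JSON response (start, end, text); the function names are hypothetical, and the output follows the common SubRip layout of index, HH:MM:SS,mmm time range, and text.

```python
from typing import Any, Dict, Iterable


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Dict[str, Any]]) -> str:
    """Render segments as SRT cues, numbered from 1."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


print(segments_to_srt([{"start": 0.0, "end": 2.5, "text": "Segment text"}]))
```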

Supported providers

Speech-to-Text (Transcription & Translation)

OpenAI

Industry-leading Whisper models for high-accuracy transcription and translation.

Azure

Microsoft Azure OpenAI Service with enterprise-grade Whisper model support.

Fireworks

Fast and efficient Whisper v3 models for scalable audio processing.

Text-to-Speech

OpenAI

High-quality TTS models with multiple voice options and natural speech synthesis.

Azure

Microsoft Azure OpenAI Service providing enterprise-ready text-to-speech capabilities.

Together

Cartesia Sonic model for efficient and customizable speech generation.