
Endpoint

POST /v1/chat/completions

Authentication

Authorization: Bearer <API_KEY>
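The endpoint and header above can be wrapped in a small client helper. The following is a minimal JavaScript sketch. It assumes Node.js 18+ (for the global fetch) and a gateway at http://localhost:3000, as in the usage examples. The names buildChatRequest and createChatCompletion are hypothetical and are not part of any published SDK.

```javascript
// Hypothetical helper: builds the URL and fetch options for a chat completion request.
function buildChatRequest(baseUrl, apiKey, body) {
  return {
    // Strip trailing slashes so the path is not doubled.
    url: `${baseUrl.replace(/\/+$/, '')}/v1/chat/completions`,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${apiKey}`
      },
      body: JSON.stringify(body)
    }
  };
}

// Hypothetical helper: sends the request and returns the parsed JSON response.
async function createChatCompletion(baseUrl, apiKey, body) {
  const {url, init} = buildChatRequest(baseUrl, apiKey, body);
  const response = await fetch(url, init);
  const payload = await response.json();
  if (!response.ok) {
    // Error bodies use the shape {"error": {"message", "type", "code"}}.
    throw new Error(payload.error?.message ?? `HTTP ${response.status}`);
  }
  return payload;
}
```

A call such as createChatCompletion('http://localhost:3000', 'YOUR_API_KEY', {model: 'qwen3-4b', messages: [...]}) would return the standard response object described under Response Format.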

Request Body

Required Fields

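| Field | Type | Description |
| --- | --- | --- |
| model | string | Model identifier, such as a deployment name, an adapter name, or a routing identifier (e.g., tensorzero::function_name::assistant) |
| messages | array | Array of message objects forming the conversation |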

Optional Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| temperature | float | Model default | Sampling temperature (0.0 to 2.0) |
| max_tokens | integer | Model default | Maximum number of tokens to generate |
| max_completion_tokens | integer | Model default | Alternative to max_tokens (for OpenAI compatibility) |
| top_p | float | Model default | Nucleus sampling parameter |
| frequency_penalty | float | 0.0 | Penalize tokens in proportion to how often they have already appeared (-2.0 to 2.0) |
| presence_penalty | float | 0.0 | Penalize tokens that have already appeared at least once (-2.0 to 2.0) |
| seed | integer | null | Random seed for reproducibility |
| stream | boolean | false | Enable a streaming (Server-Sent Events) response |
| stream_options | object | null | Streaming configuration (e.g., {"include_usage": true}) |
| logprobs | boolean | false | Return token log probabilities |
| response_format | object | null | Output format control |
| tools | array | null | Available tool/function definitions |
| tool_choice | string/object | "auto" | Tool selection strategy |
| parallel_tool_calls | boolean | true | Allow parallel tool calls |
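The required fields and numeric ranges in the tables above can be checked on the client before a request is sent. This is a minimal sketch; validateChatRequest is a hypothetical helper, and the server remains the authority on what it accepts.

```javascript
// Hypothetical client-side check based only on the documented fields and ranges.
// Returns a list of problems; an empty list means the body looks valid.
function validateChatRequest(body) {
  const errors = [];
  if (typeof body.model !== 'string' || body.model.length === 0) {
    errors.push('model must be a non-empty string');
  }
  if (!Array.isArray(body.messages)) {
    errors.push('messages must be an array');
  }
  // [field, min, max] for the ranges stated in the Optional Fields table.
  const ranges = [
    ['temperature', 0.0, 2.0],
    ['frequency_penalty', -2.0, 2.0],
    ['presence_penalty', -2.0, 2.0]
  ];
  for (const [field, min, max] of ranges) {
    const value = body[field];
    // Omitted (undefined) or null values fall back to the defaults, so skip them.
    if (value != null && (!Number.isFinite(value) || value < min || value > max)) {
      errors.push(`${field} must be a number between ${min} and ${max}`);
    }
  }
  return errors;
}
```

For example, a body with temperature: 3 yields one error for temperature, while omitting optional fields entirely is accepted.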

Additional fields

| Field | Type | Description |
| --- | --- | --- |
| chat_template | string | Custom chat template |
| chat_template_kwargs | object | Template parameters (e.g., {"enable_thinking": true}) |
| mm_processor_kwargs | object | Multi-modal processor parameters |
| guided_json | object | JSON schema for guided generation |
| guided_regex | string | Regex pattern for guided generation |
| guided_choice | array | List of allowed values |
| guided_grammar | string | Grammar for guided generation |
| structural_tag | string | Structural generation tag |
| guided_decoding_backend | string | Backend for guided decoding |
| guided_whitespace_pattern | string | Whitespace pattern for guided generation |
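As an illustration of the additional fields, the sketch below builds a request that restricts the answer to a fixed label set with guided_choice and passes chat_template_kwargs to the chat template. The helper names are hypothetical. Two points are assumptions: that enable_thinking: false disables thinking for templates that support it (such as Qwen3), and that, as in vLLM, only one guided_* constraint should be set per request.

```javascript
// guided_* fields from the table above that each constrain the output.
const GUIDED_FIELDS = ['guided_json', 'guided_regex', 'guided_choice', 'guided_grammar'];

// Hypothetical helper: a sentiment classifier whose output must be one of `labels`.
function buildClassificationRequest(model, text, labels) {
  return {
    model,
    messages: [
      {role: 'user', content: `Classify the sentiment of: ${text}`}
    ],
    guided_choice: labels,
    // Forwarded to the chat template (assumed to turn thinking off where supported).
    chat_template_kwargs: {enable_thinking: false}
  };
}

// Assumption: a request should set at most one guided_* constraint.
function countGuidedConstraints(body) {
  return GUIDED_FIELDS.filter((field) => body[field] !== undefined).length;
}

const body = buildClassificationRequest('qwen3-4b', 'I love this product', ['positive', 'negative', 'neutral']);
console.log(JSON.stringify(body, null, 2));
```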

Message Object Format

{
  "role": "system" | "user" | "assistant" | "tool",
  "content": "string" | [{...}],  // String or array of content blocks
  "tool_calls": [...],             // For assistant messages
  "tool_call_id": "string"         // For tool messages
}
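To show how the four roles fit together, here is a sketch of a conversation history that contains a tool round trip. The tool_calls entry uses the OpenAI-compatible shape (id, type, and function with name and a JSON-string arguments field); that shape is an assumption based on the endpoint's OpenAI compatibility. The ID call_1 and the tool result values are placeholders, and toolMessagesAreLinked is a hypothetical helper.

```javascript
// A conversation containing all four message roles.
const messages = [
  {role: 'system', content: 'You are a helpful weather assistant.'},
  {role: 'user', content: 'What is the weather in San Francisco?'},
  {
    // The assistant requests a tool call instead of answering directly.
    role: 'assistant',
    content: null,
    tool_calls: [
      {
        id: 'call_1', // placeholder; real IDs come from the model's response
        type: 'function',
        function: {name: 'get_weather', arguments: JSON.stringify({location: 'San Francisco'})}
      }
    ]
  },
  // The tool result points back to the call through tool_call_id (placeholder values).
  {role: 'tool', tool_call_id: 'call_1', content: JSON.stringify({temperature: 18, unit: 'celsius'})}
];

// Checks that every tool message answers a tool call made earlier in the history.
function toolMessagesAreLinked(history) {
  const callIds = new Set();
  for (const message of history) {
    for (const call of message.tool_calls ?? []) callIds.add(call.id);
    if (message.role === 'tool' && !callIds.has(message.tool_call_id)) return false;
  }
  return true;
}
```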

Content Block Types

Text Content:
{
  "type": "text",
  "text": "Your message here"
}
Image Content:
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/image.jpg"  // or data:image/jpeg;base64,...
  }
}
File Content:
{
  "type": "file",
  "file": {
    "file_data": "data:application/pdf;base64,...",
    "filename": "document.pdf"
  }
}
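The image and file blocks above carry their data as base64 data URLs. A minimal Node.js sketch of building them from raw bytes follows; imageBlock, fileBlock, and userMessage are hypothetical helper names, and the default MIME types are assumptions that should match the actual file.

```javascript
// Hypothetical helper: wraps raw image bytes in an image_url content block.
function imageBlock(bytes, mimeType = 'image/jpeg') {
  return {
    type: 'image_url',
    image_url: {url: `data:${mimeType};base64,${Buffer.from(bytes).toString('base64')}`}
  };
}

// Hypothetical helper: wraps raw file bytes (e.g., a PDF) in a file content block.
function fileBlock(bytes, filename, mimeType = 'application/pdf') {
  return {
    type: 'file',
    file: {
      file_data: `data:${mimeType};base64,${Buffer.from(bytes).toString('base64')}`,
      filename
    }
  };
}

// A user message that mixes a text block with any number of other blocks.
function userMessage(text, ...blocks) {
  return {role: 'user', content: [{type: 'text', text}, ...blocks]};
}
```

In practice the bytes would typically come from fs.readFileSync(path), for example userMessage('Summarize this document', fileBlock(fs.readFileSync('document.pdf'), 'document.pdf')).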

Response Format

Standard Response

{
  "id": "01977ed9-7492-7b70-8347-955764f97b3d",
  "episode_id": "01977ed9-7492-7b70-8347-9564aeb44a24",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Response text here",
        "tool_calls": [],
        "reasoning_content": "Optional reasoning/thinking content"
      },
      "logprobs": null
    }
  ],
  "created": 1750179872,
  "model": "qwen_3_4b",
  "system_fingerprint": "",
  "service_tier": "",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 133,
    "total_tokens": 145
  }
}
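Most clients only need a few fields from this response. The sketch below extracts them from the example above; summarizeCompletion is a hypothetical helper, and it assumes a single choice, as in the example.

```javascript
// Hypothetical helper: pulls the commonly used parts out of a non-streaming response.
function summarizeCompletion(response) {
  const choice = response.choices[0];
  return {
    text: choice.message.content ?? '',
    reasoning: choice.message.reasoning_content ?? null,
    toolCalls: choice.message.tool_calls ?? [],
    finishReason: choice.finish_reason,
    totalTokens: response.usage?.total_tokens ?? 0
  };
}

// The example response from above, abbreviated to the fields used here.
const example = {
  id: '01977ed9-7492-7b70-8347-955764f97b3d',
  episode_id: '01977ed9-7492-7b70-8347-9564aeb44a24',
  choices: [{
    index: 0,
    finish_reason: 'stop',
    message: {
      role: 'assistant',
      content: 'Response text here',
      tool_calls: [],
      reasoning_content: 'Optional reasoning/thinking content'
    },
    logprobs: null
  }],
  created: 1750179872,
  model: 'qwen_3_4b',
  object: 'chat.completion',
  usage: {prompt_tokens: 12, completion_tokens: 133, total_tokens: 145}
};

console.log(summarizeCompletion(example));
```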

Streaming Response

When stream is true, the endpoint returns Server-Sent Events (SSE):
data: {"id":"...","object":"chat.completion.chunk","created":1750179872,"model":"...","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"...","object":"chat.completion.chunk","created":1750179872,"model":"...","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}

data: {"id":"...","object":"chat.completion.chunk","created":1750179872,"model":"...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":5,"total_tokens":15}}

data: [DONE]
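The event stream above can be folded back into a single result by concatenating the delta.content pieces until the [DONE] sentinel. The following is a minimal sketch that works on the complete SSE body as one string; collectStream is a hypothetical helper, and it only handles "data: " lines as shown in the example.

```javascript
// Hypothetical helper: reassembles text, finish reason, and usage from a raw SSE body.
function collectStream(sseText) {
  let text = '';
  let usage = null;
  let finishReason = null;
  for (const rawLine of sseText.split('\n')) {
    const line = rawLine.trim();
    if (!line.startsWith('data: ')) continue; // skip blank separator lines
    const data = line.slice(6);
    if (data === '[DONE]') break; // end-of-stream sentinel
    const chunk = JSON.parse(data);
    const choice = chunk.choices?.[0];
    if (choice?.delta?.content) text += choice.delta.content;
    if (choice?.finish_reason) finishReason = choice.finish_reason;
    if (chunk.usage) usage = chunk.usage; // present on the final chunk in the example
  }
  return {text, finishReason, usage};
}

// The example stream from above.
const sample = [
  'data: {"id":"...","object":"chat.completion.chunk","created":1750179872,"model":"...","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
  '',
  'data: {"id":"...","object":"chat.completion.chunk","created":1750179872,"model":"...","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}',
  '',
  'data: {"id":"...","object":"chat.completion.chunk","created":1750179872,"model":"...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":5,"total_tokens":15}}',
  '',
  'data: [DONE]'
].join('\n');

console.log(collectStream(sample));
```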

Usage Examples

Basic Chat Completion

curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "qwen3-4b",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'

With Tool/Function Calling

curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "tensorzero::function_name::assistant",
    "messages": [
      {"role": "user", "content": "What is the weather in San Francisco?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get weather information",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {"type": "string"},
              "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'
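When the response ends with finish_reason "tool_calls", the client runs the requested tools and sends the results back in a follow-up request. The sketch below builds those follow-up messages. getWeather is a hypothetical stand-in that returns canned data, and the assumption that function.arguments is a JSON string follows the OpenAI-compatible format.

```javascript
// Hypothetical local implementation of the get_weather tool declared above.
// It returns canned data; a real one would call a weather service.
function getWeather({location, unit = 'celsius'}) {
  return {location, unit, forecast: 'sunny'};
}

const toolImplementations = {get_weather: getWeather};

// Runs each requested tool and returns the assistant turn plus one tool message per call.
function runToolCalls(assistantMessage, implementations) {
  const toolMessages = assistantMessage.tool_calls.map((call) => {
    const impl = implementations[call.function.name];
    // Arguments are assumed to arrive as a JSON string.
    const args = JSON.parse(call.function.arguments);
    const result = impl ? impl(args) : {error: `unknown tool ${call.function.name}`};
    return {role: 'tool', tool_call_id: call.id, content: JSON.stringify(result)};
  });
  return [assistantMessage, ...toolMessages];
}

// Example assistant message with a placeholder call ID.
const assistantMessage = {
  role: 'assistant',
  content: null,
  tool_calls: [{id: 'call_1', type: 'function', function: {name: 'get_weather', arguments: '{"location":"San Francisco"}'}}]
};

// Messages for the second request; the tools array would be sent again unchanged.
const followUpMessages = [
  {role: 'user', content: 'What is the weather in San Francisco?'},
  ...runToolCalls(assistantMessage, toolImplementations)
];
```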

With Guided JSON Generation

curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "tensorzero::model_name::vllm_model",
    "messages": [
      {"role": "user", "content": "Extract person information"}
    ],
    "guided_json": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string", "format": "email"}
      },
      "required": ["name", "age"]
    }
  }'
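Even with guided_json, it is reasonable to check the returned content before using it. The sketch below parses choices[0].message.content and checks the schema's required list and basic string/integer types. checkAgainstSchema is a hypothetical helper, not a full JSON Schema validator (it ignores keywords such as format).

```javascript
// The schema sent in the guided_json example above.
const personSchema = {
  type: 'object',
  properties: {
    name: {type: 'string'},
    age: {type: 'integer'},
    email: {type: 'string', format: 'email'}
  },
  required: ['name', 'age']
};

// Hypothetical minimal check: returns null if no problems are found, else a message.
function checkAgainstSchema(content, schema) {
  const value = JSON.parse(content); // guided output is expected to be valid JSON
  if (value === null || typeof value !== 'object' || Array.isArray(value)) {
    return 'expected a JSON object';
  }
  for (const key of schema.required ?? []) {
    if (!(key in value)) return `missing required field "${key}"`;
  }
  for (const [key, prop] of Object.entries(schema.properties ?? {})) {
    if (!(key in value)) continue;
    const v = value[key];
    const ok =
      prop.type === 'string' ? typeof v === 'string' :
      prop.type === 'integer' ? Number.isInteger(v) :
      true; // other types are not checked in this sketch
    if (!ok) return `field "${key}" is not of type ${prop.type}`;
  }
  return null;
}
```

Typical use would be checkAgainstSchema(response.choices[0].message.content, personSchema).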

With Streaming

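// Assumes Node.js 18+ (global fetch) in an ES module, where top-level await is allowed.
const response = await fetch('http://localhost:3000/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: JSON.stringify({
    model: 'qwen3-4b',
    messages: [{role: 'user', content: 'Tell me a story'}],
    stream: true,
    stream_options: {include_usage: true}
  })
});

if (!response.ok) {
  throw new Error(`Request failed with status ${response.status}`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';       // partial line carried over between reads
let finished = false;  // set once the [DONE] sentinel arrives

while (!finished) {
  const {done, value} = await reader.read();
  if (done) break;

  // {stream: true} keeps multi-byte characters that span two reads intact.
  buffer += decoder.decode(value, {stream: true});
  const lines = buffer.split('\n');
  buffer = lines.pop(); // the last piece may be an incomplete line

  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;

    const data = line.slice(6).trim();
    if (data === '[DONE]') {
      finished = true; // exit the outer loop too, not just this for loop
      break;
    }

    const parsed = JSON.parse(data);
    // A usage-only chunk may carry no content, so guard every level.
    const content = parsed.choices?.[0]?.delta?.content;
    if (content) {
      process.stdout.write(content);
    }
  }
}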

With Multi-modal Content

curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "qwen2-7b-vl",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {
            "type": "image_url",
            "image_url": {
              "url": "..."
            }
          }
        ]
      }
    ]
  }'

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique inference ID (UUID) |
| episode_id | string | Episode ID for grouped inferences |
| choices | array | Array of completion choices (usually 1) |
| choices[].index | integer | Choice index (always 0 for a single completion) |
| choices[].finish_reason | string | Reason for completion: stop, length, content_filter, tool_calls |
| choices[].message | object | Generated message object |
| choices[].message.role | string | Always "assistant" |
| choices[].message.content | string/null | Generated text content |
| choices[].message.tool_calls | array | Tool/function calls, if any |
| choices[].message.reasoning_content | string/null | Reasoning/thinking content |
| choices[].logprobs | object/null | Token log probabilities, if requested |
| created | integer | Unix timestamp |
| model | string | Model identifier used |
| usage | object | Token usage statistics |
| usage.prompt_tokens | integer | Input token count |
| usage.completion_tokens | integer | Output token count |
| usage.total_tokens | integer | Total tokens used |
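A client usually branches on finish_reason. The sketch below maps the four documented values to an action; describeFinish is a hypothetical helper, and the suggested reactions in the comments are assumptions about typical client behavior.

```javascript
// Hypothetical dispatcher over the documented finish_reason values.
function describeFinish(choice) {
  switch (choice.finish_reason) {
    case 'stop':
      return 'complete'; // the model ended its answer normally
    case 'length':
      return 'truncated'; // hit max_tokens / max_completion_tokens; consider raising the limit
    case 'tool_calls':
      // The client should run the tools and send their results back.
      return `needs ${choice.message.tool_calls.length} tool call(s)`;
    case 'content_filter':
      return 'filtered'; // output was withheld by a content filter
    default:
      return `unknown (${choice.finish_reason})`;
  }
}
```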

Error Responses

{
  "error": {
    "message": "Error description",
    "type": "invalid_request_error",
    "code": 400
  }
}
Common error codes:
  • 400 - Invalid request format or parameters
  • 401 - Authentication failed
  • 404 - Function or model not found
  • 429 - Rate limit exceeded
  • 500 - Internal server error
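Of the codes above, 429 and 500 are usually transient, while 400, 401, and 404 point to a problem the client must fix. The sketch below reflects that split with exponential backoff. The retry policy, attempt count, and delays are assumptions chosen for illustration, not documented server behavior, and postWithRetry is a hypothetical helper.

```javascript
// Retry only on rate limiting (429) and server errors (5xx).
function isRetryable(status) {
  return status === 429 || status >= 500;
}

// Exponential backoff: 500 ms, 1 s, 2 s, ... capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical wrapper around fetch that surfaces the error body's message.
async function postWithRetry(url, init, maxAttempts = 4) {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url, init);
    if (response.ok) return response.json();
    // Error bodies have the shape {"error": {"message", "type", "code"}}.
    const body = await response.json().catch(() => ({}));
    const message = body.error?.message ?? `HTTP ${response.status}`;
    if (!isRetryable(response.status) || attempt + 1 >= maxAttempts) {
      throw new Error(`${response.status}: ${message}`);
    }
    await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
  }
}
```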

Supported providers

OpenAI

Offers advanced models for language, image generation, and audio conversion.

Anthropic

Anthropic is an AI safety and research company that’s working to build reliable, interpretable, and steerable AI systems.

Together.AI

Offers Llama, Falcon, Alpaca, and other chat and code LLMs, along with language and instruction models.

DeepSeek

Offers models such as DeepSeek LLM, Coder, Math, and VL for tasks ranging from coding and math reasoning to vision-language applications.

Fireworks AI

An inference engine for building production-ready, compound AI systems.

AWS Bedrock

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs).

AWS SageMaker

Supports JumpStart Hugging Face embedding models and Meta’s Llama series.

GCP Vertex AI

Designed for text generation and language understanding in chatbots and content automation.

Google AI Studio

Access Google’s Gemini Pro, 1.5 Pro, and Pro Vision models for multimodal AI.

Mistral AI

Provides embedding, coding, edge, math, image, and multimodal reasoning models.

Hyperbolic

High-performance inference platform for running open-source AI models at scale.

xAI

Provides access to xAI’s large language models, such as Grok, for text generation and other AI outputs.