## Endpoint

## Authentication

## Request Body

### Required Fields
| Field | Type | Description |
|---|---|---|
| model | string | Model identifier: a deployment name, adapter name, or routing target. |
| messages | array | Array of message objects forming the conversation. |
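
Only these two fields are needed for a minimal request body. A sketch in Python (the model name `my-model` is a placeholder; substitute your own deployment or adapter name):

```python
import json

# Minimal request body: only the required fields.
# "my-model" is a placeholder; use your own deployment/adapter name.
payload = {
    "model": "my-model",
    "messages": [
        {"role": "user", "content": "Hello!"},
    ],
}

print(json.dumps(payload, indent=2))
```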
### Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
| temperature | float | Model default | Sampling temperature (0.0 to 2.0) |
| max_tokens | integer | Model default | Maximum tokens to generate |
| max_completion_tokens | integer | Model default | Alternative to max_tokens (OpenAI compatibility) |
| top_p | float | Model default | Nucleus sampling parameter |
| frequency_penalty | float | 0.0 | Penalize repeated tokens (-2.0 to 2.0) |
| presence_penalty | float | 0.0 | Penalize tokens based on presence (-2.0 to 2.0) |
| seed | integer | null | Random seed for reproducibility |
| stream | boolean | false | Enable streaming response |
| stream_options | object | null | Streaming configuration |
| logprobs | boolean | false | Return token log probabilities |
| response_format | object | null | Output format control |
| tools | array | null | Available tool/function definitions |
| tool_choice | string/object | "auto" | Tool selection strategy |
| parallel_tool_calls | boolean | true | Allow parallel tool calls |
### Additional Fields
| Field | Type | Description |
|---|---|---|
| chat_template | string | Custom chat template |
| chat_template_kwargs | object | Template parameters (e.g., {"enable_thinking": true}) |
| mm_processor_kwargs | object | Multi-modal processor parameters |
| guided_json | object | JSON schema for guided generation |
| guided_regex | string | Regex pattern for guided generation |
| guided_choice | array | List of allowed values |
| guided_grammar | string | Grammar for guided generation |
| structural_tag | string | Structural generation tag |
| guided_decoding_backend | string | Backend for guided decoding |
| guided_whitespace_pattern | string | Whitespace pattern for guided generation |
## Message Object Format
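
A message object carries a `role` and `content`; `content` may be a plain string or an array of content blocks. A sketch (the typed-block shape follows the OpenAI convention and is illustrative):

```python
# Two equivalent ways to express text content in a message object.
# String form:
simple_message = {"role": "user", "content": "What is the capital of France?"}

# Content-block form (a list of typed blocks, following the OpenAI convention):
block_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is the capital of France?"},
    ],
}
```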
### Content Block Types
Text Content:

## Response Format
### Standard Response
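
An illustrative (non-normative) response body, assembled from the fields documented under Response Fields; identifiers and token counts are placeholders:

```json
{
  "id": "0192e5b1-0000-0000-0000-000000000000",
  "episode_id": "0192e5b1-0000-0000-0000-000000000001",
  "created": 1730000000,
  "model": "my-model",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "total_tokens": 21
  }
}
```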
### Streaming Response
When `stream: true`, the endpoint returns Server-Sent Events (SSE):
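
An illustrative stream: each event is a `data:` line carrying a JSON chunk with incremental `delta` content, and the terminal event is `data: [DONE]` (following the OpenAI streaming convention):

```
data: {"id":"0192e5b1-0000-0000-0000-000000000000","choices":[{"index":0,"delta":{"role":"assistant","content":"Hel"}}]}

data: {"id":"0192e5b1-0000-0000-0000-000000000000","choices":[{"index":0,"delta":{"content":"lo!"}}]}

data: [DONE]
```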
## Usage Examples
### Basic Chat Completion
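
A minimal sketch in Python. The base URL, API key, and model name are placeholders, and the `/v1/chat/completions` path is an assumption based on the OpenAI-compatible shape of this API; check your deployment's endpoint:

```python
import json
import urllib.request

# Placeholders -- substitute your server URL, API key, and model name.
BASE_URL = "http://localhost:8000"
API_KEY = "sk-example"

payload = {
    "model": "my-model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about the sea."},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}

request = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",  # path assumed; confirm for your deployment
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
)
# Uncomment against a live server:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     print(body["choices"][0]["message"]["content"])
```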
### With Tool/Function Calling
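
A sketch of a request carrying a tool definition. The `get_weather` function is made up for illustration; the `tools` schema follows the OpenAI function-calling format:

```python
import json

# Hypothetical tool definition, for illustration only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

payload = {
    "model": "my-model",  # placeholder
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",        # let the model decide whether to call a tool
    "parallel_tool_calls": True,  # allow multiple calls in a single turn
}
print(json.dumps(payload)[:60])
```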
### With Guided JSON Generation
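
A sketch using `guided_json` to constrain the output to a JSON schema; the schema contents are illustrative:

```python
# Illustrative JSON schema the generated output must conform to.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

payload = {
    "model": "my-model",  # placeholder
    "messages": [
        {"role": "user", "content": "Describe a fictional person as JSON."}
    ],
    "guided_json": person_schema,
}
```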
### With Streaming
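
A sketch of a streaming request plus a small parser for the SSE events, demonstrated offline on canned `data:` lines (the `include_usage` option name is assumed from the OpenAI convention):

```python
import json

payload = {
    "model": "my-model",  # placeholder
    "messages": [{"role": "user", "content": "Count to three."}],
    "stream": True,
    "stream_options": {"include_usage": True},  # option name assumed (OpenAI convention)
}

def iter_deltas(sse_lines):
    """Yield content deltas from the 'data:' lines of an SSE stream."""
    for line in sse_lines:
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # terminal event
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Offline demonstration with canned events:
sample = [
    'data: {"choices":[{"index":0,"delta":{"content":"One, "}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"two, three."}}]}',
    "data: [DONE]",
]
text = "".join(iter_deltas(sample))
print(text)  # One, two, three.
```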
### With Multi-modal Content
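
A sketch of a multi-modal message mixing a text block and an image block. The `image_url` block type follows the OpenAI convention, and the URL is a placeholder:

```python
payload = {
    "model": "my-model",  # placeholder
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    # Placeholder URL for illustration.
                    "image_url": {"url": "https://example.com/cat.png"},
                },
            ],
        }
    ],
}
```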
## Response Fields
| Field | Type | Description |
|---|---|---|
| id | string | Unique inference ID (UUID) |
| episode_id | string | Episode ID for grouped inferences |
| choices | array | Array of completion choices (usually 1) |
| choices[].index | integer | Choice index (always 0 for single completion) |
| choices[].finish_reason | string | Reason for completion: stop, length, content_filter, tool_calls |
| choices[].message | object | Generated message object |
| choices[].message.role | string | Always "assistant" |
| choices[].message.content | string/null | Generated text content |
| choices[].message.tool_calls | array | Tool/function calls, if any |
| choices[].message.reasoning_content | string/null | Reasoning/thinking content |
| choices[].logprobs | object/null | Token log probabilities, if requested |
| created | integer | Unix timestamp |
| model | string | Model identifier used |
| usage | object | Token usage statistics |
| usage.prompt_tokens | integer | Input token count |
| usage.completion_tokens | integer | Output token count |
| usage.total_tokens | integer | Total tokens used |
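
Reading the documented fields out of a parsed response; the response dict below is a canned example matching the table, with placeholder IDs and counts:

```python
# Canned response shaped after the documented fields (placeholder values).
response = {
    "id": "0192e5b1-0000-0000-0000-000000000000",
    "episode_id": "0192e5b1-0000-0000-0000-000000000001",
    "created": 1730000000,
    "model": "my-model",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {"role": "assistant", "content": "Paris.", "tool_calls": []},
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}

choice = response["choices"][0]
answer = choice["message"]["content"]
total = response["usage"]["total_tokens"]
print(answer, total)  # Paris. 15
```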
## Error Responses
| Status | Meaning |
|---|---|
| 400 | Invalid request format or parameters |
| 401 | Authentication failed |
| 404 | Function or model not found |
| 429 | Rate limit exceeded |
| 500 | Internal server error |
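
A minimal client-side handling sketch keyed on the status codes above; which codes you treat as retryable is a policy choice, shown here as an assumption:

```python
# Treating rate limits and server errors as retryable is an assumption,
# not something the API mandates.
RETRYABLE = {429, 500}

STATUS_MESSAGES = {
    400: "Invalid request format or parameters",
    401: "Authentication failed",
    404: "Function or model not found",
    429: "Rate limit exceeded",
    500: "Internal server error",
}

def describe(status: int) -> str:
    """Map a status code to its documented meaning."""
    return STATUS_MESSAGES.get(status, "Unexpected status")

def should_retry(status: int) -> bool:
    """Decide whether a failed request is worth retrying."""
    return status in RETRYABLE

print(describe(429), should_retry(429))  # Rate limit exceeded True
```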
## Supported Providers
### OpenAI

Offers advanced models for language, image generation, and audio conversion.

### Anthropic

Anthropic is an AI safety and research company that’s working to build reliable, interpretable, and steerable AI systems.

### Together.AI

Offers Llama, Falcon, Alpaca, and other chat and code LLMs, along with language/instruction models.

### Deepseek

Advanced models such as DeepSeek LLM, Coder, Math, and VL for tasks ranging from coding to math reasoning and vision-language applications.

### Fireworks AI

A fast, efficient inference engine for building production-ready, compound AI systems.

### AWS Bedrock

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs).

### AWS SageMaker

Supports JumpStart Hugging Face embedding models and Meta’s Llama series.

### GCP Vertex AI

Designed for text generation and language understanding in chatbots and content automation.

### Google AI Studio

Access Google’s Gemini Pro, 1.5 Pro, and Pro Vision models for multimodal AI.

### Mistral AI

Provides embedding, coding, edge, math, image, and multimodal reasoning models.

### Hyperbolic

High-performance inference platform for running open-source AI models at scale.

### xAI

xAI’s inference provider gives access to its large language models, such as Grok, for generating text and other AI outputs.