Documentation Index

Fetch the complete documentation index at: https://budecosystem-b7b14df4.mintlify.app/llms.txt

Use this file to discover all available pages before exploring further.

Core Objects

  • Dataset: a benchmark definition with prompts, expected behavior, metadata, and an estimated token footprint.
  • Trait: a capability lens (for example reasoning, safety, or domain skill) used to organize datasets and filter selection.
  • Experiment: a container for related evaluation runs so teams can compare iterations over time.
  • Run: one execution of an evaluation workflow against a selected model + dataset/trait configuration.

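The four core objects above can be sketched as a minimal data model. This is an illustrative sketch only: the class names mirror the glossary, but every field name and type here is an assumption, not the platform's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """A benchmark definition (fields are illustrative assumptions)."""
    name: str
    prompts: list[str]
    traits: list[str]          # capability lenses, e.g. "reasoning", "safety"
    estimated_tokens: int = 0  # estimated token footprint

@dataclass
class Run:
    """One execution of an evaluation workflow."""
    model: str
    dataset: str
    status: str = "pending"
    scores: dict[str, float] = field(default_factory=dict)

@dataclass
class Experiment:
    """A container for related runs, compared over time."""
    name: str
    tags: list[str]
    runs: list[Run] = field(default_factory=list)
```

The one-to-many shape (Experiment holds Runs; Runs reference a Dataset by name) follows directly from the definitions above.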
Concept Map

Evaluation Surfaces

Evaluations Hub

  • Search datasets by name and intent.
  • Filter by traits.
  • Inspect modality badges and metadata links.
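The search-and-filter behavior listed above can be approximated with a small helper. The dataset records and their field names (`name`, `intent`, `traits`) are hypothetical stand-ins, not the hub's real data shape.

```python
def filter_datasets(datasets, query="", traits=None):
    """Return datasets whose name or intent matches `query` and that
    carry all of the requested traits (field names are assumptions)."""
    required = set(traits or [])
    matches = []
    for d in datasets:
        text = (d["name"] + " " + d.get("intent", "")).lower()
        if query.lower() in text and required <= set(d.get("traits", [])):
            matches.append(d)
    return matches
```

Filtering by trait narrows the catalog to one capability lens before name-based search is applied; combining both mirrors the hub's search box plus trait filters.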

Evaluation Detail

  • Details: scope, context, and expected behavior.
  • Leaderboard: ranked model comparison.
  • Evaluations Explorer: prompt/response and metric-level evidence.

Experiment Workspace

  • List experiments with status, tags, models, and created date.
  • Drill into run history and aggregate metrics.
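Aggregating metrics over a run history, as in the drill-down above, might look like the sketch below. The run records are a hypothetical shape (dicts with `status` and `scores`), chosen for illustration.

```python
from statistics import mean

def aggregate_metrics(runs):
    """Average each metric across completed runs only
    (run dicts with `status` and `scores` are an assumed shape)."""
    by_metric = {}
    for run in runs:
        if run.get("status") != "completed":
            continue
        for metric, value in run.get("scores", {}).items():
            by_metric.setdefault(metric, []).append(value)
    return {m: mean(vals) for m, vals in by_metric.items()}
```

Skipping non-completed runs keeps failed or in-flight executions from dragging down the aggregate view.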

Status Lifecycle

Score Interpretation

  • Compare models within the same dataset/trait context.
  • Pair aggregate scores with Explorer evidence before decisions.
  • Use repeated runs to detect variance and regressions.
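The variance check in the last bullet can be sketched as follows; the `max_spread` tolerance is an arbitrary illustrative value, not a platform default.

```python
from statistics import mean, stdev

def check_repeated_runs(scores, max_spread=0.05):
    """Summarize repeated runs of one model + dataset/trait configuration.

    Returns the mean score and an instability flag when run-to-run
    standard deviation exceeds `max_spread` (tolerance is an assumption)."""
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": mean(scores), "unstable": spread > max_spread}
```

A stable configuration should show a tight spread across repeats; a widening spread, or a drop in the mean against earlier runs, is the regression signal to investigate in the Explorer.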

Good Practices

  • Keep a stable baseline experiment for release comparisons.
  • Tag experiments consistently for filtering and auditability.
  • Inspect both trait-level and dataset-level views before promoting a model.
  • Export results for offline review when approvals are required.
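An offline export like the one described above could be as simple as writing run results to CSV. The column set here is an assumption for illustration; use whatever fields your approval process actually reviews.

```python
import csv

def export_results(runs, path):
    """Write run result dicts to CSV for offline review
    (the column names are illustrative assumptions)."""
    columns = ["model", "dataset", "status", "score"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        for run in runs:
            writer.writerow(run)
```

A flat CSV keeps the export readable in spreadsheet tools, which is usually what an approval sign-off needs.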