# Evaluation Concepts

> Understand datasets, traits, experiments, and scoring workflows

## Core Objects

**Dataset**
A benchmark definition with prompts, expected behavior, metadata, and an estimated token footprint.

**Trait**
A capability lens (for example, reasoning, safety, or domain skill) used to organize datasets and filter dataset selection.

**Experiment**
A container for related evaluation runs so teams can compare iterations over time.

**Run**
One execution of an evaluation workflow against a selected model and a dataset/trait configuration.
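
To make the relationships between these objects concrete, the following is a minimal Python sketch of how they might be modeled. Every class and field name here (`Dataset`, `Run`, `Experiment`, `estimated_tokens`, and so on) is a hypothetical illustration derived from the definitions above, not the platform's actual schema or API.

```python
from dataclasses import dataclass, field

# Hypothetical data model for illustration only; not the platform's real schema.


@dataclass
class Dataset:
    name: str
    traits: list[str]       # trait names such as "reasoning" or "safety"
    estimated_tokens: int   # estimated token footprint of the benchmark
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class Run:
    model: str
    datasets: list[str]     # names of the datasets selected for this run
    status: str = "Draft"


@dataclass
class Experiment:
    name: str
    tags: list[str] = field(default_factory=list)
    runs: list[Run] = field(default_factory=list)

    def models(self) -> list[str]:
        """Distinct models evaluated in this experiment, in first-seen order."""
        return list(dict.fromkeys(run.model for run in self.runs))


def datasets_for_trait(datasets: list[Dataset], trait: str) -> list[Dataset]:
    """Select the datasets organized under a given trait."""
    return [d for d in datasets if trait in d.traits]
```

In this sketch, an experiment simply groups runs, which is what lets related iterations be compared over time.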

## Concept Map

```mermaid theme={null}
flowchart LR
    T[Trait] --> D[Dataset]
    D --> E[Experiment]
    E --> R[Run]
    R --> S[Scores]
    S --> L[Leaderboard/Explorer]
```

## Evaluation Surfaces

### Evaluations Hub

* Search datasets by name and intent.
* Filter by traits.
* Inspect modality badges and metadata links.

### Evaluation Detail

* **Details**: scope, context, and expected behavior.
* **Leaderboard**: ranked model comparison.
* **Evaluations Explorer**: prompt/response and metric-level evidence.

### Experiment Workspace

* List experiments with status, tags, models, and created date.
* Drill into run history and aggregate metrics.

## Status Lifecycle

```mermaid theme={null}
stateDiagram-v2
    [*] --> Draft
    Draft --> Running: submit run
    Running --> Completed: success
    Running --> Failed: error/timeout
    Failed --> Running: rerun
    Completed --> Running: new configuration
```
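
The lifecycle above can be read as a small transition table. The sketch below encodes it in Python as an illustration; the status and event strings are copied from the diagram labels, while `TRANSITIONS` and `next_status` are hypothetical names, not part of any real API.

```python
# Allowed transitions, copied from the state diagram above.
# Event names are the diagram's edge labels, not real API calls.
TRANSITIONS = {
    ("Draft", "submit run"): "Running",
    ("Running", "success"): "Completed",
    ("Running", "error/timeout"): "Failed",
    ("Failed", "rerun"): "Running",
    ("Completed", "new configuration"): "Running",
}


def next_status(current: str, event: str) -> str:
    """Return the status reached from `current` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(f"no transition from {current!r} on {event!r}") from None
```

For example, `next_status("Failed", "rerun")` returns `"Running"`, while `next_status("Draft", "success")` raises `ValueError` because a draft must be submitted before it can complete.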

## Score Interpretation

* Compare models only within the same dataset/trait context.
* Pair aggregate scores with Explorer evidence before making decisions.
* Use repeated runs to detect variance and regressions.
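
As one way to act on the repeated-runs advice above, the sketch below summarizes scores from several runs and flags a possible regression. It assumes scores have been collected offline into plain lists of floats. The function names and the threshold `k = 2.0` are illustrative assumptions, not platform behavior.

```python
from statistics import mean, stdev


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) for repeated-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(scores), stdev(scores)


def is_regression(baseline: list[float], candidate: list[float], k: float = 2.0) -> bool:
    """Flag a regression when the candidate mean falls more than k baseline
    standard deviations below the baseline mean. k = 2.0 is an arbitrary default."""
    base_mean, base_sd = summarize(baseline)
    cand_mean, _ = summarize(candidate)
    return cand_mean < base_mean - k * base_sd
```

A flag from a check like this is a prompt to inspect the underlying evidence, not a verdict. Both score lists should come from the same dataset/trait context.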

## Good Practices

<Check>Keep a stable baseline experiment for release comparisons.</Check>
<Check>Tag experiments consistently for filtering and auditability.</Check>
<Check>Inspect both trait-level and dataset-level views before promoting a model.</Check>
<Check>Export results for offline review when approvals are required.</Check>
