# Creating Your First Evaluation

> Build a practical experiment and execute your first evaluation run

In this walkthrough, you create a reusable experiment and execute your first benchmark run against it.

## What You'll Build

A baseline experiment that:

1. Evaluates one model against selected traits and datasets.
2. Captures run metrics and status.
3. Produces results for leaderboard comparison.

```mermaid
flowchart LR
    A[Create Experiment] --> B[Choose Model]
    B --> C[Select Traits]
    C --> D[Select Datasets]
    D --> E[Run Evaluation]
    E --> F[Analyze + Rerun]
```

## Prerequisites

* Access to Bud AI Foundry with Evaluations enabled.
* At least one model or deployment available for testing.
* A clear evaluation goal (for example, a quality gate for a release candidate).

## Step 1: Create Experiment Container

1. Open **Evaluations → Experiments**.
2. Click **New experiment**.
3. Add a name, description, and tags.
4. Save the experiment.

## Step 2: Configure Evaluation Run

1. Open the experiment detail page.
2. Click **Run Evaluation**.
3. Choose:
   * Model target
   * Relevant traits
   * Datasets corresponding to those traits
4. Review the scope of your selection: confirm that the model, traits, and datasets are the ones you intend to evaluate.

<img src="https://mintcdn.com/budecosystem-b7b14df4/_F9ci6HtGgIHUL90/images/image-58.png?fit=max&auto=format&n=_F9ci6HtGgIHUL90&q=85&s=5791cfa3406279820b341f78bd5cfb07" alt="Image" width="1920" height="877" data-path="images/image-58.png" />

<img src="https://mintcdn.com/budecosystem-b7b14df4/_F9ci6HtGgIHUL90/images/image-59.png?fit=max&auto=format&n=_F9ci6HtGgIHUL90&q=85&s=4ac3879726f0087fa255c9e39009d7fa" alt="Image" width="1920" height="879" data-path="images/image-59.png" />
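
The scope review in step 4 can be pictured as a consistency check between traits and datasets. The following is a minimal sketch, not part of the product: `RunSelection`, `review_scope`, and every model, trait, and dataset name are hypothetical, and the actual selection happens in the UI.

```python
from dataclasses import dataclass, field


@dataclass
class RunSelection:
    """Hypothetical stand-in for the choices made when configuring a run."""
    model: str
    traits: list[str]
    # Maps each dataset name to the trait it is meant to measure (illustrative).
    datasets: dict[str, str] = field(default_factory=dict)


def review_scope(selection: RunSelection) -> list[str]:
    """Return human-readable problems found in the selection; empty means OK."""
    problems = []
    if not selection.model:
        problems.append("no model target selected")
    if not selection.traits:
        problems.append("no traits selected")
    # Every dataset should correspond to a selected trait.
    for dataset, trait in selection.datasets.items():
        if trait not in selection.traits:
            problems.append(f"dataset {dataset!r} maps to unselected trait {trait!r}")
    # Every selected trait should be covered by at least one dataset.
    covered = set(selection.datasets.values())
    for trait in selection.traits:
        if trait not in covered:
            problems.append(f"trait {trait!r} has no dataset")
    return problems


# Example values are made up for illustration.
selection = RunSelection(
    model="my-model-rc1",
    traits=["reasoning", "safety"],
    datasets={"dataset-a": "reasoning", "dataset-b": "coding"},
)
for problem in review_scope(selection):
    print(problem)
```

In this example the check would flag a dataset tied to an unselected trait and a trait with no dataset, which are the two mismatches worth catching before submitting.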

## Step 3: Launch and Observe

1. Submit the run.
2. Track the run's state transitions: queued, then running, then completed or failed.
3. On completion, review:
   * Overall benchmark summary
   * Metrics broken down by trait
   * Run timing and status fields
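
If you track runs from a script rather than the UI, the state transitions above can be modeled as a small state machine. This is only a sketch: the four state names come from this guide, but the allowed-transition table, the `wait_for_run` helper, and the `get_status` callable are assumptions, and no Bud API call is shown or implied.

```python
import time
from typing import Callable

# States named in this guide; the allowed transitions are an assumption.
TRANSITIONS = {
    "queued": {"running", "failed"},
    "running": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}
TERMINAL = {"completed", "failed"}


def wait_for_run(get_status: Callable[[], str],
                 poll_seconds: float = 5.0,
                 timeout_seconds: float = 3600.0) -> list[str]:
    """Poll until a terminal state; return the ordered list of states seen."""
    deadline = time.monotonic() + timeout_seconds
    history: list[str] = []
    while True:
        status = get_status()
        if status not in TRANSITIONS:
            raise ValueError(f"unknown status: {status!r}")
        if not history:
            history.append(status)
        elif status != history[-1]:
            # Reject jumps the transition table does not allow.
            if status not in TRANSITIONS[history[-1]]:
                raise ValueError(f"unexpected transition {history[-1]} -> {status}")
            history.append(status)
        if status in TERMINAL:
            return history
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Usage with a canned sequence standing in for a real status lookup.
fake_statuses = iter(["queued", "queued", "running", "completed"])
print(wait_for_run(lambda: next(fake_statuses), poll_seconds=0))
```

Recording the ordered history alongside the final state makes it easier to capture status and duration for each attempt.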

## Step 4: Validate with Dataset Evidence

1. Open the evaluated dataset page.
2. Check **Leaderboard** for rank context.
3. Open **Evaluations Explorer** to verify sample-level behavior.
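
Sample-level checks matter because an aggregate score can hide a concentrated failure. The sketch below illustrates the idea with made-up records; the field names (`id`, `category`, `passed`) and the `pass_rate_by_category` helper are hypothetical and do not reflect the Explorer's actual data format.

```python
from collections import defaultdict

# Hypothetical sample-level records; field names are illustrative only.
samples = [
    {"id": "s1", "category": "arithmetic", "passed": True},
    {"id": "s2", "category": "arithmetic", "passed": True},
    {"id": "s3", "category": "arithmetic", "passed": True},
    {"id": "s4", "category": "dates", "passed": False},
    {"id": "s5", "category": "dates", "passed": False},
    {"id": "s6", "category": "arithmetic", "passed": True},
]


def pass_rate_by_category(records):
    """Return {category: fraction of samples that passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        passes[record["category"]] += int(record["passed"])
    return {category: passes[category] / totals[category] for category in totals}


overall = sum(r["passed"] for r in samples) / len(samples)
print(f"overall pass rate: {overall:.2f}")
for category, rate in pass_rate_by_category(samples).items():
    print(f"{category}: {rate:.2f}")
```

Here a moderate overall pass rate conceals one category that passes every sample and another that fails every sample, which is the kind of pattern only sample-level review reveals.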

## Step 5: Iterate

* Rerun with different model configurations.
* Keep the same experiment for apples-to-apples comparisons.
* Use tags for milestone checkpoints like `rc1`, `rc2`, or `prod-candidate`.
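
Keeping reruns in one experiment pays off when you compare tagged runs trait by trait. As a rough illustration, assuming you have copied per-trait scores for two tagged runs into a dictionary (the scores, trait names, and `compare_runs` helper are all invented for this example):

```python
# Hypothetical per-trait scores for two tagged runs in the same experiment.
runs = {
    "rc1": {"reasoning": 0.71, "safety": 0.93, "coding": 0.64},
    "rc2": {"reasoning": 0.75, "safety": 0.90, "coding": 0.64},
}


def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate minus baseline for every trait present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {trait: round(candidate[trait] - baseline[trait], 4) for trait in shared}


for trait, delta in compare_runs(runs["rc1"], runs["rc2"]).items():
    print(f"{trait}: {delta:+.2f}")
```

Restricting the comparison to traits present in both runs keeps the deltas like-for-like, which only holds if the datasets behind each trait are also unchanged between runs.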

## Common First-Run Checklist

<Check>
  Selected model is correct and reachable.
</Check>

<Check>
  Traits match the capability you want to measure.
</Check>

<Check>
  Datasets align with domain and modality needs.
</Check>

<Check>
  Run status and duration are captured for each attempt.
</Check>

<Check>
  Final decision includes Explorer evidence, not only aggregate score.
</Check>
