> ## Documentation Index
> Fetch the complete documentation index at: https://docs.budecosystem.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Creating Your First Evaluation

> Build a practical experiment and execute your first evaluation run

This walkthrough creates a reusable experiment and executes a first benchmark run.

## What You'll Build

A baseline experiment that:

1. Evaluates one model against selected traits and datasets.
2. Captures run metrics and status.
3. Produces results for leaderboard comparison.

```mermaid theme={null}
flowchart LR
  A[Create Experiment] --> B[Choose Model]
  B --> C[Select Traits]
  C --> D[Select Datasets]
  D --> E[Run Evaluation]
  E --> F[Analyze + Rerun]
```

## Prerequisites

* Access to Bud AI Foundry with Evaluations enabled.
* At least one model or deployment available for testing.
* Clear evaluation goal (for example: quality gate for release candidate).

## Step 1: Create Experiment Container

1. Open **Evaluations → Experiments**.
2. Click **New experiment**.
3. Add name, description, and tags.
4. Save.

## Step 2: Configure Evaluation Run

1. Open the experiment detail page.
2. Click **Run Evaluation**.
3. Choose:
   * Model target
   * Relevant traits
   * Datasets corresponding to those traits
4. Review selection scope.
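The "review selection scope" step in Step 2 can be rehearsed offline before touching the UI. The sketch below is a planning aid only and does not call any Bud AI Foundry API. `RunSelection`, `review_scope`, and every model, trait, and dataset name in it are hypothetical, and it assumes each dataset is meant to measure exactly one trait.

```python
from dataclasses import dataclass, field


@dataclass
class RunSelection:
    """Hypothetical record of what one evaluation run will cover."""

    model: str
    traits: set[str]
    # Maps each selected dataset name to the trait it is meant to measure.
    datasets: dict[str, str] = field(default_factory=dict)


def review_scope(selection: RunSelection) -> dict[str, list[str]]:
    """Report traits with no dataset and datasets tied to an unselected trait."""
    covered = set(selection.datasets.values())
    return {
        "traits_without_datasets": sorted(selection.traits - covered),
        "datasets_without_trait": sorted(
            name
            for name, trait in selection.datasets.items()
            if trait not in selection.traits
        ),
    }


# Placeholder names for illustration only.
selection = RunSelection(
    model="candidate-model",
    traits={"reasoning", "safety"},
    datasets={"math-set": "reasoning", "code-set": "coding"},
)
report = review_scope(selection)
print(report)
```

With this input, the report flags `safety` as a trait with no dataset and `code-set` as a dataset whose trait was not selected. Either mismatch is worth fixing before submitting the run.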

## Step 3: Launch and Observe

1. Submit the run.
2. Track state transitions (queued/running/completed/failed).
3. On completion, review:
   * Overall benchmark summary
   * Current metrics by trait
   * Run timing and status fields

## Step 4: Validate with Dataset Evidence

1. Open the evaluated dataset page.
2. Check **Leaderboard** for rank context.
3. Open **Evaluations Explorer** to verify sample-level behavior.

## Step 5: Iterate

* Rerun with different model configurations.
* Keep the same experiment for apples-to-apples comparisons.
* Use tags for milestone checkpoints like `rc1`, `rc2`, or `prod-candidate`.

## Common First-Run Checklist

* Selected model is correct and reachable.
* Traits match the capability you want to measure.
* Datasets align with domain and modality needs.
* Run status and duration are captured for each attempt.
* Final decision includes Explorer evidence, not only aggregate score.
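
If you keep your own log of attempts alongside the platform, the run states from Step 3 and the "status and duration captured for each attempt" checklist item could be modeled roughly as follows. This is a sketch under stated assumptions: the transition table is a guess at a reasonable lifecycle rather than documented platform behavior, `RunAttempt` and `advance` are hypothetical names, and the durations are made-up placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RunState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition table; the platform's actual rules may differ.
ALLOWED = {
    RunState.QUEUED: {RunState.RUNNING, RunState.FAILED},
    RunState.RUNNING: {RunState.COMPLETED, RunState.FAILED},
    RunState.COMPLETED: set(),
    RunState.FAILED: set(),
}


@dataclass
class RunAttempt:
    """Hypothetical local record of one run, keyed by a milestone tag."""

    tag: str
    state: RunState = RunState.QUEUED
    duration_s: Optional[float] = None

    def advance(self, new_state: RunState, duration_s: Optional[float] = None) -> None:
        """Move to new_state, rejecting transitions the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        self.state = new_state
        # Record duration only once the run reaches a terminal state.
        if not ALLOWED[new_state]:
            self.duration_s = duration_s


# Placeholder durations, in seconds, for illustration only.
rc1 = RunAttempt(tag="rc1")
rc1.advance(RunState.RUNNING)
rc1.advance(RunState.COMPLETED, duration_s=312.5)

rc2 = RunAttempt(tag="rc2")
rc2.advance(RunState.RUNNING)
rc2.advance(RunState.FAILED, duration_s=41.0)

for attempt in (rc1, rc2):
    print(attempt.tag, attempt.state.value, attempt.duration_s)
```

Recording failed attempts with the same fields as completed ones keeps the experiment history comparable across reruns.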