
This walkthrough creates a reusable experiment and executes a first benchmark run.

What You’ll Build

A baseline experiment that:
  1. Evaluates one model against selected traits and datasets.
  2. Captures run metrics and status.
  3. Produces results for leaderboard comparison.
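The run described above can be pictured as a simple record. This is a minimal, hypothetical sketch of that shape; the field names and example values are illustrative and are not the product's actual schema or API:

```python
from dataclasses import dataclass, field

@dataclass
class BaselineRun:
    """Illustrative record of one benchmark run (hypothetical fields)."""
    model: str                                    # the model or deployment under test
    traits: list                                  # capabilities being measured
    datasets: list                                # datasets backing those traits
    metrics: dict = field(default_factory=dict)   # captured run metrics
    status: str = "queued"                        # lifecycle state

# Example: a reasoning baseline for a release candidate (names are made up).
run = BaselineRun(
    model="my-release-candidate",
    traits=["reasoning"],
    datasets=["gsm8k"],
)
```

Keeping metrics and status on the same record is what later makes leaderboard comparison and rerun tracking straightforward.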

Prerequisites

  • Access to Bud AI Foundry with Evaluations enabled.
  • At least one model or deployment available for testing.
  • A clear evaluation goal (for example, a quality gate for a release candidate).

Step 1: Create Experiment Container

  1. Open Evaluations → Experiments.
  2. Click New experiment.
  3. Add a name, description, and tags.
  4. Save.

Step 2: Configure Evaluation Run

  1. Open the experiment detail page.
  2. Click Run Evaluation.
  3. Choose:
    • Model target
    • Relevant traits
    • Datasets corresponding to those traits
  4. Review selection scope.
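The "datasets corresponding to those traits" requirement in step 3 can be sanity-checked before submitting. The sketch below assumes a hypothetical trait-to-dataset mapping; the trait and dataset names are examples, not the platform's catalog:

```python
# Hypothetical trait -> dataset mapping used to review selection scope.
TRAIT_DATASETS = {
    "reasoning": {"gsm8k", "arc-challenge"},
    "coding": {"humaneval", "mbpp"},
}

def check_scope(traits, datasets):
    """Return any selected datasets that don't back a selected trait."""
    allowed = set().union(*(TRAIT_DATASETS.get(t, set()) for t in traits))
    return [d for d in datasets if d not in allowed]

# A coding dataset accidentally included in a reasoning-only run:
out_of_scope = check_scope(["reasoning"], ["gsm8k", "humaneval"])
```

An empty result means every selected dataset maps back to a selected trait, which is the state you want before launching.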

Step 3: Launch and Observe

  1. Submit the run.
  2. Track state transitions (queued/running/completed/failed).
  3. On completion, review:
    • Overall benchmark summary
    • Current metrics by trait
    • Run timing and status fields
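The state transitions in step 2 follow a small lifecycle: a run moves from queued to running, then ends in completed or failed. A minimal sketch of that lifecycle (the transition table is an assumption inferred from the states listed above, not a documented state machine):

```python
# Hypothetical lifecycle for tracking run state transitions.
TRANSITIONS = {
    "queued": {"running", "failed"},
    "running": {"completed", "failed"},
    "completed": set(),  # terminal
    "failed": set(),     # terminal
}

def advance(state, new_state):
    """Move to new_state, rejecting transitions the lifecycle doesn't allow."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

# A successful run walks queued -> running -> completed.
state = "queued"
for nxt in ("running", "completed"):
    state = advance(state, nxt)
```

Encoding the terminal states explicitly makes it easy to know when to stop polling and start reviewing results.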

Step 4: Validate with Dataset Evidence

  1. Open the evaluated dataset page.
  2. Check Leaderboard for rank context.
  3. Open Evaluations Explorer to verify sample-level behavior.

Step 5: Iterate

  • Rerun with different model configurations.
  • Keep the same experiment for apples-to-apples comparisons.
  • Use tags for milestone checkpoints like rc1, rc2, or prod-candidate.

Common First-Run Checklist

  • Selected model is correct and reachable.
  • Traits match the capability you want to measure.
  • Datasets align with domain and modality needs.
  • Run status and duration are captured for each attempt.
  • Final decision includes Explorer evidence, not only the aggregate score.
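The checklist above can be enforced mechanically before trusting a first run. A small sketch, assuming you track ticked items as a set (item strings are shorthand for the checklist entries, not product fields):

```python
# Hypothetical pre-flight checklist: every item should be ticked before
# acting on a first run's results.
CHECKLIST = [
    "model correct and reachable",
    "traits match target capability",
    "datasets match domain and modality",
    "status and duration captured",
    "explorer evidence reviewed",
]

def outstanding(ticked):
    """Return the checklist items not yet confirmed."""
    return [item for item in CHECKLIST if item not in ticked]

missing = outstanding({"model correct and reachable",
                       "traits match target capability"})
```

An empty `outstanding` result is the signal that the first-run results are ready to inform a decision.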