Use this guide to quickly diagnose issues while running evaluation experiments.

Quick Triage Flow

Dataset Discovery Issues

No datasets shown in Evaluations Hub

Possible causes
  • Search query is too restrictive.
  • Trait filters exclude all results.
Fixes
  1. Clear the search input.
  2. Remove all trait filters.
  3. Reapply filters one at a time to find the one that excludes every dataset.
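
The "reapply one at a time" step can be pictured as a small loop. The following is a minimal sketch, not part of the product: the dataset records, the filter names, and the function `first_emptying_filter` are all hypothetical and only illustrate the reasoning behind the fix.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical shapes: a dataset is a plain dict, a filter is (label, predicate).
Dataset = Dict[str, object]
NamedFilter = Tuple[str, Callable[[Dataset], bool]]


def first_emptying_filter(
    datasets: List[Dataset], filters: List[NamedFilter]
) -> Optional[str]:
    """Apply filters cumulatively and return the label of the first filter
    after which no datasets remain, or None if some datasets survive."""
    remaining = list(datasets)
    for label, keep in filters:
        remaining = [d for d in remaining if keep(d)]
        if not remaining:
            return label
    return None


# Illustrative data only; these names do not come from the Evaluations Hub.
catalog = [
    {"name": "dataset-a", "traits": {"helpfulness"}},
    {"name": "dataset-b", "traits": {"helpfulness", "safety"}},
]
active_filters = [
    ("trait: helpfulness", lambda d: "helpfulness" in d["traits"]),
    ("trait: honesty", lambda d: "honesty" in d["traits"]),
]

print(first_emptying_filter(catalog, active_filters))
```

In this made-up catalog the second filter is the one that empties the list, which is the filter you would relax or remove in the UI.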

Run Launch Issues

Run Evaluation button does not complete a run

Possible causes
  • Required model or dataset selection is missing.
  • Selected configuration is invalid for the chosen scope.
Fixes
  1. Reopen the run form and verify that every required selection (model, dataset) is filled in.
  2. Start with one trait and one dataset.
  3. Retry with a known-good model target.
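
If you track run configurations outside the UI, the first fix can be expressed as a pre-launch check. This is a sketch under assumptions: the keys `model`, `traits`, and `datasets` and the function `validate_run_config` are hypothetical and are not the product's actual configuration format.

```python
from typing import Dict, List


def validate_run_config(config: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means no selection is missing.

    The required keys below are assumptions made for this sketch.
    """
    problems = []
    for key in ("model", "traits", "datasets"):
        # A missing key, None, an empty string, or an empty list all count as missing.
        if not config.get(key):
            problems.append(f"missing selection: {key}")
    return problems


# Smallest useful scope: one trait and one dataset (names are illustrative).
minimal_config = {
    "model": "known-good-model",
    "traits": ["helpfulness"],
    "datasets": ["dataset-a"],
}
print(validate_run_config(minimal_config))
```

Starting from a minimal configuration like this and adding selections back gradually makes it easier to see which selection the launch rejects.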

Result Interpretation Issues

Leaderboard has no useful comparison

Possible causes
  • Too few completed runs.
  • Models were evaluated on different scopes (different traits or datasets), so their scores are not directly comparable.
Fixes
  1. Rerun all candidate models on the same traits and datasets.
  2. Keep all comparisons in one experiment.
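
To check whether candidates actually share a scope, you can intersect the trait/dataset pairs each model was run on. The sketch below assumes run records exported as dicts with hypothetical `model`, `trait`, and `dataset` keys; the real export format may differ.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def common_scope(runs: List[Dict[str, str]]) -> Set[Tuple[str, str]]:
    """Return the (trait, dataset) pairs that every model has been run on."""
    covered = defaultdict(set)
    for run in runs:
        covered[run["model"]].add((run["trait"], run["dataset"]))
    if not covered:
        return set()
    # Only pairs present for all models support a fair comparison.
    return set.intersection(*covered.values())


# Illustrative run records; model and dataset names are made up.
completed_runs = [
    {"model": "model-a", "trait": "helpfulness", "dataset": "dataset-1"},
    {"model": "model-a", "trait": "safety", "dataset": "dataset-2"},
    {"model": "model-b", "trait": "helpfulness", "dataset": "dataset-1"},
]
print(common_scope(completed_runs))
```

An empty result means no pair is shared by all models, which matches the "different scopes" cause above; any pair missing for a model is a candidate for a rerun.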

Explorer data appears inconsistent with score

Possible causes
  • Sampling differences across runs.
  • The score is an aggregate over many rows, while Explorer shows individual rows.
Fixes
  1. Review multiple rows, not a single sample.
  2. Rerun the evaluation to confirm that the score is consistent.
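
A small numeric example shows why a single Explorer row can disagree with the score. It assumes, purely for illustration, that the score is the mean of per-row scores; the actual aggregation used by the product is not stated in this guide.

```python
from statistics import mean
from typing import Dict, List


def summarize(row_scores: List[float]) -> Dict[str, float]:
    """Summarize row-level scores the way an aggregate view might (mean assumed)."""
    return {
        "aggregate": mean(row_scores),
        "lowest_row": min(row_scores),
        "highest_row": max(row_scores),
    }


# Eight hypothetical row-level scores: six passes and two failures.
rows = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
summary = summarize(rows)
print(summary)
```

Here the aggregate is 0.75 even though two individual rows score 0.0, so inspecting only one of those rows would give a misleading impression of the run.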

Experiment Management Issues

Hard to locate the right experiment

Fixes
  • Use standardized tags and naming.
  • Sort by creation date and filter by status/model.

Too many failed runs

Fixes
  • Reduce the scope (fewer traits or datasets) to isolate the failure.
  • Rerun incrementally after each configuration change.
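
The scope-reduction approach amounts to running one trait/dataset pair at a time and recording which pairs fail. The sketch below assumes a hypothetical callable `run_one(trait, dataset)` that returns `True` on success; it stands in for however you launch a single run and is not a product API.

```python
from itertools import product
from typing import Callable, List, Tuple


def isolate_failures(
    traits: List[str],
    datasets: List[str],
    run_one: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Run each (trait, dataset) pair on its own and collect the failing pairs."""
    failing = []
    for trait, dataset in product(traits, datasets):
        if not run_one(trait, dataset):
            failing.append((trait, dataset))
    return failing


def fake_run(trait: str, dataset: str) -> bool:
    """Stand-in runner for this example: only one combination fails."""
    return (trait, dataset) != ("safety", "dataset-2")


print(isolate_failures(["helpfulness", "safety"], ["dataset-1", "dataset-2"], fake_run))
```

Running pairs individually costs more runs, but each failure then points at exactly one trait and one dataset.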

Escalation Checklist

Before escalating internally, collect:
  • Experiment name and run timestamp.
  • Model, traits, and datasets selected.
  • Observed status and screenshots of key tabs.
  • Whether the issue reproduces after a rerun.
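
If your team files escalations as structured records, the checklist above maps naturally onto a small data class. This is only a sketch: the class `EscalationReport`, its field names, and the sample values are hypothetical and chosen to mirror the checklist items.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class EscalationReport:
    """One record per escalation; fields mirror the checklist above."""
    experiment_name: str
    run_timestamp: str
    model: str
    traits: List[str]
    datasets: List[str]
    observed_status: str
    screenshots: List[str] = field(default_factory=list)
    reproduces_after_rerun: Optional[bool] = None  # None means "not checked yet"

    def missing_items(self) -> List[str]:
        """Return the names of fields that are still empty or unknown."""
        return [
            name
            for name, value in asdict(self).items()
            if value is None or value == "" or value == []
        ]


# Illustrative values only.
report = EscalationReport(
    experiment_name="example-experiment",
    run_timestamp="2024-01-01T12:00:00Z",
    model="model-a",
    traits=["helpfulness"],
    datasets=["dataset-1"],
    observed_status="failed",
)
print(report.missing_items())
```

In this example the report still lacks screenshots and the rerun result, so those two items should be collected before escalating.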