Use this guide to quickly diagnose issues while running evaluation experiments.
Quick Triage Flow
Dataset Discovery Issues
No datasets shown in Evaluations Hub
Possible causes
- Search query is too restrictive.
- Trait filters exclude all results.
Fixes
- Clear search input.
- Remove all trait filters.
- Reapply filters one by one.
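The "reapply filters one by one" step can be pictured with a small sketch. Everything here is hypothetical: the dataset records, the trait names, and the assumption that trait filters combine with AND (a dataset must carry every selected trait). The Evaluations Hub may combine filters differently.

```python
# Hypothetical dataset records; names and trait labels are illustrative only.
datasets = [
    {"name": "qa-small", "traits": {"helpfulness"}},
    {"name": "dialog-set", "traits": {"safety"}},
]

def matches(dataset, required_traits):
    """Assumption: filters are ANDed, so every selected trait must be present."""
    return required_traits <= dataset["traits"]

def first_filter_that_empties_results(datasets, filters):
    """Apply filters one at a time and return the first one that leaves no results."""
    applied = set()
    for trait in filters:
        applied.add(trait)
        if not any(matches(d, applied) for d in datasets):
            return trait
    return None  # no filter combination emptied the list

culprit = first_filter_that_empties_results(datasets, ["helpfulness", "safety"])
```

In this toy data no dataset carries both traits, so adding the second filter is what empties the list.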
Run Launch Issues
Run Evaluation button does not complete a run
Possible causes
- Required model or dataset selection is missing.
- Selected configuration is invalid for the chosen scope.
Fixes
- Reopen run form and verify all selections.
- Start with one trait and one dataset.
- Retry with a known-good model target.
Result Interpretation Issues
Leaderboard has no useful comparison
Possible causes
- Too few completed runs.
- Models were evaluated on different scopes (traits/datasets).
Fixes
- Rerun candidates on the same traits/datasets.
- Keep all comparisons in one experiment.
Explorer data appears inconsistent with score
Possible causes
- Sampling differences across runs.
- Score is aggregate while Explorer is row-level.
Fixes
- Review multiple rows, not a single sample.
- Rerun to confirm consistency.
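A small numeric sketch may help show how an aggregate score and row-level data can both be correct yet look inconsistent. The per-row scores are made up, and the mean is used as a stand-in aggregate; the actual aggregation used for scores is not specified here.

```python
from statistics import mean

# Hypothetical per-row scores for one run (1.0 = pass, 0.0 = fail).
row_scores = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]

# Assumption: the reported score is the mean over all rows.
aggregate = mean(row_scores)

# A single row can disagree sharply with the aggregate.
single_row = row_scores[6]

# Two different samples of the same rows can also yield different scores.
sample_a = mean(row_scores[:5])
sample_b = mean(row_scores[5:])
```

With this data the aggregate is 0.8, one inspected row is 0.0, and the two halves score 1.0 and 0.6, which is why reviewing several rows and rerunning gives a more reliable picture than a single sample.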
Experiment Management Issues
Hard to locate the right experiment
Fixes
- Use standardized tags and naming.
- Sort by creation date and filter by status/model.
Too many failed runs
Fixes
- Reduce scope (fewer traits/datasets) to isolate failure.
- Rerun incrementally after each configuration change.
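The scope-reduction idea can be sketched as trying each trait and dataset pair on its own and recording which pairs fail. The `run_succeeds` callable is a hypothetical stand-in for launching a run and checking its final status; no real launch API is implied.

```python
from itertools import product

def isolate_failures(traits, datasets, run_succeeds):
    """Try each (trait, dataset) pair alone and return the pairs that fail."""
    return [
        (trait, dataset)
        for trait, dataset in product(traits, datasets)
        if not run_succeeds(trait, dataset)
    ]

# Stand-in for a real run: here any run touching "broken-set" fails.
def fake_run(trait, dataset):
    return dataset != "broken-set"

failing = isolate_failures(
    ["trait-1", "trait-2"],
    ["good-set", "broken-set"],
    fake_run,
)
```

In this toy case every failing pair involves the same dataset, which points at the dataset rather than the traits or the model.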
Escalation Checklist
Before escalating internally, collect:
- Experiment name and run timestamp.
- Model, traits, and datasets selected.
- Observed status and screenshots of key tabs.
- Whether issue reproduces after rerun.
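One way to make sure nothing on the checklist is forgotten is to capture it in a small record before filing. The class below is a hypothetical sketch: the field names mirror the checklist items and are not a required format, and the sample values are placeholders.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class EscalationReport:
    experiment_name: str
    run_timestamp: str
    model: str
    traits: List[str]
    datasets: List[str]
    observed_status: str
    screenshots: List[str] = field(default_factory=list)
    reproduces_after_rerun: Optional[bool] = None  # None = rerun not attempted yet

    def missing_items(self):
        """Return the names of checklist fields that are still empty."""
        return [
            name
            for name, value in asdict(self).items()
            if value in ("", [], None)
        ]

# Placeholder values for illustration only.
report = EscalationReport(
    experiment_name="example-experiment",
    run_timestamp="2024-01-01T00:00:00Z",
    model="model-a",
    traits=["trait-1"],
    datasets=["good-set"],
    observed_status="failed",
)
missing = report.missing_items()
```

In this example the report still lacks screenshots and a rerun result, so both would be flagged before escalation.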