Troubleshooting - Bud Stack Documentation

Use this guide to quickly diagnose issues while running evaluation experiments.

Quick Triage Flow

Dataset Discovery Issues

No datasets shown in Evaluations Hub

Possible causes

Search query is too restrictive.
Trait filters exclude all results.

Fixes

Clear search input.
Remove all trait filters.
Reapply filters one by one.
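Reapplying filters one at a time shows which filter empties the list. The sketch below models that process under two assumptions this guide does not state: each dataset is tagged with a set of traits, and selected trait filters combine with AND. The catalog contents and the `first_emptying_filter` helper are hypothetical and are not part of the Evaluations Hub.

```python
from typing import Optional

# Hypothetical catalog: dataset name -> set of traits it covers (illustrative only).
catalog = {
    "dataset-1": {"safety", "bias"},
    "dataset-2": {"reasoning"},
    "dataset-3": {"safety", "reasoning"},
}


def first_emptying_filter(catalog: dict[str, set[str]], traits: list[str]) -> Optional[str]:
    """Apply trait filters one at a time (assumed AND semantics).

    Returns the first trait after which no dataset remains, or None if
    some datasets survive every filter.
    """
    remaining = set(catalog)
    for trait in traits:
        remaining = {name for name in remaining if trait in catalog[name]}
        if not remaining:
            return trait
    return None


print(first_emptying_filter(catalog, ["safety", "reasoning", "bias"]))  # bias
```

In this example "safety" and "reasoning" still leave dataset-3, so "bias" is the filter to drop or adjust.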

Run Launch Issues

Run Evaluation button does not complete a run

Possible causes

Required model or dataset selection is missing.
Selected configuration is invalid for the chosen scope.

Fixes

Reopen run form and verify all selections.
Start with one trait and one dataset.
Retry with a known-good model target.

Result Interpretation Issues

Leaderboard has no useful comparison

Possible causes

Too few completed runs.
Models were evaluated on different scopes (different traits or datasets).

Fixes

Rerun candidates on the same traits/datasets.
Keep all comparisons in one experiment.
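A leaderboard comparison is only meaningful on the scope every candidate shares. As a minimal sketch, the hypothetical helpers below take completed runs as (model, dataset, score) tuples and keep only the datasets that every model was evaluated on; traits can be handled the same way. The run records, names, and scores are illustrative, not output from Bud Stack.

```python
from collections import defaultdict

# Hypothetical completed runs: (model, dataset, score). Values are illustrative only.
runs = [
    ("model-a", "dataset-1", 0.82),
    ("model-a", "dataset-2", 0.74),
    ("model-b", "dataset-1", 0.79),
    ("model-b", "dataset-3", 0.91),
]


def shared_scope(runs: list[tuple[str, str, float]]) -> set[str]:
    """Datasets that every model has a completed run on."""
    by_model: dict[str, set[str]] = defaultdict(set)
    for model, dataset, _ in runs:
        by_model[model].add(dataset)
    return set.intersection(*by_model.values())


def comparable_scores(runs: list[tuple[str, str, float]]) -> dict[str, dict[str, float]]:
    """Per-model scores restricted to the shared scope, so comparisons line up."""
    scope = shared_scope(runs)
    table: dict[str, dict[str, float]] = defaultdict(dict)
    for model, dataset, score in runs:
        if dataset in scope:
            table[model][dataset] = score
    return dict(table)


print(shared_scope(runs))       # {'dataset-1'}
print(comparable_scores(runs))  # {'model-a': {'dataset-1': 0.82}, 'model-b': {'dataset-1': 0.79}}
```

Here dataset-2 and dataset-3 drop out because only one model covered each; rerunning both candidates on the same datasets widens the comparable scope.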

Explorer data appears inconsistent with score

Possible causes

Sampling differences across runs.
The score is an aggregate over all rows, while Explorer shows individual rows.

Fixes

Review multiple rows, not a single sample.
Rerun to confirm consistency.
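To see why a single Explorer row can disagree with the score, consider a run whose score is the fraction of rows that passed. That aggregation rule is an assumption for illustration; this guide does not specify how the score is computed. The row values and the `aggregate_score` and `sample_score` helpers are hypothetical.

```python
# Hypothetical row-level outcomes for one run (1 = row passed, 0 = row failed).
# Assumption: the run's score is the simple mean of these values; the actual
# aggregation used by the Evaluations Hub may differ.
run_rows = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]


def aggregate_score(rows: list[int]) -> float:
    """Mean of per-row outcomes, i.e. the fraction of rows that passed."""
    return sum(rows) / len(rows)


def sample_score(rows: list[int], indices: list[int]) -> float:
    """Score computed from only the selected rows, as if reviewing a few in Explorer."""
    picked = [rows[i] for i in indices]
    return sum(picked) / len(picked)


print(f"aggregate score: {aggregate_score(run_rows):.2f}")            # 0.80
print(f"one row (#3): {sample_score(run_rows, [2]):.2f}")             # 0.00
print(f"five rows: {sample_score(run_rows, [0, 1, 2, 3, 4]):.2f}")    # 0.80
```

A single row can only be 0 or 1, so on its own it says little about a 0.80 score; reviewing several rows gives a fairer picture.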

Experiment Management Issues

Hard to locate the right experiment

Fixes

Use standardized tags and naming.
Sort by creation date and filter by status/model.

Too many failed runs

Fixes

Reduce scope (fewer traits/datasets) to isolate failure.
Rerun incrementally after each configuration change.
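Reducing scope and rerunning incrementally amounts to growing the scope one item at a time until a run fails. The sketch below assumes a hypothetical `run_evaluation` function that reports whether a run completed; it is a stand-in that fails whenever a dataset named `dataset-broken` is included, not a real Bud Stack API.

```python
from typing import Optional


def run_evaluation(model: str, datasets: list[str]) -> bool:
    """Hypothetical stand-in for launching a run; True means the run completed.

    For illustration it fails whenever 'dataset-broken' is in scope.
    """
    return "dataset-broken" not in datasets


def find_first_failing(model: str, datasets: list[str]) -> Optional[str]:
    """Grow the scope one dataset at a time; return the dataset whose addition first fails."""
    scope: list[str] = []
    for dataset in datasets:
        scope.append(dataset)
        if not run_evaluation(model, scope):
            return dataset
    return None


candidates = ["dataset-1", "dataset-2", "dataset-broken", "dataset-3"]
print(find_first_failing("model-a", candidates))  # dataset-broken
```

Each step changes only one thing, so the first failure points at the dataset (or, by the same approach, the trait) that introduced it.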

Escalation Checklist

Before escalating internally, collect:

Experiment name and run timestamp.
Model, traits, and datasets selected.
Observed status and screenshots of key tabs.
Whether the issue reproduces after a rerun.
