

Comparison Framework

1) Keep Comparisons Fair

  • Use the same set of traits and datasets for each model.
  • Group runs under one experiment for traceability.
  • Avoid comparing scores across unrelated datasets.
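
A minimal sketch of what a shared configuration can look like in practice, using plain Python data structures. The experiment name, trait names, dataset IDs, and model names below are placeholders for illustration, not part of any real API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComparisonConfig:
    """One fixed configuration shared by every model in the experiment."""
    experiment: str
    traits: tuple[str, ...]
    datasets: tuple[str, ...]

# The same config object is reused for every run, so scores stay comparable.
CONFIG = ComparisonConfig(
    experiment="summarization-shootout",       # hypothetical experiment name
    traits=("faithfulness", "fluency"),        # hypothetical traits
    datasets=("news-dev", "support-tickets"),  # hypothetical dataset IDs
)

MODELS = ["model-a", "model-b", "model-c"]     # placeholder model names

for model in MODELS:
    # Only the model under test changes; everything else is held constant.
    print(f"run: experiment={CONFIG.experiment} model={model} "
          f"traits={CONFIG.traits} datasets={CONFIG.datasets}")
```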

2) Use Leaderboard for Ranking

The Leaderboard helps you identify top-performing models quickly:
  • Compare relative ordering.
  • Look for score gaps, not only rank position.
  • Re-check runs with small score differences.
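
To make the "score gaps, not only rank position" point concrete, here is a small sketch that sorts hypothetical leaderboard scores, prints the gap to the next model down, and flags pairs whose difference falls under an arbitrary re-check threshold. The scores and threshold are invented for illustration:

```python
RECHECK_THRESHOLD = 0.02  # arbitrary; tune to your score scale

# Hypothetical leaderboard scores (model -> aggregate score).
scores = {"model-a": 0.84, "model-b": 0.83, "model-c": 0.71}

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for rank, (model, score) in enumerate(ranked, start=1):
    if rank < len(ranked):
        gap = score - ranked[rank][1]  # distance to the next model down
        note = "  <- re-check: gap is small" if gap < RECHECK_THRESHOLD else ""
        print(f"{rank}. {model}: {score:.2f} (gap to next: {gap:.2f}){note}")
    else:
        print(f"{rank}. {model}: {score:.2f}")
```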

3) Use Explorer for Qualitative Validation

After ranking, inspect sample-level outputs:
  • Validate prompt understanding.
  • Check response consistency.
  • Confirm failures are acceptable for your use case.
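
Sample-level review is largely manual, but tallying your verdicts helps make the "failures are acceptable" call explicit. A rough sketch, assuming you have already annotated a handful of sampled outputs by hand; the labels and counts are illustrative only:

```python
from collections import Counter

# Hypothetical reviewer annotations for sampled outputs of one model.
annotations = ["ok", "ok", "minor-issue", "critical", "ok", "minor-issue"]

tally = Counter(annotations)
critical_rate = tally["critical"] / len(annotations)

print(tally)
print(f"critical error rate: {critical_rate:.0%}")
# Decide whether this failure profile is acceptable for your use case.
```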

4) Track Operational Signals

Include non-score context from run history:
  • Run duration
  • Completion/failure frequency
  • Trait-level variance across reruns
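
These operational signals can be summarized from run history with a few lines of standard-library Python. This sketch assumes a simple list of rerun records per model; the field names and numbers are invented for illustration:

```python
from statistics import mean, pstdev

# Hypothetical rerun history for one model: status, duration (s), trait score.
runs = [
    {"status": "completed", "duration_s": 310, "faithfulness": 0.82},
    {"status": "completed", "duration_s": 295, "faithfulness": 0.80},
    {"status": "failed",    "duration_s": 40,  "faithfulness": None},
    {"status": "completed", "duration_s": 305, "faithfulness": 0.81},
]

completed = [r for r in runs if r["status"] == "completed"]
failure_rate = 1 - len(completed) / len(runs)
avg_duration = mean(r["duration_s"] for r in completed)
trait_spread = pstdev(r["faithfulness"] for r in completed)  # variance across reruns

print(f"failure rate: {failure_rate:.0%}")
print(f"avg completed duration: {avg_duration:.0f}s")
print(f"faithfulness std dev across reruns: {trait_spread:.3f}")
```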

Decision Matrix

Signal            | Strong Candidate Indicator
Leaderboard score | High and stable across reruns
Explorer quality  | Fewer critical errors on key samples
Run reliability   | Completed runs with low failure rate
Trait coverage    | Good results across required traits

If two models are close on score, prioritize the one with more stable outputs and lower operational risk.
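
To make that tie-break rule concrete, here is a tiny sketch that compares two candidates: if their leaderboard scores are within a small margin, it prefers the one with lower rerun variance and lower failure rate. All names, numbers, and the margin are assumptions for illustration, not output of any real tool:

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    score: float          # leaderboard score
    rerun_stddev: float   # trait-level spread across reruns
    failure_rate: float   # share of failed runs

CLOSE_MARGIN = 0.02  # arbitrary: treat scores within this range as "close"

def pick(a: Candidate, b: Candidate) -> Candidate:
    if abs(a.score - b.score) > CLOSE_MARGIN:
        return a if a.score > b.score else b
    # Scores are close: prefer stability and reliability over raw score.
    return min((a, b), key=lambda c: (c.rerun_stddev, c.failure_rate))

winner = pick(
    Candidate("model-a", score=0.84, rerun_stddev=0.030, failure_rate=0.10),
    Candidate("model-b", score=0.83, rerun_stddev=0.008, failure_rate=0.00),
)
print(winner.name)  # model-b: close on score, but more stable and reliable
```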