Overview

Use this guide to compare models fairly and produce decisions that are easy to defend to product and engineering teams.

Best-Practice Comparison Setup

  1. Use the same prompt set for all models.
  2. Keep parameter settings equivalent unless intentionally testing defaults.
  3. Score outputs with a fixed rubric.
  4. Repeat tests at least twice for consistency checks.
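
As a concrete starting point, here is a minimal harness sketch that applies all four rules: one shared prompt list, one fixed parameter set, and repeated runs per model-prompt pair. The model names, prompts, and the `call_model` adapter are placeholders, not a real API; swap in your provider's client.

```python
import time

# Shared inputs: every model sees the same prompts and the same parameters.
PROMPTS = [
    "Summarize this support ticket in two sentences: ...",
    "Extract the fields name, date, and total as JSON from: ...",
]
PARAMS = {"temperature": 0.2, "max_tokens": 512}  # one fixed setting for all
MODELS = ["model-a", "model-b"]                   # placeholder model IDs
RUNS_PER_PROMPT = 2                               # repeat for consistency


def call_model(model: str, prompt: str, **params) -> str:
    """Stub adapter so the harness runs end to end.

    Replace the body with your provider's client call.
    """
    return f"[{model} response to: {prompt[:30]}...]"


def run_comparison() -> list[dict]:
    results = []
    for model in MODELS:
        for prompt in PROMPTS:
            for run in range(RUNS_PER_PROMPT):
                start = time.perf_counter()
                output = call_model(model, prompt, **PARAMS)
                latency = time.perf_counter() - start
                results.append({
                    "model": model,
                    "prompt": prompt,
                    "run": run,
                    "output": output,
                    "latency_s": latency,
                })
    return results


if __name__ == "__main__":
    for row in run_comparison():
        print(row["model"], row["run"], f"{row['latency_s']:.3f}s")
```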

Suggested Rubric

Criterion              Question
Correctness            Is the response factually and logically correct?
Instruction Following  Did it obey format and constraints?
Tone                   Does it match the desired style/voice?
Latency                Is response speed acceptable for UX goals?
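
One way to apply the rubric is to have each scorer rate every criterion on a fixed scale and average the result. The sketch below assumes a 1-5 scale and equal weighting; both are conventions to adjust, not requirements.

```python
from statistics import mean

# Criteria mirror the rubric table above.
RUBRIC = {"correctness", "instruction_following", "tone", "latency"}


def score_output(scores: dict[str, int]) -> float:
    """Average one scorer's ratings; enforce full rubric coverage and scale."""
    assert set(scores) == RUBRIC, "score every criterion, and nothing else"
    assert all(1 <= v <= 5 for v in scores.values()), "ratings are 1-5"
    return mean(scores.values())


# Usage: one scored response.
print(score_output({
    "correctness": 5,
    "instruction_following": 4,
    "tone": 4,
    "latency": 3,
}))  # -> 4.0
```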

Anti-Patterns to Avoid

  • Comparing different prompts across models.
  • Changing multiple parameters at once.
  • Choosing solely by “most verbose” output.
  • Ignoring latency when UX is real-time.
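
The second anti-pattern is the easiest to commit by accident. One guard is to generate experiment configs that vary exactly one parameter against a fixed baseline, as in this sketch (the parameter names and values are illustrative):

```python
# Baseline settings stay fixed; each experiment changes exactly one knob.
BASELINE = {"temperature": 0.2, "top_p": 1.0, "max_tokens": 512}
SWEEPS = {
    "temperature": [0.0, 0.2, 0.7],
    "top_p": [0.9, 1.0],
}


def experiments():
    for param, values in SWEEPS.items():
        for value in values:
            config = dict(BASELINE)  # copy, then change a single parameter
            config[param] = value
            yield param, config


for param, config in experiments():
    print(f"varying {param}: {config}")
```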

Deliverable Template

Conclude with:
  • Winning model
  • Why it won (rubric summary)
  • Trade-offs and fallback option
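
If you collected scores with a harness like the one above, a short aggregation step can back the "why it won" summary with numbers. This sketch assumes each result row carries a model name and a precomputed rubric score:

```python
from collections import defaultdict
from statistics import mean


def summarize(results: list[dict]) -> dict[str, float]:
    """Mean rubric score per model across all runs and prompts."""
    by_model = defaultdict(list)
    for row in results:
        by_model[row["model"]].append(row["score"])
    return {model: mean(scores) for model, scores in by_model.items()}


# Usage with toy scores (replace with real harness output).
summary = summarize([
    {"model": "model-a", "score": 4.0},
    {"model": "model-a", "score": 4.5},
    {"model": "model-b", "score": 3.5},
])
winner = max(summary, key=summary.get)
print(f"Winning model: {winner} (mean rubric score {summary[winner]:.2f})")
```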