Overview
Use this guide to compare models fairly and produce decisions that are easy to defend with product and engineering teams.
Best-Practice Comparison Setup
- Use the same prompt set for all models.
- Keep parameter settings equivalent unless intentionally testing defaults.
- Score outputs with a fixed rubric.
- Repeat tests at least twice for consistency checks.
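The setup rules above can be sketched as a small harness. This is a minimal illustration, not a real client: `generate` is a hypothetical stand-in for whatever model API you use, and the model names, prompts, and temperature are assumptions.

```python
# Hypothetical stand-in for a real model client call; replace with your
# actual API. Signature and behavior here are assumptions for illustration.
def generate(model_name: str, prompt: str, temperature: float = 0.2) -> str:
    return f"{model_name} answer to: {prompt}"  # placeholder output

PROMPTS = ["Summarize this ticket.", "Draft a refusal email."]  # same set for all models
MODELS = ["model-a", "model-b"]  # illustrative names
RUNS = 2  # repeat each test at least twice for consistency checks

def collect_outputs():
    results = {}
    for model in MODELS:
        for prompt in PROMPTS:
            # Identical prompts and parameters for every model.
            results[(model, prompt)] = [
                generate(model, prompt, temperature=0.2) for _ in range(RUNS)
            ]
    return results

outputs = collect_outputs()
```

Keeping the prompt set, parameters, and run count in one place makes it hard to accidentally vary them between models.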
Suggested Rubric
| Criterion | Question |
|---|---|
| Correctness | Is the response factually and logically correct? |
| Instruction Following | Did it obey format and constraints? |
| Tone | Does it match the desired style/voice? |
| Latency | Is response speed acceptable for UX goals? |
Anti-Patterns to Avoid
- Comparing different prompts across models.
- Changing multiple parameters at once.
- Choosing solely by “most verbose” output.
- Ignoring latency when UX is real-time.
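To avoid the last anti-pattern, measure latency alongside quality rather than eyeballing it. A minimal sketch, again using a hypothetical `generate` stand-in with simulated delay:

```python
import time

def generate(model_name: str, prompt: str) -> str:
    # Hypothetical stand-in for a real model call; the sleep simulates latency.
    time.sleep(0.01)
    return f"{model_name}: {prompt}"

def timed_generate(model_name: str, prompt: str):
    """Return (output, latency_seconds) so speed is recorded with every run."""
    start = time.perf_counter()
    output = generate(model_name, prompt)
    return output, time.perf_counter() - start

_, latency = timed_generate("model-a", "Summarize this ticket.")
```

Recording latency per run lets you check it against your UX budget in the same pass as quality scoring.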
Deliverable Template
Conclude with:
- Winning model
- Why it won (rubric summary)
- Trade-offs and fallback option
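The deliverable can be generated from the scoring data so the write-up stays consistent with the rubric. The field names below are assumptions that mirror the template:

```python
def render_decision(winner: str, rubric_summary: str,
                    trade_offs: str, fallback: str) -> str:
    """Format the comparison deliverable; fields mirror the template above."""
    return "\n".join([
        f"Winning model: {winner}",
        f"Why it won: {rubric_summary}",
        f"Trade-offs: {trade_offs}",
        f"Fallback option: {fallback}",
    ])

report = render_decision(
    winner="model-a",
    rubric_summary="highest average rubric score (4.1 vs 3.6)",
    trade_offs="slower p95 latency than model-b",
    fallback="model-b",
)
```

Generating the summary from the same data used for scoring keeps the decision traceable back to the rubric.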