Goal

Set up a repeatable experiment to choose the best model for a customer-support summarization use case.

Step 1: Define Evaluation Criteria

Score each output on a few simple dimensions:
  • Accuracy (0-5)
  • Instruction adherence (0-5)
  • Clarity/formatting (0-5)
  • Response time (fast/medium/slow)
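The rubric above can be sketched as a small data structure so scores are recorded consistently across runs. This is a minimal sketch; the field names and the equal weighting in `total()` are assumptions, not part of the original workflow.

```python
from dataclasses import dataclass

@dataclass
class Score:
    accuracy: int   # 0-5
    adherence: int  # 0-5 (instruction adherence)
    clarity: int    # 0-5 (clarity/formatting)
    latency: str    # "fast" | "medium" | "slow"

    def total(self) -> int:
        # Latency is tracked separately as a category; the total
        # sums only the three 0-5 dimensions, unweighted.
        return self.accuracy + self.adherence + self.clarity

s = Score(accuracy=4, adherence=5, clarity=4, latency="fast")
print(s.total())  # 13
```

Recording each run as a `Score` makes it easy to compare totals later, while keeping latency visible as its own column.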

Step 2: Create Baseline Prompt

Example prompt:
Summarize the following support ticket in exactly 3 bullets:
- Problem
- Business impact
- Recommended next action
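To keep the baseline prompt identical across every run, it helps to store it once as a template and interpolate the ticket text. The template string below reproduces the example prompt; the `{ticket}` placeholder and the sample ticket are illustrative assumptions.

```python
# Baseline prompt kept as a single reusable template so every run
# sends byte-identical instructions.
BASELINE_PROMPT = """Summarize the following support ticket in exactly 3 bullets:
- Problem
- Business impact
- Recommended next action

Ticket:
{ticket}"""

prompt = BASELINE_PROMPT.format(
    ticket="Checkout fails for EU customers since 09:00 UTC."
)
print(prompt)
```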

Step 3: Run Across Two Models

  1. Open two chat panes.
  2. Bind each pane to a different model.
  3. Send the same baseline prompt.
  4. Capture outputs and latency observations.
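If you prefer scripting the comparison instead of using two chat panes, the loop below shows the shape of it: same prompt, two models, outputs and latency captured side by side. `call_model` is a hypothetical stand-in for your provider's API call, and the model names are placeholders.

```python
import time

def call_model(model: str, prompt: str) -> str:
    # Placeholder for a real API call to your provider; it echoes
    # the input here so the harness runs without credentials.
    return f"[{model}] summary of: {prompt[:40]}"

def run_comparison(models, prompt):
    results = {}
    for model in models:
        start = time.perf_counter()
        output = call_model(model, prompt)
        elapsed = time.perf_counter() - start
        # Capture both the output and a latency observation per model.
        results[model] = {"output": output, "latency_s": round(elapsed, 3)}
    return results

results = run_comparison(["model-a", "model-b"], "Summarize ticket #123 ...")
for model, r in results.items():
    print(model, r["latency_s"], r["output"][:60])
```

Swapping the echo body of `call_model` for a real request is the only change needed to run this against live models.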

Step 4: Tune Parameters

Adjust one variable at a time:
  • Temperature
  • Max response length
  • Stop conditions
Re-run the comparison after each change and compare against the previous outputs.
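The one-variable-at-a-time rule can be made mechanical: hold a fixed baseline of parameters and generate run configurations that differ in exactly one knob. The baseline values below are assumptions for illustration.

```python
# Assumed baseline parameters; only one key is varied per sweep.
BASE_PARAMS = {"temperature": 0.3, "max_tokens": 256, "stop": ["\n\n"]}

def sweep(param: str, values):
    # Each run copies the baseline and overrides exactly one parameter,
    # so any output difference is attributable to that single change.
    return [{**BASE_PARAMS, param: v} for v in values]

for params in sweep("temperature", [0.0, 0.3, 0.7]):
    print(params)
```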

Step 5: Save the Winner

  1. Keep the best conversation in history.
  2. Note final prompt and parameter values.
  3. Share results with deployment owners before production rollout.
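Step 5 amounts to persisting the winning configuration in a shareable form. A minimal sketch, assuming a JSON file is acceptable to your deployment owners; the model name, filename, and score values are placeholders.

```python
import json

# Hypothetical winning configuration: the model, the final prompt
# version, the tuned parameters, and the rubric scores that justified it.
winner = {
    "model": "model-a",
    "prompt_version": "baseline-v1",
    "params": {"temperature": 0.3, "max_tokens": 256},
    "scores": {"accuracy": 4, "adherence": 5, "clarity": 4, "latency": "fast"},
}

with open("winner.json", "w") as f:
    json.dump(winner, f, indent=2)

with open("winner.json") as f:
    print(json.load(f)["model"])  # model-a
```

A versioned record like this gives deployment owners the exact prompt and settings needed to reproduce the result before rollout.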

Expected Outcome

At the end of this workflow, you should have:
  • A validated prompt template
  • A preferred model choice for the task
  • Reproducible settings for follow-up testing