I stopped guessing whether my prompting was any good and started scoring it

Prompt evaluation intro.

My prompting process used to be: tweak the prompt, look at one or two outputs, decide it "looks better", and move on.

Then, after learning more about how AI works under the hood, I started evaluating my prompts.
This is my loop:

  • Write the prompt as a template with variables.
  • Build 5–10 test cases (inputs + what a good output looks like).
  • Run the prompt on all of them, score each output 0–10.
  • Average the score.
  • Improve the prompt. Re-run. Compare.
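
As a rough illustration, the loop above could be sketched in Python like this. Everything here is an assumption made for the example: the prompt text, the test cases, the `run_model` stand-in (which just echoes the input instead of calling a real model), and the keyword-based `grade` function. In practice you would swap in your own model call and a grader that fits your task.

```python
from statistics import mean
from string import Template

# Hypothetical prompt, written as a template with one variable.
PROMPT = Template("Summarize the following support ticket in one sentence:\n$ticket")

# Hypothetical test cases: an input plus a rough description of a good output,
# expressed here as keywords the output should mention.
TEST_CASES = [
    {"ticket": "App crashes when I upload a PNG over 5 MB.",
     "must_mention": ["crash", "upload"]},
    {"ticket": "I was charged twice this month.",
     "must_mention": ["charged", "refund"]},
    {"ticket": "Password reset email never arrives.",
     "must_mention": ["password", "email"]},
]


def run_model(prompt):
    """Stand-in for a real model call: returns the last line of the prompt."""
    return prompt.splitlines()[-1]


def grade(output, must_mention):
    """Toy grader: share of required keywords found in the output, scaled to 0-10."""
    hits = sum(1 for keyword in must_mention if keyword.lower() in output.lower())
    return 10 * hits / len(must_mention)


def evaluate(template, cases):
    """Run the prompt on every test case; return the average and per-case scores."""
    scores = []
    for case in cases:
        prompt = template.substitute(ticket=case["ticket"])
        output = run_model(prompt)
        scores.append(grade(output, case["must_mention"]))
    return mean(scores), scores


average, per_case = evaluate(PROMPT, TEST_CASES)
print(f"average: {average:.2f}/10")
print("per case:", per_case)
```

To compare prompt versions, call `evaluate` again with a revised template on the same test cases and look at both the average and the per-case scores.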

My first baseline (average score) was embarrassing: 2.32/10 on a prompt I thought was fine.

Two iterations later, the average was up to 7.86/10. And I knew exactly which change caused which jump.

The biggest surprise wasn't the score; it was the per-case failures. The prompt didn't fail randomly. It failed on the same 3 types of input every time.
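
One way to make that kind of pattern visible is to tag each test case with an input type and average the scores per type. The sketch below assumes made-up category names and made-up scores, purely for illustration; they are not my actual results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-case results: (input type, score from 0 to 10).
results = [
    ("short input", 9),
    ("short input", 8),
    ("long input", 3),
    ("long input", 2),
    ("non-English input", 1),
    ("ambiguous request", 8),
]


def scores_by_category(rows):
    """Group scores by input type and return the average score for each type."""
    grouped = defaultdict(list)
    for category, score in rows:
        grouped[category].append(score)
    return {category: mean(scores) for category, scores in grouped.items()}


by_category = scores_by_category(results)

# Print the weakest input types first, since those are the ones to fix.
for category, avg in sorted(by_category.items(), key=lambda item: item[1]):
    print(f"{category:<20} {avg:.1f}")
```

Sorting from lowest to highest puts the input types that fail consistently at the top, which is usually where the next prompt change should focus.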

Of course, I don't do this every time, because not every use case needs prompt evaluation. But I do it when I need very good outputs from my AI agents.