I stopped guessing whether my prompting was any good and started scoring it

Prompt evaluation intro.

June 17, 2026 · 1 min read

My prompting process used to be: tweak the prompt, look at one or two outputs, decide it "looks better", and move on.

Then, after learning more about how AI works under the hood, I started evaluating my prompts.
This is my loop:

My first baseline (average score) was embarrassing: 2.32/10 on a prompt I thought was fine.
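A minimal sketch of how a baseline like this can be computed, assuming a small hand-written test set and a 0-10 grader. The names `evaluate`, `keyword_grader`, and `echo_model` are hypothetical stand-ins for illustration, not a specific library's API; in practice the model call would hit a real LLM and the grader might be another model or a rubric.

```python
from statistics import mean
from typing import Callable


def evaluate(
    prompt: str,
    cases: list[dict],
    model: Callable[[str], str],
    grader: Callable[[str, dict], float],
) -> tuple[float, list[dict]]:
    """Run every test case through the prompt and return (average, per-case results)."""
    results = []
    for case in cases:
        # Fill the prompt template with this case's input and call the model.
        output = model(prompt.format(input=case["input"]))
        results.append({"id": case["id"], "score": grader(output, case)})
    return mean(r["score"] for r in results), results


def keyword_grader(output: str, case: dict) -> float:
    """Hypothetical grader: 10 if the expected keyword appears in the output, else 0."""
    return 10.0 if case["expected"].lower() in output.lower() else 0.0


def echo_model(text: str) -> str:
    """Hypothetical stand-in for an LLM call: returns the prompt text unchanged."""
    return text


# Illustrative test cases, invented for this sketch.
cases = [
    {"id": "refund", "input": "refund please", "expected": "refund"},
    {"id": "empty", "input": "", "expected": "clarify"},
]

average, per_case = evaluate("Reply to: {input}", cases, echo_model, keyword_grader)
print(average)  # 5.0
```

Keeping the per-case results next to the average is what makes it possible to tell which prompt change moved which case.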

Two iterations later, the average had climbed to 7.86/10. And I knew exactly which change caused which jump.

The biggest surprise wasn't the score; it was the per-case failures. The prompt didn't fail randomly. It failed on the same 3 types of input every time.
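One way to surface that kind of pattern is to tag each test case with an input type and average the scores per tag. The sketch below assumes each result carries a `category` label; the function name, the category names, the scores, and the 5.0 threshold are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def weak_categories(results: list[dict], threshold: float = 5.0) -> dict[str, float]:
    """Return the average score of every category that falls below the threshold."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result["category"]].append(result["score"])
    return {
        category: round(mean(scores), 2)
        for category, scores in grouped.items()
        if mean(scores) < threshold
    }


# Illustrative per-case results, not real measurements.
results = [
    {"category": "long_input", "score": 2.0},
    {"category": "long_input", "score": 3.0},
    {"category": "plain_question", "score": 9.0},
    {"category": "ambiguous_request", "score": 1.0},
    {"category": "plain_question", "score": 8.0},
]

weak = weak_categories(results)
print(weak)  # {'long_input': 2.5, 'ambiguous_request': 1.0}
```

A single overall average hides this: a prompt can score well on most inputs and still fail an entire category every time.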

Of course, I don't do this every time, because not every use case needs prompt evaluation. But I do it when I need very good outputs from my AI agents.