Does your prompt actualy work?
Testing a prompt on one example is like shipping code with one unit test.
Here's something you might find controversial:
A prompt that nails the example you tried is not a good prompt. It's a prompt that nails one example.
Wait... what???
LLMs are non-deterministic and input-sensitive. The same prompt can be great on short inputs and fall apart on long ones, edge cases, weird formats, or empty fields.
So how can you influence the results? I'm glad you asked :)
By building a small dataset of test cases before judging any prompt change. For example, mine usually look like this:
- 5 typical inputs
- 5 edge cases (too long, too short, ambiguous, wrong language)
- 5 adversarial ones (input tries to override instructions)
If a "better" prompt wins on 1 example but loses on 4 edge cases, it's worse. You just can't see it without the dataset.
I didn't know these techniques even existed until I did Anthropic's AI architect courses.
Have you ever tried running test cases before trusting a prompt?