If your eval loop is "I tried it once, it looked good," your prompt is one re-roll away from breaking — and you won't notice until someone else runs it.
You don't need an eval framework. You need three habits.
Three habits
- Run it five times. Same inputs, five generations. If the outputs vary in shape (not just wording), the prompt isn't done: it needs more constraints.
- Try the inputs you'd never use. Whatever your prompt is for, write three variations of the input that you'd consider edge cases — empty fields, conflicting values, the long-tail topic you weren't thinking about. Watch what falls over.
- Save the best output as a benchmark. When you tweak the prompt later, compare new outputs to that benchmark. "Better" without a benchmark is vibes; with one, it's a decision.
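The three habits can be wired into a few lines of Python. This is a minimal sketch, not a framework: `generate` is a hypothetical stand-in for whatever function calls your model, `shape_of` and `unstable_inputs` are illustrative names, and the shape labels (bullet list, numbered list, JSON, prose) are an assumption about which structures you care about.

```python
import json
import re
from typing import Callable, Dict, Iterable, Optional, Set


def shape_of(text: str) -> str:
    """Return a coarse structural label for one output, ignoring wording."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if stripped[0] in "{[":
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass  # looked like JSON but did not parse; fall through
    lines = [ln.strip() for ln in stripped.splitlines() if ln.strip()]
    if all(re.match(r"[-*•]\s+", ln) for ln in lines):
        return "bullet_list"
    if all(re.match(r"\d+[.)]\s+", ln) for ln in lines):
        return "numbered_list"
    return "prose"


def unstable_inputs(
    generate: Callable[[str], str],  # hypothetical: your model call goes here
    inputs: Iterable[str],           # include the edge cases you'd never use
    benchmark: Optional[str] = None, # the saved "best output", if you have one
    runs: int = 5,
) -> Dict[str, Set[str]]:
    """Map each problem input to the set of shapes it produced.

    An input is a problem if its runs disagree with each other, or if a
    benchmark is given and any run has a different shape from it.
    """
    expected = shape_of(benchmark) if benchmark is not None else None
    problems: Dict[str, Set[str]] = {}
    for text in inputs:
        shapes = {shape_of(generate(text)) for _ in range(runs)}
        if len(shapes) > 1 or (expected is not None and shapes != {expected}):
            problems[text] = shapes
    return problems
```

An empty result means every input produced one consistent shape across all runs. Anything else names the inputs to look at first. It only compares structure, so reading the outputs yourself is still part of the habit.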
What to actually look at
You're not grading prose. You're grading whether the output is:
- The right shape. Did it produce a list when you asked for a list?
- Internally consistent. Does the second half contradict the first?
- Specific to the input. Or is it the same generic answer regardless of what you fed in?
Most "good" prompts fail at least one of these. Catching that before someone else does is the whole job.
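Two of the checks above are easy to rough out in code. The sketch below covers "right shape" (for the list case) and "specific to the input". It leaves internal consistency to a human read, since that one is hard to test mechanically. The function names are illustrative, and the 0.9 similarity threshold is an assumption to tune, not a recommended value.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def is_list(output: str) -> bool:
    """True if every non-blank line is a bullet or numbered item."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    return bool(lines) and all(
        re.match(r"([-*•]|\d+[.)])\s+", ln) for ln in lines
    )


def generic_pairs(
    outputs_by_input: Dict[str, str],
    threshold: float = 0.9,  # assumed cutoff; tune it for your outputs
) -> List[Tuple[str, str]]:
    """Return pairs of inputs whose outputs are nearly identical.

    Different inputs producing near-duplicate text suggests the prompt
    is ignoring what you fed in.
    """
    pairs = []
    for (in_a, out_a), (in_b, out_b) in combinations(outputs_by_input.items(), 2):
        if SequenceMatcher(None, out_a, out_b).ratio() >= threshold:
            pairs.append((in_a, in_b))
    return pairs
```

Note that `is_list` rejects an output that opens with a preamble such as "Sure! Here are some ideas:" before the bullets. That strictness is deliberate in this sketch: if you asked for a list and got a list plus chatter, the shape is still wrong.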