Evals for LLM apps: from vibes to numbers

If you've shipped an LLM feature, you know the pattern. The team does manual checks, the demos look fine, you ship it. Two weeks later, someone complains it gave a wrong answer. You change the prompt. Now another stakeholder complains a different thing got worse.

Without evals, you can't tell if changes are improvements. You're navigating by vibes.

Here's the eval setup I use in production. Not academic — it survives real teams.

The minimum viable eval harness

Three things, only three:

A frozen test set. 30–100 representative inputs with known-good outputs (or a rubric for what "good" looks like). Versioned in git.
A scoring function. Could be exact match, semantic similarity, or LLM-as-judge. Pick one and keep it consistent.
A runner. Anything that takes a commit, runs your LLM pipeline against the test set, outputs a score.

That's it. Build the whole thing in 200 lines of Python and a YAML file. Resist the urge to adopt a fancy eval platform until you actually need one.

Picking your test set

Common mistake: build the test set from happy-path examples. The team eyeballs 50 successful conversations and freezes them.

Don't do this. Your test set should over-represent failure modes:

Inputs at the edge of your scope (questions about adjacent topics).
Inputs in your scope but unusual phrasing.
Inputs that historically tripped up the model.
Inputs from real production logs, anonymized.
Adversarial inputs (jailbreak attempts, prompt injection).

A test set of 50 thoughtfully chosen inputs beats 1000 random ones.

The scoring function

For tasks with one right answer (classification, extraction): exact match or JSON schema validation.

For open-ended tasks (chat, summarization, RAG answers): LLM-as-judge with a clear rubric.

LLM-as-judge prompt template I use:

You are evaluating an AI assistant's response.

QUESTION: {question}
EXPECTED OUTCOME: {expected}
ACTUAL RESPONSE: {response}

Score on a 0-3 scale:
3 = Correct, complete, well-grounded.
2 = Mostly correct, minor missing detail OR one minor unsupported claim.
1 = Partially correct but contains material error or unsupported claim.
0 = Wrong, hallucinated, or improperly refuses.

Respond as JSON: {"score": 0-3, "reasoning": "..."}

Use a different (often cheaper) model as the judge than your application uses. Counterintuitively, gpt-4o-mini is a perfectly good judge for many tasks.

Reducing judge variance

LLM judges are noisy. Mitigations:

Run 3 times, average. Variance drops dramatically with even 3 samples.
Use temperature 0. Yes, even 0 is non-deterministic for some models, but lower variance.
Calibrate against humans periodically. Have an expert score 30 examples; check that the judge's scores correlate.

Wiring it into CI

- name: Run eval
  run: python evals/run.py --commit ${{ github.sha }} > eval_results.json

- name: Compare to baseline
  run: |
    python evals/compare.py \
      --current eval_results.json \
      --baseline evals/baseline.json \
      --threshold -0.05

The compare.py script exits non-zero if the score drops by more than the threshold (I usually start at -0.05). Branch policy: build must pass to merge.

This is the rule that actually moves the needle. Regressions don't ship.

What goes wrong if you don't have evals

The thing I see at every team that hasn't built evals:

Prompt changes are reverted, then re-applied, then reverted again, because no one knows what's better.
A model upgrade that should be a win is delayed by months because no one trusts it without "more testing."
Stakeholders disagree about quality and the team becomes paralyzed.
A regression slips into production and isn't noticed for weeks.

All of that is solved by 200 lines of Python and a frozen test set.

What evals don't replace

Production monitoring. Eval scores can be great while production users still hate the experience. Track real user signals.
Human review. For high-stakes domains (medical, legal, financial), evals are necessary but not sufficient. Keep humans in the loop.
Red-teaming. Adversarial testing is its own practice. Don't expect your standard eval set to catch novel jailbreaks.

When to graduate to a platform

Build your own harness until one of these is true:

You have many models and many pipelines and you need a registry.
You need user-facing eval reports for stakeholders or compliance.
You need to involve humans in scoring at scale.

Then look at platforms. Until then, plain Python beats every shiny tool.

The discipline

Evals don't make you smart about your LLM features. They make you less likely to be wrong when you make changes. That's worth far more than it sounds.

If you ship LLM features and don't have evals — that's the highest-leverage thing you can build this quarter.