AI Evals (Part A): Evaluating LLM Correctness

When people talk about whether an LLM "got it right," they usually collapse three very different failures into one word. For anyone building or evaluating an AI product, separating them matters — because each fails for different reasons, gets caught with different methods, and gets fixed with different levers.

This is the Correctness cluster. Here are its three failure modes.

1. Hallucination — inventing what isn't there

A hallucination is when the model states something that isn't supported by any source — it fabricates.

Why it happens

An LLM is a next-token predictor, not a fact-lookup engine. A few root causes stack up:

Training objective. It's optimized to produce plausible text, not true text. Fluency is not accuracy.
Lossy memory. Facts are compressed into weights, not stored verbatim. Gaps get filled with statistically likely guesses.
No built-in "I don't know" reflex. The model almost always produces something; uncertainty rarely surfaces as abstention.
Missing context. In a retrieval (RAG) setup, if retrieval misses, the model invents to fill the hole. Much of what looks like hallucination is actually an upstream retrieval failure.

How to catch it

Use an LLM judge for groundedness, backed by human spot-checks.

The judge doesn't trust the answer blindly — it checks the answer against the retrieved context:

Groundedness check. Feed the judge the retrieved context plus the generated answer, and ask: "Is every claim here supported by the context?" Unsupported claims get flagged.
Claim decomposition. Strong judges break the answer into atomic claims and verify each separately, so a single fabricated fact in an otherwise-correct answer still gets caught.
Verdict. Output a label (supported / unsupported / partial) or a score you can threshold and track.
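
The three steps above can be sketched as a small roll-up function. This is a minimal illustration, not a reference implementation: the names groundedness_verdict and toy_judge are hypothetical, claim decomposition is assumed to have happened already, and the judge is a naive substring match standing in for a real LLM call.

```python
from typing import Callable, Dict, List


def groundedness_verdict(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Check each atomic claim against the context and roll up a verdict."""
    flagged = [c for c in claims if not is_supported(c, context)]
    # Score = fraction of claims the judge found supported.
    # Assumption: an answer with no claims is treated as trivially supported.
    score = 1.0 - len(flagged) / len(claims) if claims else 1.0
    if not flagged:
        label = "supported"
    elif len(flagged) == len(claims):
        label = "unsupported"
    else:
        label = "partial"
    return {"label": label, "score": score, "unsupported_claims": flagged}


def toy_judge(claim: str, context: str) -> bool:
    """Hypothetical stand-in for an LLM judge: case-insensitive substring match."""
    return claim.lower() in context.lower()


context = "The Pro plan costs $49 per month. Support is available by email."
claims = [
    "the pro plan costs $49 per month",
    "support is available by phone",
]
result = groundedness_verdict(claims, context, toy_judge)
print(result["label"], result["score"], result["unsupported_claims"])
```

Keeping the list of unsupported claims alongside the label makes the verdict auditable during human spot-checks, and the score gives you something to threshold and chart over time.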

Why this works: judging "does X follow from Y" is generally an easier task than generating X correctly. The judge checks entailment against text that is in front of it, so it is less prone to hallucinating itself, though not immune.

Critical caveat: the judge is only as good as the reference you give it. Ask "is this true?" without a source and the judge hallucinates too. Groundedness judging always needs the source text.

2. Unfaithfulness — ignoring the source you were given

Unfaithfulness is the subtler cousin of hallucination. The model had the right context but ignored or distorted it, pulling from its own priors instead.

Hallucination vs. unfaithfulness — the dividing line

Hallucination: the source didn't contain the info; the model invented it.
Unfaithfulness: the source had it (or contradicted the answer); the model deviated anyway.

That distinction is the whole game.

Why it happens

Parametric override. The model's training-memorized "fact" overrides the retrieved text. (Source says the price is $49; model says $39 because that's what it "knows.") Here, priors means the knowledge baked into the model's weights — its default beliefs. When context and priors conflict, an unfaithful model trusts what it memorized over what you gave it.
Partial reading. In a long context, the model anchors on the top and under-attends to the relevant chunk further down. This is an input-side failure: it never properly took in the info.
Over-summarizing. The model did read the right chunk but, when generating, condensed too aggressively and dropped a qualifier. This is an output-side failure — it took the info in but left it out of the answer. The source chunk stays fully present in the context window the whole time. Example — source: "The refund applies only to annual plans purchased after Jan 2025." Over-summarized answer: "You get a refund." The model read the full sentence but threw away every qualifier.

Partial reading and over-summarizing look similar but fail at different stages — one is a reading/attention miss, the other a writing/condensing loss. Both make the answer deviate from the source, so an evaluator flags both, but you fix them differently.

How to catch it

Use an LLM judge for faithfulness, often via the G-EVAL methodology, plus a golden set with known-correct sources.

Faithfulness via entailment. Entailment is a logic/NLP relationship: does claim B follow from text A?

Entailed → A supports B
Contradicted → A disproves B
Neutral → A neither proves nor disproves B

Faithfulness judging reframes "is this faithful?" as "is each claim entailed by the source?" — turning a fuzzy quality question into a checkable logical one. Mechanically, the judge is doing structured reading comprehension: given {source as premise, claim as hypothesis}, it outputs ENTAILED / CONTRADICTED / NEUTRAL with reasoning, ideally decomposing the answer into atomic claims and scoring each.
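
The premise/hypothesis framing above can be sketched as a prompt builder plus a label parser. Everything here is an assumption for illustration: the prompt wording, the "LABEL:" output convention, and the function names are hypothetical, and the judge output in the example is hard-coded rather than produced by a model.

```python
from enum import Enum


class Entailment(Enum):
    ENTAILED = "ENTAILED"
    CONTRADICTED = "CONTRADICTED"
    NEUTRAL = "NEUTRAL"


# Hypothetical prompt template; the exact wording is an assumption.
PROMPT = (
    "Premise (source):\n{source}\n\n"
    "Hypothesis (claim):\n{claim}\n\n"
    "Explain your reasoning in one or two sentences, then finish with a final "
    "line of the form 'LABEL: ENTAILED', 'LABEL: CONTRADICTED', or "
    "'LABEL: NEUTRAL'."
)


def build_prompt(source: str, claim: str) -> str:
    """Fill the template with the source as premise and the claim as hypothesis."""
    return PROMPT.format(source=source, claim=claim)


def parse_label(judge_output: str) -> Entailment:
    """Read the last 'LABEL: X' line; fall back to NEUTRAL if nothing parses.

    NEUTRAL is the conservative default here because a faithfulness check
    treats anything that is not ENTAILED as a failure.
    """
    for line in reversed(judge_output.strip().splitlines()):
        if line.strip().upper().startswith("LABEL:"):
            value = line.split(":", 1)[1].strip().upper()
            try:
                return Entailment(value)
            except ValueError:
                break
    return Entailment.NEUTRAL


prompt = build_prompt("The price is $49.", "The price is $39.")
# Hard-coded stand-in for what a judge model might return.
fake_output = "The source states $49, but the claim states $39.\nLABEL: CONTRADICTED"
print(parse_label(fake_output).value)
```

Asking for reasoning first and the label last mirrors the "with reasoning" part of the description, and parsing from the end of the output avoids picking up a label word that appears inside the explanation.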

G-EVAL is a framework for using an LLM as a quality judge, designed so that its scores align well with human ratings. It has three parts:

Criteria → auto-generated evaluation steps. Give the judge a criterion ("faithfulness: does the answer stay true to the source?"). G-EVAL has the LLM first generate its own step-by-step rubric (chain-of-thought), then evaluate using it. Reasoning-before-scoring boosts accuracy.
Form-filling scoring. The judge fills in a score on a defined scale given {criterion + steps + input + output}.
Probability-weighted scores (the clever bit). LLM judges tend to cluster on a few integers (for example, almost always "3" or "5"), giving coarse, tied scores. G-EVAL takes the token probabilities over each possible score and computes the expected value: score = Σ p(s) × s. If the model is 70% on "4" and 30% on "5," the final score is 0.7 × 4 + 0.3 × 5 = 4.3. The result is continuous and fine-grained, and it tends to track human judgment more closely than the raw integer.

Caveat: the probability-weighting step needs logprob access from the model API. Without logprobs, you fall back to the raw integer score (coarser).
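
The expected-value step can be sketched in a few lines. This assumes you have already pulled the log-probabilities for the score token out of the API response into a plain dictionary; the function name expected_score and the 1-to-5 scale are assumptions, and renormalizing over valid score tokens is one reasonable choice rather than the only one.

```python
import math
from typing import Dict, Iterable


def expected_score(
    score_logprobs: Dict[str, float],
    scale: Iterable[int] = (1, 2, 3, 4, 5),
) -> float:
    """Expected score under the judge's token distribution: sum of p(s) * s.

    score_logprobs maps candidate tokens (e.g. "4") to log-probabilities.
    Tokens that are not valid scores are ignored and the remaining
    probability mass is renormalized.
    """
    valid = set(scale)
    probs: Dict[int, float] = {}
    for token, logprob in score_logprobs.items():
        text = token.strip()
        if text.isdigit() and int(text) in valid:
            probs[int(text)] = probs.get(int(text), 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no probability mass on any valid score token")
    return sum(score * p for score, p in probs.items()) / total


# The worked example from the text: 70% on "4", 30% on "5".
logprobs = {"4": math.log(0.7), "5": math.log(0.3)}
print(round(expected_score(logprobs), 2))  # 4.3
```

Renormalizing matters in practice because some probability mass usually lands on non-score tokens (whitespace, words), and without it the expected value would be biased downward.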

The golden set is the human-built ground truth this is measured against — a curated, fixed test set where, for each input, a human has verified both the right answer and the source it should draw from. It serves two jobs here: it supplies the verified correct source to feed the judge, and its human faithfulness scores are used to validate the judge itself (does the judge agree with humans?) before you trust it on live, unlabeled traffic.

3. Calibration / overconfidence — when confidence doesn't match accuracy

Calibration asks: does the model's stated or internal confidence match its actual accuracy? A well-calibrated model that says "90% sure" is right about 90% of the time. Overconfidence is saying 90% but being right only 60% of the time.
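
One way to make that definition measurable is a reliability table: bucket predictions by confidence, then compare each bucket's mean confidence to its accuracy. The sketch below is an illustration under assumptions, not a prescribed method: the function names are hypothetical, it uses ten equal-width bins, and the single-number summary is the weighted gap commonly called expected calibration error, which the text above does not name.

```python
from typing import List, Tuple

Row = Tuple[float, float, int]  # (mean confidence, accuracy, count)


def reliability_table(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> List[Row]:
    """Bucket predictions by confidence; report mean confidence vs accuracy."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the top bin
        bins[idx].append((conf, ok))
    rows: List[Row] = []
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        rows.append((mean_conf, accuracy, len(bucket)))
    return rows


def calibration_gap(rows: List[Row]) -> float:
    """Count-weighted average of |confidence - accuracy| across bins."""
    total = sum(n for _, _, n in rows)
    return sum(n / total * abs(conf - acc) for conf, acc, n in rows)


# The overconfident case from the text: says 90%, right 60% of the time.
confs = [0.9] * 10
outcomes = [True] * 6 + [False] * 4
rows = reliability_table(confs, outcomes)
print(rows, round(calibration_gap(rows), 2))
```

A gap near zero means confidence is usable as a routing signal; here it comes out at roughly 0.3, the difference between 90% stated and 60% observed.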

Two types of confidence

Verbalized — the model says "I'm fairly confident" or "95%." Easy to read, often poorly calibrated (models tend to sound sure).
Logprob-based — the model's token probabilities for its answer. A more objective signal, computable mathematically with no judge needed — which is why this is often a deterministic check (when logprobs are available).
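
A logprob-based confidence can be computed directly from the per-token log-probabilities of the answer. How the tokens are combined is an assumption here, since the text does not specify it: this sketch offers the joint probability of the whole answer and a length-normalized variant (the geometric mean of token probabilities), under the hypothetical name answer_confidence.

```python
import math
from typing import List


def answer_confidence(token_logprobs: List[float], length_normalize: bool = True) -> float:
    """Turn per-token log-probabilities into a single confidence in [0, 1].

    length_normalize=True  -> geometric mean of the token probabilities
    length_normalize=False -> joint probability of the whole answer
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    total = sum(token_logprobs)
    if length_normalize:
        total /= len(token_logprobs)
    return math.exp(total)


# Two answer tokens, each generated with probability 0.9.
token_logprobs = [math.log(0.9), math.log(0.9)]
print(round(answer_confidence(token_logprobs), 2))
print(round(answer_confidence(token_logprobs, length_normalize=False), 2))
```

The joint probability shrinks as answers get longer, so the length-normalized form is usually the easier one to compare across answers of different lengths. Either way, the raw number still needs checking against observed accuracy before you gate on it.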

Why it matters

If a product expresses confidence — or gates actions on it ("only auto-answer if confident") — miscalibration is dangerous. An overconfident wrong output is worse than an honest "I'm not sure." Calibration tells you whether confidence is a usable signal for routing and escalation.

Mitigation levers, by failure mode

The cleanest way to think about fixes is where in the pipeline you intervene.

Hallucination

Retrieval: improve recall (better embeddings, reranking, larger or more chunks) so the answer can be grounded. Much of what shows up as RAG hallucination traces back to a retrieval miss.
Prompt: "answer only from the provided context; if it's not there, say so." Give an explicit abstention escape hatch.
Generation: lower temperature; require inline citations to source spans.
Post-hoc: a groundedness judge gates output — unsupported claims trigger a regenerate, a hedge, or a fallback to "I don't know."
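
The post-hoc gate can be sketched as a small retry loop. The generator and judge are passed in as plain callables, so nothing here depends on a particular model API; the names gated_answer, FALLBACK, and the two fake functions are hypothetical, and regenerate-then-fall-back is just one of the options listed above.

```python
from typing import Callable, List

FALLBACK = "I don't know based on the provided context."


def gated_answer(
    question: str,
    context: str,
    generate: Callable[[str, str], str],
    find_unsupported: Callable[[str, str], List[str]],
    max_attempts: int = 2,
) -> str:
    """Return the first draft with no unsupported claims, else the fallback."""
    for _ in range(max_attempts):
        answer = generate(question, context)
        if not find_unsupported(answer, context):
            return answer  # every claim is grounded, so let it through
    return FALLBACK  # all attempts had unsupported claims


# Fake generator: the first draft is wrong, the retry is grounded.
drafts = iter(["The plan costs $39.", "The plan costs $49."])


def fake_generate(question: str, context: str) -> str:
    return next(drafts)


def fake_judge(answer: str, context: str) -> List[str]:
    """Toy judge: flags the whole answer unless it mentions the right price."""
    return [] if "$49" in answer else [answer]


print(gated_answer("How much is the plan?", "The plan costs $49.", fake_generate, fake_judge))
```

Capping the number of attempts keeps latency and cost bounded, and ending on an explicit fallback means the gate never ships a draft the judge rejected.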

Unfaithfulness

Retrieval/positioning: place the key chunk where attention is strongest (typically near the start or end of the context); cut distractor chunks. (Fixes partial reading.)
Prompt: "preserve all qualifiers, numbers, and conditions"; quote-then-answer — extract relevant spans first, then answer from the extraction.
Generation: don't over-constrain length — too short an output budget forces dropping qualifiers. (Fixes over-summarizing.)
Decompose: answer claim-by-claim against the source rather than in one freeform pass. In practice this is a prompt or a two-step chain ("draft, then break into atomic claims and label each as entailed / not"), and the same move powers the faithfulness judge.
Post-hoc: a faithfulness/entailment judge rejects answers with non-entailed claims.
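
The quote-then-answer lever can be sketched as a two-call chain. The prompt wording, the NONE convention, and the function name quote_then_answer are all assumptions for illustration, and llm is any callable that takes a prompt string and returns text (a stub is used below instead of a real model).

```python
from typing import Callable

# Hypothetical prompt templates; the wording is an assumption.
EXTRACT_PROMPT = (
    "Source:\n{source}\n\nQuestion: {question}\n\n"
    "Copy, word for word, every sentence from the source that is relevant to "
    "the question. Keep all qualifiers, numbers, and conditions. "
    "If nothing is relevant, reply NONE."
)
ANSWER_PROMPT = (
    "Quoted spans:\n{quotes}\n\nQuestion: {question}\n\n"
    "Answer using only the quoted spans. Preserve all qualifiers, numbers, "
    "and conditions."
)


def quote_then_answer(question: str, source: str, llm: Callable[[str], str]) -> str:
    """Step 1: extract verbatim spans. Step 2: answer from the spans only."""
    quotes = llm(EXTRACT_PROMPT.format(source=source, question=question)).strip()
    if quotes.upper() == "NONE":
        return "Not in the source."  # abstain rather than answer from priors
    return llm(ANSWER_PROMPT.format(quotes=quotes, question=question))


# Stub model: records prompts and returns canned text for each step.
seen_prompts = []


def stub_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    if prompt.startswith("Source:"):
        return "The refund applies only to annual plans purchased after Jan 2025."
    return "Refunds apply only to annual plans purchased after Jan 2025."


source = "The refund applies only to annual plans purchased after Jan 2025."
print(quote_then_answer("Can I get a refund?", source, stub_llm))
```

Because the second call sees only the extracted spans, the model has less room to drift toward its priors, and the extraction itself gives you an artifact to inspect when an answer drops a qualifier.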

Calibration / overconfidence

Elicit a better signal: prefer logprobs over verbalized confidence; or use self-consistency (sample N answers, use agreement rate as a confidence proxy — a pure black-box method that works when logprobs aren't exposed).
Prompt: ask for explicit uncertainty; allow "I don't know."
Post-hoc recalibration: temperature scaling or Platt scaling maps raw confidence scores onto observed accuracy (cheap, effective, but it needs a labeled calibration set).
Product-level: gate on confidence — auto-answer above a threshold; below it, escalate to a human, hedge, or ask a clarifying question.
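
Self-consistency and confidence gating combine naturally, and both can be sketched without any model access. The assumptions are worth stating: answers are compared by exact match after trimming and lowercasing (which only suits short, canonical answers), the 0.8 threshold is arbitrary, and the names self_consistency and route are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def self_consistency(samples: List[str]) -> Tuple[str, float]:
    """Return (majority answer, agreement rate) over N sampled answers."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


def route(confidence: float, threshold: float = 0.8) -> str:
    """Gate on confidence: auto-answer at or above the threshold, else escalate."""
    return "auto_answer" if confidence >= threshold else "escalate"


# Five sampled answers to the same question; four agree.
samples = ["$49", "$49", "$39", " $49", "$49"]
answer, agreement = self_consistency(samples)
print(answer, agreement, route(agreement))
```

The threshold is a product decision rather than a modeling one: set it from a labeled sample so that the accuracy above the line matches what you are willing to ship without review.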

A note on logprobs: the most rigorous calibration (logprob-based confidence + temperature scaling) requires logprob access. When the API hides logprobs, self-consistency and verbalized confidence are the practical fallbacks — coarser, but black-box-friendly. Product-level gating works on whatever signal you have.
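
For a feel of what temperature scaling does, here is a minimal sketch on scalar confidences. It is a simplification under stated assumptions: classic temperature scaling divides a model's full logit vector by a learned temperature T, whereas this version applies T to the logit of a single confidence value (effectively a one-parameter cousin of Platt scaling) and fits T by grid search instead of gradient descent. All function names are hypothetical.

```python
import math
from typing import List, Optional


def _logit(p: float) -> float:
    p = min(max(p, 1e-6), 1 - 1e-6)  # keep the log finite at 0 and 1
    return math.log(p / (1 - p))


def rescale(confidence: float, temperature: float) -> float:
    """Divide the confidence's logit by T; T > 1 pulls it toward 0.5."""
    return 1 / (1 + math.exp(-_logit(confidence) / temperature))


def nll(confidences: List[float], correct: List[bool], temperature: float) -> float:
    """Mean negative log-likelihood of the outcomes under rescaled confidences."""
    total = 0.0
    for conf, ok in zip(confidences, correct):
        p = min(max(rescale(conf, temperature), 1e-12), 1 - 1e-12)
        total -= math.log(p if ok else 1 - p)
    return total / len(confidences)


def fit_temperature(
    confidences: List[float],
    correct: List[bool],
    grid: Optional[List[float]] = None,
) -> float:
    """Pick the temperature on the grid that minimizes NLL on the labeled set."""
    grid = grid or [0.25 + 0.05 * i for i in range(116)]  # 0.25 up to 6.0
    return min(grid, key=lambda t: nll(confidences, correct, t))


# Labeled calibration set: the model said 0.9 every time, was right 6 of 10.
confs = [0.9] * 10
outcomes = [True] * 6 + [False] * 4
temperature = fit_temperature(confs, outcomes)
print(round(temperature, 2), round(rescale(0.9, temperature), 2))
```

On this toy set the fitted temperature is well above 1, and the rescaled confidence lands close to the observed 60% accuracy. The labeled calibration set should be separate from the data you later evaluate on, or the calibration will look better than it is.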

The cross-cutting lever

A strong abstention path — "not in the source" / "I'm not sure" — helps all three modes. It converts confident-wrong outputs into honest-uncertain ones, which is almost always the safer failure for a user-facing product.

The takeaway

"Correctness" is three problems, not one:

Failure mode	What broke	Primary eval	Offline / Online
Hallucination	Invented info with no source	LLM judge (groundedness) + human spot-check	Both
Unfaithfulness	Had the source, deviated from it	LLM judge (faithfulness / G-EVAL)	Both
Calibration / overconfidence	Confidence ≠ accuracy	Deterministic (if logprobs) + human	Offline

Note: Offline means evaluation before/outside production, on a fixed dataset, with no real users involved. Online means evaluation in production, on live, real user traffic.

Get the diagnosis right and the fix follows. Most of the time, the cheapest durable win across all three is simply giving the model — and your product — permission to say "I don't know."