When people talk about whether an LLM "got it right," they usually collapse three very different failures into one word. For anyone building or evaluating an AI product, separating them matters, because each has different causes, is caught with different methods, and is fixed with different levers.

This is the Correctness cluster. Here are its three failure modes.

1. Hallucination — inventing what isn't there

A hallucination is when the model states something that isn't supported by any source — it fabricates.

Why it happens

An LLM is a next-token predictor, not a fact-lookup engine. A few root causes stack up:

How to catch it

Use an LLM judge for groundedness, backed by human spot-checks.

The judge doesn't trust the answer blindly — it checks the answer against the retrieved context:

  1. Groundedness check. Feed the judge the retrieved context plus the generated answer, and ask: "Is every claim here supported by the context?" Unsupported claims get flagged.
  2. Claim decomposition. Strong judges break the answer into atomic claims and verify each separately, so a single fabricated fact in an otherwise-correct answer still gets caught.
  3. Verdict. Output a label (supported / unsupported / partial) or a score you can threshold and track.
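
The three steps above can be sketched roughly as follows. This is a minimal illustration, not a production judge: `split_into_claims`, `groundedness_verdict`, and `toy_is_supported` are hypothetical names, the sentence splitter stands in for LLM-based claim decomposition, and the substring check stands in for the real judge call.

```python
import re
from typing import Callable, Dict, List


def split_into_claims(answer: str) -> List[str]:
    """Naive claim decomposition: one claim per sentence.

    Assumption: a real system would ask an LLM to produce atomic claims.
    """
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def toy_is_supported(context: str, claim: str) -> bool:
    """Stand-in for the judge call: case-insensitive substring match."""
    return claim.lower().rstrip(".!?") in context.lower()


def groundedness_verdict(
    context: str,
    answer: str,
    is_supported: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Check each claim against the context and aggregate into a verdict."""
    claims = split_into_claims(answer)
    if not claims:
        raise ValueError("answer contains no claims to check")

    # Every claim the judge cannot support from the context gets flagged.
    flagged = [c for c in claims if not is_supported(context, c)]

    if not flagged:
        label = "supported"
    elif len(flagged) == len(claims):
        label = "unsupported"
    else:
        label = "partial"

    # Fraction of supported claims: a score you can threshold and track.
    score = 1 - len(flagged) / len(claims)
    return {"label": label, "score": score, "unsupported_claims": flagged}
```

Passing the judge in as a function keeps the aggregation logic testable without a model call; swapping `toy_is_supported` for a real LLM-backed check should not change the rest of the code.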

Why this works: judging "does X follow from Y" is generally an easier task than generating X correctly. The judge performs entailment against text that is in front of it, so it is far less prone to hallucinating itself.

Critical caveat: the judge is only as good as the reference you give it. Ask "is this true?" without a source and the judge can hallucinate too. Groundedness judging always needs the source text.

2. Unfaithfulness — ignoring the source you were given

Unfaithfulness is the subtler cousin of hallucination. The model had the right context but ignored or distorted it, pulling from its own priors instead.

Hallucination vs. unfaithfulness — the dividing line

That distinction is the whole game.

Why it happens

Partial reading and over-summarizing look similar but fail at different stages — one is a reading/attention miss, the other a writing/condensing loss. Both make the answer deviate from the source, so an evaluator flags both, but you fix them differently.

How to catch it

Use an LLM judge for faithfulness, often via the G-EVAL methodology, plus a golden set with known-correct sources.

Faithfulness via entailment. Entailment is a logic/NLP relationship: does claim B follow from text A?

Faithfulness judging reframes "is this faithful?" as "is each claim entailed by the source?" — turning a fuzzy quality question into a checkable logical one. Mechanically, the judge is doing structured reading comprehension: given {source as premise, claim as hypothesis}, it outputs ENTAILED / CONTRADICTED / NEUTRAL with reasoning, ideally decomposing the answer into atomic claims and scoring each.
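
One way to frame the premise/hypothesis setup in code is sketched below. The prompt wording, the `VERDICT:` line convention, and the function names are all assumptions made for illustration; the actual judge call is left out because it depends on the model API in use.

```python
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    ENTAILED = "ENTAILED"
    CONTRADICTED = "CONTRADICTED"
    NEUTRAL = "NEUTRAL"


# Hypothetical prompt: source as premise, claim as hypothesis,
# reasoning first, then a single machine-readable verdict line.
PROMPT_TEMPLATE = (
    "Premise (source):\n{source}\n\n"
    "Hypothesis (claim):\n{claim}\n\n"
    "Explain your reasoning, then finish with one line of the form\n"
    "VERDICT: ENTAILED | CONTRADICTED | NEUTRAL"
)


def build_prompt(source: str, claim: str) -> str:
    return PROMPT_TEMPLATE.format(source=source, claim=claim)


def parse_verdict(judge_output: str) -> Verdict:
    """Read the last VERDICT line; raise if it is missing or unknown."""
    for line in reversed(judge_output.strip().splitlines()):
        if line.strip().upper().startswith("VERDICT:"):
            label = line.split(":", 1)[1].strip().upper()
            return Verdict(label)  # ValueError on an unexpected label
    raise ValueError("no VERDICT line found in judge output")


def faithfulness_summary(verdicts: List[Verdict]) -> Dict[str, float]:
    """Share of atomic claims falling under each verdict."""
    if not verdicts:
        raise ValueError("no verdicts to summarize")
    total = len(verdicts)
    return {v.value: sum(1 for x in verdicts if x is v) / total for v in Verdict}
```

Failing loudly on an unparseable verdict is a deliberate choice here: silently defaulting to NEUTRAL would hide judge formatting problems inside the faithfulness score.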

G-EVAL is a framework for using an LLM as a quality judge, designed to align well with human ratings. It has three parts:

  1. Criteria → auto-generated evaluation steps. Give the judge a criterion ("faithfulness: does the answer stay true to the source?"). G-EVAL has the LLM first generate its own step-by-step rubric (chain-of-thought), then evaluate using it. Reasoning-before-scoring boosts accuracy.
  2. Form-filling scoring. The judge fills in a score on a defined scale given {criterion + steps + input + output}.
  3. Probability-weighted scores (the clever bit). LLMs tend to cluster on a few integers (often "3" or "5"), giving coarse, tied scores. G-EVAL takes the token probabilities over each possible score and computes the expected value: score = Σ p(s) × s. If the model puts 70% on "4" and 30% on "5," the final score is 0.7 × 4 + 0.3 × 5 = 4.3. The result is continuous and fine-grained, and tends to track human judgment more closely than the raw integer.

Caveat: the probability-weighting step needs logprob access from the model API. Without logprobs, you fall back to the raw integer score (coarser).
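
The expected-value formula score = Σ p(s) × s can be written in a few lines. This sketch assumes you have already extracted a mapping from candidate score tokens to log-probabilities from your API response; the function name and the input shape are hypothetical. Renormalizing over the score tokens is also an assumption, added because top-k logprobs rarely sum to exactly 1.

```python
import math
from typing import Dict


def expected_score(score_logprobs: Dict[str, float]) -> float:
    """Probability-weighted score: sum of p(s) * s over candidate scores.

    score_logprobs maps tokens such as "4" or "5" to log-probabilities.
    Non-numeric tokens are ignored.
    """
    probs: Dict[int, float] = {}
    for token, logprob in score_logprobs.items():
        token = token.strip()
        if token.isdigit():
            score = int(token)
            probs[score] = probs.get(score, 0.0) + math.exp(logprob)

    total = sum(probs.values())
    if total == 0:
        raise ValueError("no numeric score tokens found")

    # Renormalize so the kept probabilities sum to 1 (assumption, see above).
    return sum(score * p for score, p in probs.items()) / total
```

With 70% of the mass on "4" and 30% on "5", this reproduces the 4.3 from the example above.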

The golden set is the human-built ground truth this is measured against — a curated, fixed test set where, for each input, a human has verified both the right answer and the source it should draw from. It serves two jobs here: it supplies the verified correct source to feed the judge, and its human faithfulness scores are used to validate the judge itself (does the judge agree with humans?) before you trust it on live, unlabeled traffic.
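
To make "does the judge agree with humans?" measurable, one common choice is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The text does not prescribe a metric, so treat this as one reasonable option; the function name and label values are illustrative.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Chance-corrected agreement between human and judge labels."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(human)
    # Observed agreement: share of items where both gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Expected agreement if both labeled independently at their own base rates.
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Raw percent agreement can look high simply because most golden-set items are faithful; kappa is less flattering in that situation, which is usually what you want before trusting the judge on live traffic.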

3. Calibration / overconfidence — when confidence doesn't match accuracy

Calibration asks: does the model's stated or internal confidence match its actual accuracy? A well-calibrated model that says "90% sure" is right about 90% of the time. Overconfidence is saying 90% but being right only 60% of the time.
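
A standard way to put a number on this gap is expected calibration error (ECE): bucket predictions by confidence, then average the gap between mean confidence and accuracy in each bucket, weighted by bucket size. The sketch below is a plain-Python version; the function name and the choice of ten equal-width bins are assumptions.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Size-weighted average of |mean confidence - accuracy| across bins."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and the same length")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece
```

For the overconfident case described above (always 90% confident, right 60% of the time), every prediction lands in one bin and the ECE comes out at roughly 0.3.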

Two types of confidence

  1. Verbalized — the model says "I'm fairly confident" or "95%." Easy to read, often poorly calibrated (models tend to sound sure).
  2. Logprob-based — the model's token probabilities for its answer. A more objective signal, computable mathematically with no judge needed — which is why this is often a deterministic check (when logprobs are available).
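
As a rough sketch of the logprob-based signal, the snippet below turns per-token log-probabilities for an answer into two confidence numbers. It assumes the token logprobs have already been pulled from the API response; the function name and the returned keys are hypothetical.

```python
import math
from typing import Dict, Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> Dict[str, float]:
    """Derive confidence scores from the logprobs of the answer tokens."""
    if len(token_logprobs) == 0:
        raise ValueError("need at least one token logprob")

    total = sum(token_logprobs)
    return {
        # Probability of the whole answer; shrinks as answers get longer.
        "joint": math.exp(total),
        # Geometric mean per token; comparable across answer lengths.
        "per_token": math.exp(total / len(token_logprobs)),
    }
```

Which of the two to threshold on is a design choice: the joint probability penalizes long answers, while the per-token geometric mean can hide a single very unlikely token inside an otherwise confident answer.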

Why it matters

If a product expresses confidence — or gates actions on it ("only auto-answer if confident") — miscalibration is dangerous. An overconfident wrong output is worse than an honest "I'm not sure." Calibration tells you whether confidence is a usable signal for routing and escalation.
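
Confidence gating of the "only auto-answer if confident" kind can be sketched as a simple router. The thresholds, action names, and fallback message below are illustrative assumptions; in practice the cutoffs should come from calibration measurements on your own data.

```python
from typing import Dict


def route(
    answer: str,
    confidence: float,
    auto_threshold: float = 0.9,    # hypothetical cutoff
    review_threshold: float = 0.6,  # hypothetical cutoff
) -> Dict[str, str]:
    """Choose an action based on how confident the system is."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    if confidence >= auto_threshold:
        return {"action": "auto_answer", "text": answer}
    if confidence >= review_threshold:
        # Keep the draft, but send it to a person before it ships.
        return {"action": "human_review", "text": answer}
    # Below both cutoffs: abstain instead of risking a confident-wrong reply.
    return {"action": "abstain", "text": "I'm not sure."}
```

A router like this is only as trustworthy as the confidence signal feeding it, which is why calibration has to be checked before the thresholds mean anything.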

Mitigation levers, by failure mode

The cleanest way to think about fixes is where in the pipeline you intervene.

Hallucination

Unfaithfulness

Calibration / overconfidence

A note on logprobs: the most rigorous calibration (logprob-based confidence + temperature scaling) requires logprob access. When the API hides logprobs, self-consistency and verbalized confidence are the practical fallbacks — coarser, but black-box-friendly. Product-level gating works on whatever signal you have.
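
The two techniques named above can be sketched side by side. Both functions are illustrative and their names are hypothetical: `self_consistency` assumes you have already sampled several answers to the same prompt, and `temperature_scale` assumes access to raw logits plus a temperature fitted on held-out data (the fitting step is not shown).

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def self_consistency(samples: Sequence[str]) -> Tuple[str, float]:
    """Black-box confidence: share of samples agreeing with the majority."""
    if len(samples) == 0:
        raise ValueError("need at least one sampled answer")
    # Light normalization so trivial formatting differences still agree.
    normalized = [s.strip().lower() for s in samples]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


def temperature_scale(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax over logits divided by a temperature.

    A temperature above 1 flattens the distribution (less confident);
    below 1 sharpens it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Exact string matching is a crude agreement test for free-form answers; for anything longer than a short factual reply, some form of semantic comparison would be needed in place of the normalization step.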

The cross-cutting lever

A strong abstention path — "not in the source" / "I'm not sure" — helps all three modes. It converts confident-wrong outputs into honest-uncertain ones, which is almost always the safer failure for a user-facing product.

The takeaway

"Correctness" is three problems, not one:

Failure mode                 | What broke                       | Primary eval                                | Offline / Online
-----------------------------|----------------------------------|---------------------------------------------|-----------------
Hallucination                | Invented info with no source     | LLM judge (groundedness) + human spot-check | Both
Unfaithfulness               | Had the source, deviated from it | LLM judge (faithfulness / G-EVAL)           | Both
Calibration / overconfidence | Confidence ≠ accuracy            | Deterministic (if logprobs) + human         | Offline

Note: Offline means evaluation before/outside production, on a fixed dataset, with no real users involved. Online means evaluation in production, on live, real user traffic.

Get the diagnosis right and the fix follows. Most of the time, the cheapest durable win across all three is simply giving the model — and your product — permission to say "I don't know."