A visual guide to the building blocks — n-grams and TF-IDF — and the four metrics used in every captioning paper: BLEU, ROUGE-L, METEOR, and CIDEr.
Every metric below answers the same question in a different way: "How close is a generated caption to the ground truth?" They differ in what they count, how they weight it, and what failure modes they have.
An n-gram is a contiguous sequence of n words extracted from a sentence. All four metrics below rely on comparing n-gram sets between a candidate caption and a reference caption.
A sentence of length L has L − n + 1 n-grams. Higher-order n-grams capture more specific phrases — matching a 4-gram is strong evidence the candidate phrased something the same way as the reference.
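Extracting n-grams is a one-liner. A minimal sketch (the helper name `ngrams` is just for illustration):

```python
def ngrams(sentence, n):
    """Return the list of n-grams (as tuples) of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    # A sentence of length L yields L - n + 1 n-grams.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("a cat sits on the mat", 2))
# [('a', 'cat'), ('cat', 'sits'), ('sits', 'on'), ('on', 'the'), ('the', 'mat')]
```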
Term Frequency × Inverse Document Frequency — a weighting scheme that asks: "Is this word important in this specific sentence, and is it rare across the whole corpus?"
TF (term frequency): How often a word (or n-gram) appears in this sentence, normalized by total count.
IDF (inverse document frequency): How rare is this word across all documents (captions) in the evaluation corpus? Words appearing everywhere get crushed; rare words get boosted.
TF-IDF: Multiply them together. A word scores high only if it's frequent in this sentence AND rare in the corpus.
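The whole scheme fits in a few lines. A minimal sketch, assuming TF is count normalized by sentence length and IDF is log(N / document frequency) — real implementations vary in normalization and smoothing, and the tiny `corpus` here is made up for illustration:

```python
import math
from collections import Counter

def tf_idf(sentence, corpus):
    """TF-IDF weight for each word in `sentence`, with IDF taken from `corpus`.
    Assumes every word of `sentence` occurs somewhere in the corpus."""
    tokens = sentence.split()
    tf = {w: c / len(tokens) for w, c in Counter(tokens).items()}
    n_docs = len(corpus)
    return {
        w: tf[w] * math.log(n_docs / sum(1 for doc in corpus if w in doc.split()))
        for w in tf
    }

corpus = ["a person is skateboarding", "a person is running", "a dog is running"]
weights = tf_idf("a person is skateboarding", corpus)
# "skateboarding" (rare in the corpus) dominates; "a" and "is" appear in every
# caption, so their IDF is log(3/3) = 0 and their weight vanishes.
```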
The caption "a person is skateboarding" is dominated by "skateboarding" — the only word that actually identifies the content.
TF-IDF gives a single number per n-gram. But to compare two entire sentences, we need to compare all their n-gram weights at once. The idea: treat every unique n-gram in the corpus vocabulary as a dimension of a vector space. Each sentence becomes a sparse vector — one entry per possible n-gram, with TF-IDF as the value (zero if that n-gram doesn't appear in the sentence).
Each row is a vector. Each column is one dimension — a unique word from the corpus vocabulary. The value in each cell is the TF-IDF weight of that word in that sentence (0 if absent).
| | "a" | "person" | "skateboard" | "on" | "railing" | "cat" | "the" | "mat" |
|---|---|---|---|---|---|---|---|---|
| Candidate "a person skateboards on a railing" | 0.01 | 0.18 | 1.42 | 0.03 | 1.85 | 0 | 0 | 0 |
| Reference "a person is skateboarding on the railing" | 0.01 | 0.15 | 1.30 | 0.02 | 1.65 | 0 | 0.01 | 0 |
Both vectors "point in the same direction" — their largest values are in the same dimensions (skateboard, railing). The near-zero entries ("a", "on", "the") barely affect the direction. Cosine similarity measures the angle between these two vectors: close to 1.0 if they emphasize the same distinctive words, close to 0.0 if they don't.
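Since the vectors are sparse, dicts are a natural representation. A minimal sketch using the (hypothetical) TF-IDF weights from the table — absent n-grams simply contribute nothing to the dot product:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v))

cand = {"a": 0.01, "person": 0.18, "skateboard": 1.42, "on": 0.03, "railing": 1.85}
ref  = {"a": 0.01, "person": 0.15, "skateboard": 1.30, "on": 0.02,
        "railing": 1.65, "the": 0.01}
print(cosine(cand, ref))  # very close to 1.0: same distinctive dimensions dominate
```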
Bilingual Evaluation Understudy (Papineni et al., 2002). Originally from machine translation. Asks: "What fraction of the candidate's n-grams appear in the reference?" — it's a precision-focused metric.
Reference: "a cat sits on the mat"
Candidate: "a cat is on the mat"
Unigram matches: a ✓ cat ✓ is ✗ on ✓ the ✓ mat ✓ → p1 = 5/6
Bigram matches: "a cat" ✓, "cat is" ✗, "is on" ✗, "on the" ✓, "the mat" ✓ → p2 = 3/5
Trigrams: "a cat is" ✗, "cat is on" ✗, "is on the" ✗, "on the mat" ✓ → p3 = 1/4
4-grams: none match → p4 = 0
BLEU = 0 because p4 = 0 kills the geometric mean. In practice, smoothing is applied to avoid this.
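The worked example above can be reproduced in a few lines. A minimal sketch of sentence-level BLEU with clipped n-gram counts and add-one smoothing (one common smoothing scheme; NLTK and sacrebleu offer several others):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence BLEU: geometric mean of clipped n-gram precisions times a
    brevity penalty. Add-one smoothing keeps a zero precision from zeroing
    the whole score."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        matches = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = sum(c_ngrams.values())
        log_precisions.append(math.log((matches + 1) / (total + 1)))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("a cat is on the mat", "a cat sits on the mat"))
# ≈ 0.489 with add-one smoothing (p4 = 0 would otherwise force BLEU = 0)
```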
Weakness for captioning: BLEU is precision-only — it doesn't penalize missing important reference words. A candidate that says "a cat" gets perfect unigram precision despite ignoring "sits on the mat." The brevity penalty partially compensates but it's a blunt instrument. BLEU also treats all n-grams equally — "the" counts the same as "skateboard."
Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004). The ROUGE-L variant uses the Longest Common Subsequence (LCS). Unlike n-grams, a subsequence doesn't need to be contiguous — words just need to appear in the same order.
Reference: "a cat sits on the mat"
Candidate: "the cat was sitting on a mat"
LCS: "cat … on … mat"
LCS length = 3. These words appear in both sentences in the same relative order, but not necessarily next to each other.
Intuition: LCS captures structural similarity — are the same words appearing in roughly the same order? — without demanding exact contiguous matches. It's more forgiving than n-gram overlap but coarser than the other metrics. It treats all words equally and doesn't understand synonyms ("sits" ≠ "sitting") or word importance.
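LCS length is computed with the standard dynamic-programming table. A minimal sketch (the helper name `lcs_length` is illustrative):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists.
    dp[i][j] = LCS length of a[:i] and b[:j]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

ref = "a cat sits on the mat".split()
cand = "the cat was sitting on a mat".split()
print(lcs_length(ref, cand))  # 3 ("cat … on … mat")
```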
Metric for Evaluation of Translation with Explicit ORdering (Banerjee & Lavie, 2005). The most linguistically aware of these metrics. It aligns words between candidate and reference using three levels of matching, then scores the quality and order of that alignment.
Find the best one-to-one alignment between candidate and reference words, trying matchers in priority order: exact match first, then stem match (e.g. "won" ↔ "wins"), then WordNet synonym match.
Ref: "the team wins a goal and they celebrate"
Cand: "the team won a goal and they are celebrating"
After alignment, group the matched words into contiguous "chunks" — maximal runs of adjacent aligned words. Fewer chunks means the matched words appear in the original order without interruptions.
Good order (1 chunk): [the cat sits on the mat] → Pen ≈ 0.08
Scrambled (3 chunks): [the mat] is where [the cat] always [sits] → Pen ≈ 0.19
The cubic exponent makes this penalty gentle for small reorderings but harsh for major scrambles.
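A minimal sketch of the penalty and the score combination, using the original METEOR parameters (γ = 0.5, β = 3; later tuned versions use different values, so the example Pen values above may come from a different parameterization):

```python
def chunk_penalty(chunks, matches, gamma=0.5, beta=3.0):
    """METEOR fragmentation penalty: Pen = gamma * (chunks / matches)^beta."""
    return gamma * (chunks / matches) ** beta

def meteor_score(precision, recall, chunks, matches):
    """Recall-weighted harmonic mean (recall weighted 9:1), discounted by
    the fragmentation penalty."""
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    return f_mean * (1 - chunk_penalty(chunks, matches))

# The cubic exponent in action: 6 matched words split into 3 chunks instead
# of 1 multiplies the penalty by 3^3 = 27.
print(chunk_penalty(1, 6))
print(chunk_penalty(3, 6))
```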
Key advantages: Stem and synonym matching make METEOR robust to paraphrasing. "The person is drumming" partially matches "she plays the drums" (stem: drumming→drum), where pure n-gram metrics would see zero overlap. The recall-heavy weighting means missing reference content hurts much more than generating extra words — appropriate for captioning where completeness matters.
Consensus-based Image Description Evaluation (Vedantam et al., 2015). The only metric here designed specifically for image/video captioning. It combines TF-IDF weighting (from Section 02) with cosine similarity. The core idea: reward n-grams that are specific to this image, not generic filler.
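Putting the pieces together, a heavily simplified single-reference sketch: for each n from 1 to 4, build TF-IDF vectors over n-grams, take the cosine similarity, average across n, and scale by 10. (The real metric averages over multiple references per image and, in the CIDEr-D variant, adds clipping and a length penalty; the tiny `corpus` here is made up for illustration.)

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cider(candidate, reference, corpus, max_n=4):
    """Simplified single-reference CIDEr: average over n of the cosine
    similarity between TF-IDF n-gram vectors, scaled by 10."""
    docs = [d.split() for d in corpus]
    sims = []
    for n in range(1, max_n + 1):
        df = Counter(g for d in docs for g in set(ngram_counts(d, n)))
        idf = {g: math.log(len(docs) / df[g]) for g in df}
        def vec(tokens):
            return {g: c * idf.get(g, 0.0) for g, c in ngram_counts(tokens, n).items()}
        u, v = vec(candidate.split()), vec(reference.split())
        dot = sum(w * v.get(g, 0.0) for g, w in u.items())
        norm = lambda x: math.sqrt(sum(w * w for w in x.values())) or 1.0
        sims.append(dot / (norm(u) * norm(v)))
    return 10 * sum(sims) / max_n

corpus = ["a person is skateboarding on a railing",
          "a dog runs on grass",
          "a cat sits on the mat"]
print(cider("a person is skateboarding on a railing",
            "a person is skateboarding on a railing", corpus))  # 10.0 for an exact match
```

Note how generic n-grams ("a", "on") pick up an IDF of zero and drop out of the comparison entirely — exactly the "reward distinctive content" behavior described above.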
BLEU: "What fraction of the candidate's n-grams appear in the reference?"
ROUGE-L: "What's the longest in-order word overlap?"
METEOR: "Can we align words using stems and synonyms, preserving order?"
CIDEr: "Do the candidate and reference share the same distinctive n-grams?"
| Property | BLEU | ROUGE-L | METEOR | CIDEr |
|---|---|---|---|---|
| Based on | n-gram precision | LCS (F-measure) | word alignment | TF-IDF cosine |
| Precision vs Recall | Precision-heavy | Recall-weighted | Recall-heavy (9:1) | Balanced (cosine) |
| Handles synonyms? | No | No | Yes (WordNet + stems) | No |
| Word order? | Via higher n-grams | Yes (subsequence) | Yes (chunk penalty) | Via higher n-grams |
| Weighs word importance? | No — all equal | No — all equal | No — all equal | Yes — via IDF |
| Needs large corpus? | No | No | No | Yes (for stable IDF) |
| Length-invariant? | No (BP is crude) | Partially (F-measure) | Partially (F-measure) | Yes (cosine) |
| Scale | 0 – 1 | 0 – 1 | 0 – 1 | 0 – 10 |
| Originally designed for | Machine translation | Summarization | Machine translation | Image captioning ★ |
CIDEr is the standard primary metric in image and video captioning — it's the only one here designed for this task, and its TF-IDF weighting directly rewards content-specific descriptions over safe generic text. METEOR is a strong secondary metric — robust at any corpus size and handles paraphrasing through stem and synonym matching. ROUGE-L is useful as a sanity check for structural overlap. BLEU is worth reporting for completeness and cross-paper comparison, but is generally the least informative for captioning evaluation.