NLP Metrics for Image & Video Captioning

A visual guide to the building blocks — n-grams and TF-IDF — and the four metrics reported in nearly every captioning paper: BLEU, ROUGE-L, METEOR, and CIDEr.

Every metric below answers the same question in a different way: "How close is a generated caption to the ground truth?" They differ in what they count, how they weight it, and what failure modes they have.

01 N-grams

An n-gram is a contiguous sequence of n words extracted from a sentence. All four metrics below rely on comparing n-gram sets between a candidate caption and a reference caption.

Example sentence: "a cat sits on the mat"

Unigrams (n=1) — individual words:
a · cat · sits · on · the · mat

Bigrams (n=2) — consecutive pairs:
"a cat" · "cat sits" · "sits on" · "on the" · "the mat"

Trigrams (n=3):
"a cat sits" · "cat sits on" · "sits on the" · "on the mat"

4-grams (n=4):
"a cat sits on" · "cat sits on the" · "sits on the mat"

A sentence of length L has L − n + 1 n-grams. Higher-order n-grams capture more specific phrases — matching a 4-gram is strong evidence the candidate phrased something the same way as the reference.
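The extraction itself is a one-liner. A minimal Python sketch (the helper name `ngrams` is ours, not from any particular library):

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a cat sits on the mat".split()
bigrams = ngrams(tokens, 2)
# a 6-word sentence yields 6 - 2 + 1 = 5 bigrams
```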

02 TF-IDF

Term Frequency × Inverse Document Frequency — a weighting scheme that asks: "Is this word important in this specific sentence, and is it rare across the whole corpus?"

TF — Term Frequency

How often a word (or n-gram) appears in this sentence, normalized by total count.

TF for "the cat sat on the mat"
"the" → 2/6 ≈ 0.33
"cat" → 1/6 ≈ 0.17
"sat" → 1/6 ≈ 0.17
"on"  → 1/6 ≈ 0.17
"mat" → 1/6 ≈ 0.17
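The table above can be reproduced in a few lines (a sketch; `term_frequency` is an illustrative helper name):

```python
from collections import Counter

def term_frequency(tokens):
    """Raw count of each word divided by the sentence length."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

tf = term_frequency("the cat sat on the mat".split())
# "the" appears twice in six words -> 2/6
```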

IDF — Inverse Document Frequency

How rare is this word across all documents (captions) in the evaluation corpus? Words appearing everywhere get crushed; rare words get boosted.

IDF(ω) = log( |corpus| / number of docs containing ω )
// "the" appears in 950/1000 captions → IDF ≈ 0.05 (near zero, boring)
// "skateboard" appears in 12/1000 captions → IDF ≈ 4.4 (rare, distinctive!)
⚠ Corpus size matters. IDF is estimated from the evaluation corpus itself. With only 50 captions, IDF estimates are very noisy — a word appearing in 1 vs 2 captions changes the weight by 2×. With 1000+ captions, IDF stabilizes and gives reliable weights. This is a practical concern when using CIDEr (which relies on TF-IDF) on small evaluation sets.
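The IDF formula, sketched over a toy corpus of tokenized captions (the corpus contents and the helper name `idf` are our illustrative choices):

```python
import math

def idf(word, corpus):
    """log(|corpus| / number of documents containing the word)."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df) if df else 0.0

corpus = [caption.split() for caption in [
    "a cat sits on the mat",
    "the dog chases the ball",
    "a skateboarder grinds the rail",
    "the cat sleeps on the sofa",
]]
# "the" appears in all 4 captions -> log(4/4) = 0 (boring)
# "skateboarder" appears in 1 of 4 -> log(4) ≈ 1.39 (distinctive)
```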

TF × IDF: the full picture

Multiply them together. A word scores high only if it's frequent in this sentence AND rare in the corpus.

TF-IDF weights for a captioning corpus
"skateboard" → high TF-IDF ★
"grinding"   → high TF-IDF
"person"     → low (very common)
"the"        → ≈ 0
"a"          → ≈ 0

The caption "a person is skateboarding" is dominated by "skateboarding" — the only word that actually identifies the content.

From scalar to vector

TF-IDF gives a single number per n-gram. But to compare two entire sentences, we need to compare all their n-gram weights at once. The idea: treat every unique n-gram in the corpus vocabulary as a dimension of a vector space. Each sentence becomes a sparse vector — one entry per possible n-gram, with TF-IDF as the value (zero if that n-gram doesn't appear in the sentence).

Two sentences as TF-IDF vectors (unigrams only, simplified)

Each row is a vector. Each column is one dimension — a unique word from the corpus vocabulary. The value in each cell is the TF-IDF weight of that word in that sentence (0 if absent).

Word:        "a"    "person"  "skateboard"  "on"   "railing"  "cat"  "the"  "mat"

Candidate — "a person skateboards on a railing":
             0.01   0.18      1.42          0.03   1.85       0      0      0

Reference — "a person is skateboarding on the railing":
             0.01   0.15      1.30          0.02   1.65       0      0.01   0

Both vectors "point in the same direction" — their largest values are in the same dimensions (skateboard, railing). The near-zero entries ("a", "on", "the") barely affect the direction. Cosine similarity measures the angle between these two vectors: close to 1.0 if they emphasize the same distinctive words, close to 0.0 if they don't.

Key insight: Each sentence is a point in a high-dimensional space where the axes are n-grams. TF-IDF determines the coordinates. Cosine similarity compares directions rather than magnitudes — so a 5-word candidate and a 20-word reference can still score well if they emphasize the same distinctive n-grams. This is exactly what CIDEr computes.
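Putting the pieces together — sparse TF-IDF vectors plus cosine similarity — looks like this in Python (a sketch over a tiny made-up corpus; function names are ours):

```python
import math
from collections import Counter

def tfidf_vector(tokens, corpus):
    """Sparse TF-IDF vector as a dict {word: tf * idf}; absent words are implicit zeros."""
    counts = Counter(tokens)
    vec = {}
    for word, count in counts.items():
        df = sum(1 for doc in corpus if word in doc)
        vec[word] = (count / len(tokens)) * math.log(len(corpus) / df)
    return vec

def cosine(u, v):
    """Angle-based similarity between two sparse vectors, ignoring magnitude."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = [c.split() for c in [
    "a person skateboards on a railing",
    "a person is skateboarding on the railing",
    "a cat sits on the mat",
    "a dog runs in the park",
]]
cand = tfidf_vector(corpus[0], corpus)
ref = tfidf_vector(corpus[1], corpus)
similarity = cosine(cand, ref)  # dominated by the shared "person"/"railing" weights
```

Note that "a" gets weight exactly zero here (it appears in every caption), so it contributes nothing to the direction of either vector.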

03 BLEU

Bilingual Evaluation Understudy (Papineni et al., 2002). Originally from machine translation. Asks: "What fraction of the candidate's n-grams appear in the reference?" — it's a precision-focused metric.

How BLEU works
1
Count n-gram matches (clipped).
For each n-gram in the candidate, check if it appears in the reference. Clip the count so you can't get credit for repeating a word more times than it appears in the reference. This prevents gaming by repetition — generating "the the the the" can't score high.
2
Compute precision per n-gram order.
pn = (clipped matches for n-grams of length n) / (total n-grams of length n in candidate). Do this for n = 1, 2, 3, 4.
3
Brevity Penalty (BP).
If the candidate is shorter than the reference, apply an exponential penalty. This stops the metric from being gamed by very short, safe outputs.
4
Combine with geometric mean.
BLEU is the product of BP and the geometric mean of all pn. The geometric mean means if any pn = 0, BLEU = 0.
BLEU = BP · exp( Σ (1/4) · log pn ) for n = 1..4

BP = 1                    if candidate_len ≥ ref_len
BP = exp(1 − ref_len/candidate_len)   otherwise
Worked example

Reference: "a cat sits on the mat"

Candidate: "a cat is on the mat"

Unigram matches: a ✓ cat ✓ is ✗ on ✓ the ✓ mat ✓ → p1 = 5/6
Bigram matches: "a cat" ✓, "cat is" ✗, "is on" ✗, "on the" ✓, "the mat" ✓ → p2 = 3/5
Trigrams: "a cat is" ✗, "cat is on" ✗, "is on the" ✗, "on the mat" ✓ → p3 = 1/4
4-grams: none match → p4 = 0

BLEU = 0 because p4 = 0 kills the geometric mean. In practice, smoothing is applied to avoid this.
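The worked example can be reproduced with a minimal single-reference BLEU (no smoothing; a sketch, not the official implementation):

```python
import math
from collections import Counter

def modified_precision(candidate, reference, n):
    """Clipped n-gram matches / total candidate n-grams of order n."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:          # geometric mean collapses to zero
        return 0.0
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "a cat sits on the mat".split()
cand = "a cat is on the mat".split()
# p1 = 5/6, p2 = 3/5, p3 = 1/4, p4 = 0 -> BLEU = 0
```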

Weakness for captioning: BLEU is precision-only — it doesn't penalize missing important reference words. A candidate that says "a cat" gets perfect unigram precision despite ignoring "sits on the mat." The brevity penalty partially compensates but it's a blunt instrument. BLEU also treats all n-grams equally — "the" counts the same as "skateboard."

04 ROUGE-L

Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004). The ROUGE-L variant uses the Longest Common Subsequence (LCS). Unlike n-grams, a subsequence doesn't need to be contiguous — words just need to appear in the same order.

LCS example

Reference: "a cat sits on the mat"

Candidate: "the cat was sitting on a mat"

LCS:

cat ··· on ··· mat

LCS length = 3. These words appear in both sentences in the same relative order, but not necessarily next to each other.

Plcs = LCS(c, r) / len(c)    // precision — what fraction of the candidate is "in order"
Rlcs = LCS(c, r) / len(r)    // recall — what fraction of the reference is covered

ROUGE-L = (1 + β²) · P · R / (R + β² · P)    // F-measure, β typically 1.2

Intuition: LCS captures structural similarity — are the same words appearing in roughly the same order? — without demanding exact contiguous matches. It's more forgiving than n-gram overlap but coarser than the other metrics. It treats all words equally and doesn't understand synonyms ("sits" ≠ "sitting") or word importance.
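The LCS length comes from the classic dynamic program; a minimal ROUGE-L sketch (β = 1.2, the value commonly used in captioning evaluation):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

cand = "the cat was sitting on a mat".split()
ref = "a cat sits on the mat".split()
# LCS = cat ... on ... mat -> length 3
```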

05 METEOR

Metric for Evaluation of Translation with Explicit ORdering (Banerjee & Lavie, 2005). The most linguistically aware of these metrics. It aligns words between candidate and reference using three levels of matching, then scores the quality and order of that alignment.

Stage 1 — Word Alignment

Find the best one-to-one alignment between candidate and reference words, trying matchers in priority order:

Three match types (in priority order)

Ref: "the team wins a goal and they celebrate"

Cand: "the team won a goal and they are celebrating"

the ✓ exact · team ✓ exact · won ✓ stem · a ✓ exact · goal ✓ exact · and ✓ exact · they ✓ exact · are ✗ no match · celebrating ✓ stem

Exact match — identical surface forms.
Stem match — won→win = wins→win, celebrating→celebrate.
Synonym match — via WordNet, e.g. "car" ↔ "automobile".
No match — "are" has no counterpart in the reference.

Stage 2 — Scoring

P = matched / len(candidate)
R = matched / len(reference)

F = (10 · P · R) / (R + 9·P)    // recall-weighted harmonic mean
// 9:1 ratio means missing a reference word hurts ~9× more than an extra candidate word

// Fragmentation penalty — fewer chunks = better word order preservation
Pen = 0.5 · (chunks / matched)³

METEOR = F · (1 − Pen)
Chunks — why word order matters

After alignment, group the matched words into contiguous "chunks" — maximal runs of adjacent aligned words. Fewer chunks means the matched words appear in the original order without interruptions.

Good order (1 chunk, 6 matches): [the cat sits on the mat] → Pen = 0.5 · (1/6)³ ≈ 0.002

Scrambled (3 chunks, 5 matches): [the mat] is where [the cat] always [sits] → Pen = 0.5 · (3/5)³ ≈ 0.11

The cubic exponent makes this penalty gentle for small reorderings but harsh for major scrambles.
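A simplified exact-match-only sketch of the scoring stage (real METEOR also stems, checks WordNet synonyms, and searches for the minimum-chunk alignment; the greedy left-to-right alignment here is an approximation):

```python
def meteor_exact(candidate, reference):
    """METEOR-style score with exact matching and a greedy one-to-one alignment."""
    used = [False] * len(reference)
    alignment = []                       # (candidate index, reference index) pairs
    for i, word in enumerate(candidate):
        for j, ref_word in enumerate(reference):
            if not used[j] and word == ref_word:
                used[j] = True
                alignment.append((i, j))
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    p, r = m / len(candidate), m / len(reference)
    f = 10 * p * r / (r + 9 * p)         # recall-weighted harmonic mean
    chunks = 1                           # maximal runs adjacent in both sentences
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f * (1 - penalty)
```

Even a perfect match scores slightly below 1.0 here, because a single chunk still incurs the small penalty 0.5 · (1/m)³.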

Key advantages: Stem and synonym matching make METEOR robust to paraphrasing. "The person is drumming" partially matches "she plays the drums" (stem: drumming→drum), where pure n-gram metrics would see zero overlap. The recall-heavy weighting means missing reference content hurts much more than generating extra words — appropriate for captioning where completeness matters.

06 CIDEr

Consensus-based Image Description Evaluation (Vedantam et al., 2015). The only metric here designed specifically for image/video captioning. It combines TF-IDF weighting (from Section 02) with cosine similarity. The core idea: reward n-grams that are specific to this image, not generic filler.

How CIDEr works
1
Build TF-IDF vectors for n = 1, 2, 3, 4.
For each n-gram order, represent both the candidate and reference as vectors (see Section 02 — "From scalar to vector"). Each dimension is one unique n-gram from the corpus vocabulary; its value is the TF-IDF weight of that n-gram in the sentence (0 if absent). "A person" gets low weight; "skateboard on a railing" gets high weight.
2
Cosine similarity per n-gram order.
CIDErn = cosine(candidate vector, reference vector). Are the two sparse vectors pointing in the same direction in n-gram space? With multiple references, average the cosine over all refs.
3
Average over n-gram orders, scale by 10.
CIDEr = 10 × (1/4) × Σ CIDErn for n=1..4. Perfect score = 10.0.
// For n-gram ω in sentence s, corpus D:
gω(s) = TF(ω, s) · IDF(ω, D)

// Collect all g_ω(s) into a vector gn(s) — one entry per unique n-gram in the vocabulary

// Per n-gram order, cosine similarity (averaged over references):
CIDErn(c, R) = (1/|R|) Σr∈R ( gn(c) · gn(r) ) / ( ‖gn(c)‖ · ‖gn(r)‖ )

// Final score:
CIDEr(c, R) = (10/4) Σn=1..4 CIDErn(c, R)
Why cosine similarity? Cosine measures the angle between two vectors, ignoring their magnitude. This makes it length-invariant — a verbose 30-word candidate and a terse 8-word reference can still score well if they emphasize the same distinctive n-grams. This is a major advantage over BLEU, where length mismatch directly penalizes precision.
⚠ The IDF dependency. Because IDF is computed from the evaluation corpus, CIDEr scores are only meaningful when the corpus is large enough for reliable frequency estimates. With a few dozen captions, the difference between a word appearing in 1 vs 2 captions causes a 2× swing in IDF weight. With thousands of captions, the IDF landscape stabilizes and CIDEr becomes the most informative captioning metric.
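The full pipeline, combining the TF-IDF vectors and cosine similarity from Section 02, can be sketched as follows (the widely used CIDEr-D variant additionally applies a Gaussian length penalty and clipping, omitted here):

```python
import math
from collections import Counter

def tf_ngrams(tokens, n):
    """Normalized n-gram counts of one sentence."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return {g: c / len(grams) for g, c in Counter(grams).items()} if grams else {}

def cider(candidate, references, corpus, max_n=4):
    """corpus: list of tokenized sentences, used only to estimate IDF."""
    score = 0.0
    for n in range(1, max_n + 1):
        doc_grams = [set(tf_ngrams(doc, n)) for doc in corpus]
        def vector(tokens):
            vec = {}
            for gram, tf in tf_ngrams(tokens, n).items():
                df = sum(1 for grams in doc_grams if gram in grams)
                vec[gram] = tf * math.log(len(corpus) / df) if df else 0.0
            return vec
        c = vector(candidate)
        sims = []
        for ref in references:
            r = vector(ref)
            dot = sum(w * r.get(g, 0.0) for g, w in c.items())
            nc = math.sqrt(sum(w * w for w in c.values()))
            nr = math.sqrt(sum(w * w for w in r.values()))
            sims.append(dot / (nc * nr) if nc and nr else 0.0)
        score += sum(sims) / len(sims)       # average over references
    return 10 * score / max_n                # average over n, scale by 10

corpus = [c.split() for c in [
    "a cat sits on the mat",
    "a dog runs in the park",
    "the bird flies away fast",
]]
perfect = cider(corpus[0], [corpus[0]], corpus)  # identical candidate and reference
```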

07 Side-by-Side Comparison

BLEU
n-gram precision

"What fraction of the candidate's n-grams appear in the reference?"

ROUGE-L
subsequence

"What's the longest in-order word overlap?"

METEOR
alignment

"Can we align words using stems and synonyms, preserving order?"

CIDEr
TF-IDF cosine

"Do the candidate and reference share the same distinctive n-grams?"

Properties at a glance
Property                  BLEU                  ROUGE-L                METEOR                   CIDEr
Based on                  n-gram precision      LCS (F-measure)        word alignment           TF-IDF cosine
Precision vs Recall       Precision-heavy       Recall-weighted        Recall-heavy (9:1)       Balanced (cosine)
Handles synonyms?         No                    No                     Yes (WordNet + stems)    No
Word order?               Via higher n-grams    Yes (subsequence)      Yes (chunk penalty)      Via higher n-grams
Weighs word importance?   No — all equal        No — all equal         No — all equal           Yes — via IDF
Needs large corpus?       No                    No                     No                       Yes (for stable IDF)
Length-invariant?         No (BP is crude)      Partially (F-measure)  Partially (F-measure)    Yes (cosine)
Scale                     0 – 1                 0 – 1                  0 – 1                    0 – 10
Originally designed for   Machine translation   Summarization          Machine translation      Image captioning ★

Which metric to trust for captioning?

CIDEr is the standard primary metric in image and video captioning — it's the only one here designed for this task, and its TF-IDF weighting directly rewards content-specific descriptions over safe generic text. METEOR is a strong secondary metric — robust at any corpus size and handles paraphrasing through stem and synonym matching. ROUGE-L is useful as a sanity check for structural overlap. BLEU is worth reporting for completeness and cross-paper comparison, but is generally the least informative for captioning evaluation.

