NLP Metrics for Image & Video Captioning

A visual guide to the building blocks — n-grams and TF-IDF — and the four metrics reported in nearly every captioning paper: BLEU, ROUGE-L, METEOR, and CIDEr.

Every metric below answers the same question in a different way: "How close is a generated caption to the ground truth?" They differ in what they count, how they weight it, and what failure modes they have.

01 N-grams

An n-gram is a contiguous sequence of n words extracted from a sentence. All four metrics below rely on comparing n-gram sets between a candidate caption and a reference caption.

Example sentence: "a cat sits on the mat"

Unigrams (n=1) — individual words:
a · cat · sits · on · the · mat

Bigrams (n=2) — consecutive pairs:
"a cat" · "cat sits" · "sits on" · "on the" · "the mat"

Trigrams (n=3):
"a cat sits" · "cat sits on" · "sits on the" · "on the mat"

4-grams (n=4):
"a cat sits on" · "cat sits on the" · "sits on the mat"

A sentence of length L has L − n + 1 n-grams. Higher-order n-grams capture more specific phrases — matching a 4-gram is strong evidence the candidate phrased something the same way as the reference.
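The extraction itself is a one-liner. A minimal Python sketch (the helper name `ngrams` is ours, not from any particular library):

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a cat sits on the mat".split()
bigrams = ngrams(tokens, 2)
# a 6-word sentence yields 6 - 2 + 1 = 5 bigrams
```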

02 TF-IDF

Term Frequency × Inverse Document Frequency — a weighting scheme that asks: "Is this word important in this specific sentence, and is it rare across the whole corpus?"

TF — Term Frequency

How often a word (or n-gram) appears in this sentence, normalized by total count.

TF for "the cat sat on the mat"
"the" → 2/6 ≈ 0.33
"cat" → 1/6 ≈ 0.17
"sat" → 1/6 ≈ 0.17
"on"  → 1/6 ≈ 0.17
"mat" → 1/6 ≈ 0.17
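The table above can be reproduced in a few lines (a sketch; `term_frequency` is an illustrative helper name):

```python
from collections import Counter

def term_frequency(tokens):
    """Raw count of each word divided by the sentence length."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

tf = term_frequency("the cat sat on the mat".split())
# "the" appears twice in six words -> 2/6
```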

IDF — Inverse Document Frequency

How rare is this word across all documents (captions) in the evaluation corpus? Words appearing everywhere get crushed; rare words get boosted.

IDF(ω) = log( |corpus| / number of docs containing ω )
// "the" appears in 950/1000 captions → IDF ≈ 0.05 (near zero, boring)
// "skateboard" appears in 12/1000 captions → IDF ≈ 4.4 (rare, distinctive!)
⚠ Corpus size matters. IDF is estimated from the evaluation corpus itself. With only 50 captions, IDF estimates are very noisy — a word appearing in 1 vs 2 captions changes the weight by 2×. With 1000+ captions, IDF stabilizes and gives reliable weights. This is a practical concern when using CIDEr (which relies on TF-IDF) on small evaluation sets.
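The IDF formula, sketched over a toy corpus of tokenized captions (the corpus contents and the helper name `idf` are our illustrative choices):

```python
import math

def idf(word, corpus):
    """log(|corpus| / number of documents containing the word)."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df) if df else 0.0

corpus = [caption.split() for caption in [
    "a cat sits on the mat",
    "the dog chases the ball",
    "a skateboarder grinds the rail",
    "the cat sleeps on the sofa",
]]
# "the" appears in all 4 captions -> log(4/4) = 0 (boring)
# "skateboarder" appears in 1 of 4 -> log(4) ≈ 1.39 (distinctive)
```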

TF × IDF: the full picture

Multiply them together. A word scores high only if it's frequent in this sentence AND rare in the corpus.

TF-IDF weights for a captioning corpus
"skateboard" → high TF-IDF ★
"grinding"   → high TF-IDF
"person"     → low (very common)
"the"        → ≈ 0
"a"          → ≈ 0

The caption "a person is skateboarding" is dominated by "skateboarding" — the only word that actually identifies the content.

From scalar to vector

TF-IDF gives a single number per n-gram. But to compare two entire sentences, we need to compare all their n-gram weights at once. The idea: treat every unique n-gram in the corpus vocabulary as a dimension of a vector space. Each sentence becomes a sparse vector — one entry per possible n-gram, with TF-IDF as the value (zero if that n-gram doesn't appear in the sentence).

Two sentences as TF-IDF vectors (unigrams only, simplified)

Each row is a vector. Each column is one dimension — a unique word from the corpus vocabulary. The value in each cell is the TF-IDF weight of that word in that sentence (0 if absent).

Word:        "a"    "person"  "skateboard"  "on"   "railing"  "cat"  "the"  "mat"

Candidate — "a person skateboards on a railing":
             0.01   0.18      1.42          0.03   1.85       0      0      0

Reference — "a person is skateboarding on the railing":
             0.01   0.15      1.30          0.02   1.65       0      0.01   0

Both vectors "point in the same direction" — their largest values are in the same dimensions (skateboard, railing). The near-zero entries ("a", "on", "the") barely affect the direction. Cosine similarity measures the angle between these two vectors: close to 1.0 if they emphasize the same distinctive words, close to 0.0 if they don't.

Key insight: Each sentence is a point in a high-dimensional space where the axes are n-grams. TF-IDF determines the coordinates. Cosine similarity compares directions rather than magnitudes — so a 5-word candidate and a 20-word reference can still score well if they emphasize the same distinctive n-grams. This is exactly what CIDEr computes.
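Putting the pieces together — sparse TF-IDF vectors plus cosine similarity — looks like this in Python (a sketch over a tiny made-up corpus; function names are ours):

```python
import math
from collections import Counter

def tfidf_vector(tokens, corpus):
    """Sparse TF-IDF vector as a dict {word: tf * idf}; absent words are implicit zeros."""
    counts = Counter(tokens)
    vec = {}
    for word, count in counts.items():
        df = sum(1 for doc in corpus if word in doc)
        vec[word] = (count / len(tokens)) * math.log(len(corpus) / df)
    return vec

def cosine(u, v):
    """Angle-based similarity between two sparse vectors, ignoring magnitude."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = [c.split() for c in [
    "a person skateboards on a railing",
    "a person is skateboarding on the railing",
    "a cat sits on the mat",
    "a dog runs in the park",
]]
cand = tfidf_vector(corpus[0], corpus)
ref = tfidf_vector(corpus[1], corpus)
similarity = cosine(cand, ref)  # dominated by the shared "person"/"railing" weights
```

Note that "a" gets weight exactly zero here (it appears in every caption), so it contributes nothing to the direction of either vector.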

03 BLEU

Bilingual Evaluation Understudy (Papineni et al., 2002). Originally from machine translation. Asks: "What fraction of the candidate's n-grams appear in the reference?" — it's a precision-focused metric.

How BLEU works
1
Count n-gram matches (clipped).
For each n-gram in the candidate, check if it appears in the reference. Clip the count so you can't get credit for repeating a word more times than it appears in the reference. This prevents gaming by repetition — generating "the the the the" can't score high.
2
Compute precision per n-gram order.
pn = (clipped matches for n-grams of length n) / (total n-grams of length n in candidate). Do this for n = 1, 2, 3, 4.
3
Brevity Penalty (BP).
If the candidate is shorter than the reference, apply an exponential penalty. This stops the metric from being gamed by very short, safe outputs.
4
Combine with geometric mean.
BLEU is the product of BP and the geometric mean of all pn. The geometric mean means if any pn = 0, BLEU = 0.
BLEU = BP · exp( Σ (1/4) · log pn ) for n = 1..4

BP = 1                    if candidate_len ≥ ref_len
BP = exp(1 − ref_len/candidate_len)   otherwise
Worked example

Reference: "a cat sits on the mat"

Candidate: "a cat is on the mat"

Unigram matches: a ✓ cat ✓ is ✗ on ✓ the ✓ mat ✓ → p1 = 5/6
Bigram matches: "a cat" ✓, "cat is" ✗, "is on" ✗, "on the" ✓, "the mat" ✓ → p2 = 3/5
Trigrams: "a cat is" ✗, "cat is on" ✗, "is on the" ✗, "on the mat" ✓ → p3 = 1/4
4-grams: none match → p4 = 0

BLEU = 0 because p4 = 0 kills the geometric mean. In practice, smoothing is applied to avoid this.
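The worked example can be reproduced with a minimal single-reference BLEU (no smoothing; a sketch, not the official implementation):

```python
import math
from collections import Counter

def modified_precision(candidate, reference, n):
    """Clipped n-gram matches / total candidate n-grams of order n."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:          # geometric mean collapses to zero
        return 0.0
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "a cat sits on the mat".split()
cand = "a cat is on the mat".split()
# p1 = 5/6, p2 = 3/5, p3 = 1/4, p4 = 0 -> BLEU = 0
```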

Weakness for captioning: BLEU is precision-only — it doesn't penalize missing important reference words. A candidate that says "a cat" gets perfect unigram precision despite ignoring "sits on the mat." The brevity penalty partially compensates but it's a blunt instrument. BLEU also treats all n-grams equally — "the" counts the same as "skateboard."

04 ROUGE-L

Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004). The ROUGE-L variant uses the Longest Common Subsequence (LCS). Unlike n-grams, a subsequence doesn't need to be contiguous — words just need to appear in the same order.

LCS example

Reference: "a cat sits on the mat"

Candidate: "the cat was sitting on a mat"

LCS:

cat ··· on ··· mat

LCS length = 3. These words appear in both sentences in the same relative order, but not necessarily next to each other.

Plcs = LCS(c, r) / len(c)    // precision — what fraction of the candidate is "in order"
Rlcs = LCS(c, r) / len(r)    // recall — what fraction of the reference is covered

ROUGE-L = (1 + β²) · P · R / (R + β² · P)    // F-measure, β typically 1.2

Intuition: LCS captures structural similarity — are the same words appearing in roughly the same order? — without demanding exact contiguous matches. It's more forgiving than n-gram overlap but coarser than the other metrics. It treats all words equally and doesn't understand synonyms ("sits" ≠ "sitting") or word importance.
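The LCS length comes from the classic dynamic program; a minimal ROUGE-L sketch (β = 1.2, the value commonly used in captioning evaluation):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

cand = "the cat was sitting on a mat".split()
ref = "a cat sits on the mat".split()
# LCS = cat ... on ... mat -> length 3
```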

05 METEOR

Metric for Evaluation of Translation with Explicit ORdering (Banerjee & Lavie, 2005). The most linguistically aware of these metrics. It aligns words between candidate and reference using three levels of matching, then scores the quality and order of that alignment.

Stage 1 — Word Alignment

Find the best one-to-one alignment between candidate and reference words, trying matchers in priority order:

Three match types (in priority order)

Ref: "the team wins a goal and they celebrate"

Cand: "the team won a goal and they are celebrating"

the ✓ exact · team ✓ exact · won ✓ stem · a ✓ exact · goal ✓ exact · and ✓ exact · they ✓ exact · are ✗ no match · celebrating ✓ stem

Exact match — identical surface forms.
Stem match — won→win = wins→win, celebrating→celebrate.
Synonym match — via WordNet, e.g. "car" ↔ "automobile".
No match — "are" has no counterpart in the reference.

Stage 2 — Scoring

P = matched / len(candidate)
R = matched / len(reference)

F = (10 · P · R) / (R + 9·P)    // recall-weighted harmonic mean
// 9:1 ratio means missing a reference word hurts ~9× more than an extra candidate word

// Fragmentation penalty — fewer chunks = better word order preservation
Pen = 0.5 · (chunks / matched)³

METEOR = F · (1 − Pen)
Chunks — why word order matters

After alignment, group the matched words into contiguous "chunks" — maximal runs of adjacent aligned words. Fewer chunks means the matched words appear in the original order without interruptions.

Good order (1 chunk, 6 matches): [the cat sits on the mat] → Pen = 0.5 · (1/6)³ ≈ 0.002

Scrambled (3 chunks, 5 matches): [the mat] is where [the cat] always [sits] → Pen = 0.5 · (3/5)³ ≈ 0.11

The cubic exponent makes this penalty gentle for small reorderings but harsh for major scrambles.
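A simplified exact-match-only sketch of the scoring stage (real METEOR also stems, checks WordNet synonyms, and searches for the minimum-chunk alignment; the greedy left-to-right alignment here is an approximation):

```python
def meteor_exact(candidate, reference):
    """METEOR-style score with exact matching and a greedy one-to-one alignment."""
    used = [False] * len(reference)
    alignment = []                       # (candidate index, reference index) pairs
    for i, word in enumerate(candidate):
        for j, ref_word in enumerate(reference):
            if not used[j] and word == ref_word:
                used[j] = True
                alignment.append((i, j))
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    p, r = m / len(candidate), m / len(reference)
    f = 10 * p * r / (r + 9 * p)         # recall-weighted harmonic mean
    chunks = 1                           # maximal runs adjacent in both sentences
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f * (1 - penalty)
```

Even a perfect match scores slightly below 1.0 here, because a single chunk still incurs the small penalty 0.5 · (1/m)³.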

Key advantages: Stem and synonym matching make METEOR robust to paraphrasing. "The person is drumming" partially matches "she plays the drums" (stem: drumming→drum), where pure n-gram metrics would see zero overlap. The recall-heavy weighting means missing reference content hurts much more than generating extra words — appropriate for captioning where completeness matters.

06 CIDEr

Consensus-based Image Description Evaluation (Vedantam et al., 2015). The only metric here designed specifically for image/video captioning. It combines TF-IDF weighting (from Section 02) with cosine similarity. The core idea: reward n-grams that are specific to this image, not generic filler.

How CIDEr works
1
Build TF-IDF vectors for n = 1, 2, 3, 4.
For each n-gram order, represent both the candidate and reference as vectors (see Section 02 — "From scalar to vector"). Each dimension is one unique n-gram from the corpus vocabulary; its value is the TF-IDF weight of that n-gram in the sentence (0 if absent). "A person" gets low weight; "skateboard on a railing" gets high weight.
2
Cosine similarity per n-gram order.
CIDErn = cosine(candidate vector, reference vector). Are the two sparse vectors pointing in the same direction in n-gram space? With multiple references, average the cosine over all refs.
3
Average over n-gram orders, scale by 10.
CIDEr = 10 × (1/4) × Σ CIDErn for n=1..4. Perfect score = 10.0.
// For n-gram ω in sentence s, corpus D:
gω(s) = TF(ω, s) · IDF(ω, D)

// Collect all g_ω(s) into a vector gn(s) — one entry per unique n-gram in the vocabulary

// Per n-gram order, cosine similarity (averaged over references):
CIDErn(c, R) = (1/|R|) Σr∈R ( gn(c) · gn(r) ) / ( ‖gn(c)‖ · ‖gn(r)‖ )

// Final score:
CIDEr(c, R) = (10/4) Σn=1..4 CIDErn(c, R)
Why cosine similarity? Cosine measures the angle between two vectors, ignoring their magnitude. This makes it length-invariant — a verbose 30-word candidate and a terse 8-word reference can still score well if they emphasize the same distinctive n-grams. This is a major advantage over BLEU, where length mismatch directly penalizes precision.
⚠ The IDF dependency. Because IDF is computed from the evaluation corpus, CIDEr scores are only meaningful when the corpus is large enough for reliable frequency estimates. With a few dozen captions, the difference between a word appearing in 1 vs 2 captions causes a 2× swing in IDF weight. With thousands of captions, the IDF landscape stabilizes and CIDEr becomes the most informative captioning metric.
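The full pipeline, combining the TF-IDF vectors and cosine similarity from Section 02, can be sketched as follows (the widely used CIDEr-D variant additionally applies a Gaussian length penalty and clipping, omitted here):

```python
import math
from collections import Counter

def tf_ngrams(tokens, n):
    """Normalized n-gram counts of one sentence."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return {g: c / len(grams) for g, c in Counter(grams).items()} if grams else {}

def cider(candidate, references, corpus, max_n=4):
    """corpus: list of tokenized sentences, used only to estimate IDF."""
    score = 0.0
    for n in range(1, max_n + 1):
        doc_grams = [set(tf_ngrams(doc, n)) for doc in corpus]
        def vector(tokens):
            vec = {}
            for gram, tf in tf_ngrams(tokens, n).items():
                df = sum(1 for grams in doc_grams if gram in grams)
                vec[gram] = tf * math.log(len(corpus) / df) if df else 0.0
            return vec
        c = vector(candidate)
        sims = []
        for ref in references:
            r = vector(ref)
            dot = sum(w * r.get(g, 0.0) for g, w in c.items())
            nc = math.sqrt(sum(w * w for w in c.values()))
            nr = math.sqrt(sum(w * w for w in r.values()))
            sims.append(dot / (nc * nr) if nc and nr else 0.0)
        score += sum(sims) / len(sims)       # average over references
    return 10 * score / max_n                # average over n, scale by 10

corpus = [c.split() for c in [
    "a cat sits on the mat",
    "a dog runs in the park",
    "the bird flies away fast",
]]
perfect = cider(corpus[0], [corpus[0]], corpus)  # identical candidate and reference
```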

07 Side-by-Side Comparison

BLEU
n-gram precision

"What fraction of the candidate's n-grams appear in the reference?"

ROUGE-L
subsequence

"What's the longest in-order word overlap?"

METEOR
alignment

"Can we align words using stems and synonyms, preserving order?"

CIDEr
TF-IDF cosine

"Do the candidate and reference share the same distinctive n-grams?"

Properties at a glance
Property                  BLEU                  ROUGE-L                METEOR                   CIDEr
Based on                  n-gram precision      LCS (F-measure)        word alignment           TF-IDF cosine
Precision vs Recall       Precision-heavy       Recall-weighted        Recall-heavy (9:1)       Balanced (cosine)
Handles synonyms?         No                    No                     Yes (WordNet + stems)    No
Word order?               Via higher n-grams    Yes (subsequence)      Yes (chunk penalty)      Via higher n-grams
Weighs word importance?   No — all equal        No — all equal         No — all equal           Yes — via IDF
Needs large corpus?       No                    No                     No                       Yes (for stable IDF)
Length-invariant?         No (BP is crude)      Partially (F-measure)  Partially (F-measure)    Yes (cosine)
Scale                     0 – 1                 0 – 1                  0 – 1                    0 – 10
Originally designed for   Machine translation   Summarization          Machine translation      Image captioning ★

Which metric to trust for captioning?

CIDEr is the standard primary metric in image and video captioning — it's the only one here designed for this task, and its TF-IDF weighting directly rewards content-specific descriptions over safe generic text. METEOR is a strong secondary metric — robust at any corpus size and handles paraphrasing through stem and synonym matching. ROUGE-L is useful as a sanity check for structural overlap. BLEU is worth reporting for completeness and cross-paper comparison, but is generally the least informative for captioning evaluation.

