nltk.translate.bleu

nltk.translate.bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=None, auto_reweigh=False)

Calculate BLEU score (Bilingual Evaluation Understudy) from Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “BLEU: a method for automatic evaluation of machine translation.” In Proceedings of ACL. https://www.aclweb.org/anthology/P02-1040.pdf
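
In the notation of that paper, the score is a brevity penalty BP times a weighted geometric mean of the modified n-gram precisions p_n, with weights w_n (this restates the paper's standard definition):

$$\text{BLEU} = \text{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \qquad \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

where c is the hypothesis length and r is the effective (closest) reference length.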

>>> from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
>>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...               'ensures', 'that', 'the', 'military', 'always',
...               'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> hypothesis2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
...               'forever', 'hearing', 'the', 'activity', 'guidebook',
...               'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will', 'forever',
...               'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> sentence_bleu([reference1, reference2, reference3], hypothesis1)  # doctest: +ELLIPSIS
0.5045...

If there is no n-gram overlap for any order of n-grams, BLEU returns the value 0. This is because the precision for an n-gram order with no overlap is 0, and the geometric mean in the final BLEU score computation multiplies this 0 with the precisions of the other n-gram orders, giving 0 regardless of how high those other precisions are. The following example has zero 3-gram and 4-gram overlaps:

>>> round(sentence_bleu([reference1, reference2, reference3], hypothesis2), 4)
0.0
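
The zero precisions can be checked directly with the modified_precision helper in nltk.translate.bleu_score, the function the BLEU implementation uses to compute each p_n:

>>> from nltk.translate.bleu_score import modified_precision
>>> float(modified_precision([reference1, reference2, reference3], hypothesis2, n=3))
0.0
>>> float(modified_precision([reference1, reference2, reference3], hypothesis2, n=4))
0.0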

To avoid this harsh behaviour when no n-gram overlaps are found, a smoothing function can be used.

>>> chencherry = SmoothingFunction()
>>> sentence_bleu([reference1, reference2, reference3], hypothesis2,
...     smoothing_function=chencherry.method1)  # doctest: +ELLIPSIS
0.0370...
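
Alternatively, restricting the evaluation to lower-order n-grams sidesteps the zero 3-gram and 4-gram precisions entirely; e.g. BLEU-2 for hypothesis2 (the expected output below is hand-computed from the formula above, so treat it as illustrative):

>>> sentence_bleu([reference1, reference2, reference3], hypothesis2,
...     weights=(1./2., 1./2.))  # doctest: +ELLIPSIS
0.1817...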

The default BLEU calculates a score for up to 4-grams using uniform weights (this is called BLEU-4). To evaluate your translations with higher- or lower-order n-grams, use customized weights. E.g., when accounting for up to 5-grams with uniform weights (this is called BLEU-5), use:

>>> weights = (1./5., 1./5., 1./5., 1./5., 1./5.)
>>> sentence_bleu([reference1, reference2, reference3], hypothesis1, weights)  # doctest: +ELLIPSIS
0.3920...

Multiple BLEU scores can be computed at once, by supplying a list of weights. E.g. for computing BLEU-2, BLEU-3 and BLEU-4 in one computation, use:

>>> weights = [
...     (1./2., 1./2.),
...     (1./3., 1./3., 1./3.),
...     (1./4., 1./4., 1./4., 1./4.)
... ]
>>> sentence_bleu([reference1, reference2, reference3], hypothesis1, weights)  # doctest: +ELLIPSIS
[0.7453..., 0.6240..., 0.5045...]

Parameters
  • references (list(list(str))) – reference sentences

  • hypothesis (list(str)) – a hypothesis sentence

  • weights (tuple(float) / list(tuple(float))) – weights for unigrams, bigrams, trigrams and so on (one or a list of weights)

  • smoothing_function (SmoothingFunction) – smoothing method applied to the n-gram precisions before the geometric mean; defaults to no smoothing (SmoothingFunction().method0)

  • auto_reweigh (bool) – Option to re-normalize the weights uniformly: with the default weights, a hypothesis shorter than the highest n-gram order is scored with uniform weights over only the orders it can actually contain (see the example after this section).

Returns

The sentence-level BLEU score. Returns a list if multiple weights were supplied.

Return type

float / list(float)
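
For example, with auto_reweigh a two-token hypothesis scored against an identical reference is evaluated as BLEU-2 instead of being penalized for its empty 3-gram and 4-gram counts (a minimal sketch with toy sentences):

>>> sentence_bleu([['the', 'cat']], ['the', 'cat'], auto_reweigh=True)
1.0

Without auto_reweigh, the same pair has zero 3-gram and 4-gram counts, and the unsmoothed score collapses to (effectively) 0.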