nltk.translate.nist_score module

NIST score implementation.

nltk.translate.nist_score.sentence_nist(references, hypothesis, n=5)[source]

Calculate the NIST score, following George Doddington. 2002. “Automatic evaluation of machine translation quality using n-gram co-occurrence statistics.” Proceedings of HLT. Morgan Kaufmann Publishers Inc.

DARPA commissioned NIST to develop an MT evaluation facility based on the BLEU score. NIST maintains an official script that computes both the BLEU and NIST scores. The main differences between the two scores are:

  • BLEU uses the geometric mean of the n-gram overlaps, whereas NIST uses the arithmetic mean.

  • NIST has a different brevity penalty.

  • The NIST score from the official script relies on a self-contained tokenizer.

Note: The official script includes a smoothing function for the BLEU score that is NOT used in the NIST score computation.
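
The practical effect of the first difference can be seen with a minimal sketch. It is not the NIST formula itself, only the averaging behaviour: the `overlaps` values below are made-up per-order scores, and both helper functions are hypothetical names introduced for illustration.

```python
import math

def geometric_mean(values):
    """Geometric mean; any zero entry forces the result to zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

def arithmetic_mean(values):
    """Arithmetic mean; a zero entry only lowers the result."""
    return sum(values) / len(values)

# Made-up overlap scores for n-gram orders 1..4 (illustrative only).
overlaps = [0.8, 0.5, 0.25, 0.0]

print(geometric_mean(overlaps))   # collapses because the 4-gram overlap is zero
print(arithmetic_mean(overlaps))  # still rewards the lower-order matches
```

With a geometric mean, a single order with no matches zeroes the whole score, which is one reason BLEU implementations often add smoothing. An arithmetic mean degrades more gradually.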

>>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...               'ensures', 'that', 'the', 'military', 'always',
...               'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> hypothesis2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
...               'forever', 'hearing', 'the', 'activity', 'guidebook',
...               'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will', 'forever',
...               'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> sentence_nist([reference1, reference2, reference3], hypothesis1) 
>>> sentence_nist([reference1, reference2, reference3], hypothesis2) 
  • references (list(list(str))) – reference sentences

  • hypothesis (list(str)) – a hypothesis sentence

  • n (int) – highest n-gram order

nltk.translate.nist_score.corpus_nist(list_of_references, hypotheses, n=5)[source]

Calculate a single corpus-level NIST score (a.k.a. system-level NIST) for all the hypotheses and their respective references.

  • list_of_references (list(list(list(str)))) – a corpus of lists of reference sentences, one list per hypothesis

  • hypotheses (list(list(str))) – a list of hypothesis sentences

  • n (int) – highest n-gram order
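
The nesting of the two corpus arguments is easy to get wrong, so here is a small sketch of the expected shapes. The `check_corpus_shape` helper is hypothetical and not part of NLTK; it only assumes that each hypothesis is paired with its own list of references, as the parameter descriptions above suggest.

```python
def check_corpus_shape(list_of_references, hypotheses):
    """Hypothetical helper: verify one reference list per hypothesis.

    Returns the number of (references, hypothesis) pairs.
    """
    if len(list_of_references) != len(hypotheses):
        raise ValueError("need exactly one list of references per hypothesis")
    for references, hypothesis in zip(list_of_references, hypotheses):
        if not references:
            raise ValueError("each hypothesis needs at least one reference")
        # Every sentence is a list of string tokens.
        for sentence in list(references) + [hypothesis]:
            if not all(isinstance(token, str) for token in sentence):
                raise TypeError("sentences must be lists of str tokens")
    return len(hypotheses)

# Two hypotheses; the first has two references, the second has one.
hypotheses = [
    ['It', 'is', 'a', 'guide', 'to', 'action'],
    ['the', 'cat', 'sat'],
]
list_of_references = [
    [['It', 'is', 'a', 'guide'], ['It', 'is', 'the', 'guide', 'to', 'action']],
    [['the', 'cat', 'sat', 'down']],
]

n_pairs = check_corpus_shape(list_of_references, hypotheses)
```

Structures shaped like this would then be passed as `corpus_nist(list_of_references, hypotheses)`.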

nltk.translate.nist_score.nist_length_penalty(ref_len, hyp_len)[source]

Calculates the NIST length penalty, from Eq. 3 in Doddington (2002):

penalty = exp( beta * log( min( len(hyp)/len(ref), 1.0 ) )**2 )


beta is chosen so that the penalty equals 0.5 when the number of words in the system output (hyp) is 2/3 of the average number of words in the reference translations (ref).

The NIST penalty differs from BLEU’s in that it minimizes the impact on the score of small variations in the length of a translation. See Fig. 4 in Doddington (2002).
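
The choice of beta and the contrast with BLEU can be worked through in a short sketch. This is a re-derivation under stated assumptions, not NLTK's implementation. It uses the squared logarithm from Doddington's formulation, and takes BLEU's brevity penalty to be exp(1 - ref_len/hyp_len) for short hypotheses. Both function names are hypothetical.

```python
import math

def nist_penalty_sketch(ref_len, hyp_len):
    """Illustrative NIST length penalty (hypothetical helper, not NLTK code)."""
    if ref_len <= 0:
        raise ValueError("ref_len must be positive")
    ratio = min(hyp_len / ref_len, 1.0)
    if ratio <= 0:
        return 0.0
    # beta solves exp(beta * log(2/3)**2) == 0.5
    beta = math.log(0.5) / math.log(2 / 3) ** 2
    return math.exp(beta * math.log(ratio) ** 2)

def bleu_penalty_sketch(ref_len, hyp_len):
    """Illustrative BLEU brevity penalty, assumed to be exp(1 - ref/hyp)."""
    if hyp_len <= 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)

# A hypothesis at 2/3 of the reference length is penalized by about one half.
print(nist_penalty_sketch(3, 2))

# A slightly short hypothesis (9 words vs. 10) is penalized less by NIST.
print(nist_penalty_sketch(10, 9), bleu_penalty_sketch(10, 9))
```

Hypotheses at least as long as the reference receive no penalty in either sketch.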