nltk.translate.ibm4 module

Translation model that reorders output words based on their type and distance from other related words in the output sentence.

IBM Model 4 improves the distortion model of Model 3, motivated by the observation that certain words tend to be re-ordered in a predictable way relative to one another. For example, <adjective><noun> in English usually has its order flipped as <noun><adjective> in French.

Model 4 requires words in the source and target vocabularies to be categorized into classes. This can be linguistically driven, like parts of speech (adjectives, nouns, prepositions, etc.). Word classes can also be obtained by statistical methods. The original IBM Model 4 uses an information-theoretic approach to group words into 50 classes for each vocabulary.
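As a rough illustration of the linguistically driven option, the sketch below derives an integer class id for each word from a part-of-speech tag. The function name build_word_classes and the tag labels are hypothetical and not part of NLTK; the result is a plain dict from word to integer id.

```python
def build_word_classes(tagged_words):
    """Map each word to an integer class id, one id per distinct tag.

    tagged_words is an iterable of (word, tag) pairs. A word seen with
    several tags keeps the class of its first occurrence (a simplification).
    """
    tag_ids = {}       # tag -> integer class id, assigned in order of appearance
    word_classes = {}  # word -> integer class id
    for word, tag in tagged_words:
        class_id = tag_ids.setdefault(tag, len(tag_ids))
        word_classes.setdefault(word, class_id)
    return word_classes


# Toy input; the tags are illustrative labels only.
tagged = [("the", "DET"), ("a", "DET"), ("small", "ADJ"), ("house", "NOUN")]
print(build_word_classes(tagged))
```

A statistical clustering method could fill the same dict; only the word-to-id mapping matters for this sketch.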



Terminology

Cept

A source word with non-zero fertility, i.e. aligned to one or more target words.


Tablet

The set of target word(s) aligned to a cept.

Head of cept

The first word of the tablet of that cept.

Center of cept

The average position of the words in that cept’s tablet. If the value is not an integer, the ceiling is taken. For example, for a tablet with words in positions 2, 5, 6 in the target sentence, the center of the corresponding cept is ceil((2 + 5 + 6) / 3) = 5.
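The center computation can be sketched in a few lines. The helper name cept_center is hypothetical, and positions are assumed to be 1-indexed target positions as in the example above.

```python
import math


def cept_center(tablet_positions):
    """Return the ceiling of the mean target position of a cept's tablet."""
    if not tablet_positions:
        # A cept has non-zero fertility, so its tablet is never empty.
        raise ValueError("tablet must contain at least one position")
    return math.ceil(sum(tablet_positions) / len(tablet_positions))


print(cept_center([2, 5, 6]))  # 13 / 3 is about 4.33, rounded up to 5
```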


Displacement

For a head word, the displacement is (position of head word - position of previous cept’s center), and it can be positive or negative. For a non-head word, it is (position of non-head word - position of previous word in the same tablet), which is always positive because successive words in a tablet are assumed to appear to the right of the previous word.
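A minimal sketch of both displacement definitions for a single tablet follows. The function name and the choice to pass the previous cept's center as an argument are assumptions made for illustration.

```python
def tablet_displacements(tablet_positions, previous_center):
    """Return the displacement of every word in one tablet.

    The first value belongs to the head word and is measured from the
    previous cept's center; the rest are measured from the preceding
    word in the same tablet.
    """
    positions = sorted(tablet_positions)
    result = [positions[0] - previous_center]  # head word
    for previous, current in zip(positions, positions[1:]):
        result.append(current - previous)      # non-head words, always > 0
    return result


print(tablet_displacements([2, 5, 6], previous_center=4))  # [-2, 3, 1]
```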

In contrast to Model 3, which reorders words in a tablet independently of other words, Model 4 distinguishes between three cases.

  1. Words generated by NULL are distributed uniformly.

  2. For a head word t, its position is modeled by the probability d_head(displacement | word_class_s(s), word_class_t(t)), where s is the previous cept, and word_class_s and word_class_t map s and t to a source and target language word class respectively.

  3. For a non-head word t, its position is modeled by the probability d_non_head(displacement | word_class_t(t)).
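The three cases can be pictured as a single lookup function. This is only a sketch: the toy tables, their index order, the function name position_prob, and the way the uniform NULL case is normalised (over a count of free positions) are all assumptions, not NLTK's implementation.

```python
# Toy tables; the numbers are made up for illustration.
# Assumed index order: d_head[displacement][source_class][target_class]
# and d_non_head[displacement][target_class].
d_head = {1: {0: {1: 0.7}}, -1: {0: {1: 0.3}}}
d_non_head = {1: {1: 0.9}, 2: {1: 0.1}}


def position_prob(displacement, trg_class, *, generated_by_null=False,
                  is_head=True, prev_src_class=None, free_positions=1):
    """Return the probability of placing one target word (sketch)."""
    if generated_by_null:
        # Case 1: uniform over the positions assumed to be still available.
        return 1.0 / free_positions
    if is_head:
        # Case 2: conditioned on the previous cept's class and the word's class.
        return d_head[displacement][prev_src_class][trg_class]
    # Case 3: conditioned only on the word's own class.
    return d_non_head[displacement][trg_class]


print(position_prob(1, 1, prev_src_class=0))   # head word lookup
print(position_prob(2, 1, is_head=False))      # non-head word lookup
print(position_prob(0, 1, generated_by_null=True, free_positions=4))
```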

The EM algorithm used in Model 4 is:

E step

In the training data, collect counts, weighted by prior probabilities.

    1. count how many times a source language word is translated into a target language word

    2. for a particular word class, count how many times a head word is located at a particular displacement from the previous cept’s center

    3. for a particular word class, count how many times a non-head word is located at a particular displacement from the previous target word

    4. count how many times a source word is aligned to phi number of target words

    5. count how many times NULL is aligned to a target word

M step

Estimate new probabilities based on the counts from the E step
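To make the two steps concrete, here is a small sketch for the head distortion counts only. The names collect and normalize and the dict layout are hypothetical; the weights stand in for alignment probabilities that the real E step would compute.

```python
from collections import defaultdict

# head_counts[(source_class, target_class)][displacement] -> expected count
head_counts = defaultdict(lambda: defaultdict(float))


def collect(counts, src_class, trg_class, displacement, weight):
    """E step: add a fractional count, weighted by the alignment probability."""
    counts[(src_class, trg_class)][displacement] += weight


def normalize(counts):
    """M step: turn counts into probabilities within each class context."""
    table = {}
    for context, by_displacement in counts.items():
        total = sum(by_displacement.values())
        table[context] = {dj: c / total for dj, c in by_displacement.items()}
    return table


# Two sampled alignments with made-up weights 0.75 and 0.25.
collect(head_counts, 0, 1, displacement=1, weight=0.75)
collect(head_counts, 0, 1, displacement=-1, weight=0.25)
head_table = normalize(head_counts)
print(head_table[(0, 1)])
```

The other tables (translation, non-head distortion, fertility, NULL insertion) would follow the same count-then-normalize pattern under this sketch.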

Like Model 3, there are too many possible alignments to consider. Thus, a hill climbing approach is used to sample good candidates.
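The idea of hill climbing over alignments can be sketched generically as follows. This is not NLTK's implementation: the neighbor definition (change one link, or swap two links), the tuple representation, and the toy scoring function are assumptions chosen to keep the example short.

```python
def neighbors(alignment, num_source_words):
    """Yield alignments that differ by one move or one swap.

    alignment[k] is the source position (0 means NULL) linked to the
    k-th target word.
    """
    for k, old in enumerate(alignment):
        for new in range(num_source_words + 1):
            if new != old:
                yield alignment[:k] + (new,) + alignment[k + 1:]
    for a in range(len(alignment)):
        for b in range(a + 1, len(alignment)):
            if alignment[a] != alignment[b]:
                swapped = list(alignment)
                swapped[a], swapped[b] = swapped[b], swapped[a]
                yield tuple(swapped)


def hill_climb(start, score, num_source_words):
    """Move to the best-scoring neighbor until no neighbor improves."""
    current, current_score = start, score(start)
    while True:
        best = max(neighbors(current, num_source_words), key=score, default=None)
        if best is None or score(best) <= current_score:
            return current
        current, current_score = best, score(best)


# Toy score: number of links that agree with a fixed reference alignment.
reference = (1, 2, 3)
toy_score = lambda a: sum(x == y for x, y in zip(a, reference))
print(hill_climb((0, 0, 0), toy_score, num_source_words=3))
```

In training, the score would be the model probability of the target sentence and alignment, and the alignments visited along the way would serve as the sampled candidates.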



Notations

i

Position in the source sentence. Valid values are 0 (for NULL), 1, 2, …, length of source sentence.

j

Position in the target sentence. Valid values are 1, 2, …, length of target sentence.

l

Number of words in the source sentence, excluding NULL

m

Number of words in the target sentence

s

A word in the source language

t

A word in the target language

phi

Fertility, the number of target words produced by a source word

p1

Probability that a target word produced by a source word is accompanied by another target word that is aligned to NULL

p0

1 - p1

dj

Displacement, Δj


References

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.

class nltk.translate.ibm4.IBMModel4

Bases: IBMModel

Translation model that reorders output words based on their type and their distance from other related words in the output sentence.

>>> from nltk.translate import AlignedSent, IBMModel4
>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'war', 'ja', 'groß'], ['the', 'house', 'was', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))
>>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book']))
>>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize']))
>>> src_classes = {'the': 0, 'a': 0, 'small': 1, 'big': 1, 'house': 2, 'book': 2, 'is': 3, 'was': 3, 'i': 4, 'summarize': 5 }
>>> trg_classes = {'das': 0, 'ein': 0, 'haus': 1, 'buch': 1, 'klein': 2, 'groß': 2, 'ist': 3, 'war': 3, 'ja': 4, 'ich': 5, 'fasse': 6, 'zusammen': 6 }
>>> ibm4 = IBMModel4(bitext, 5, src_classes, trg_classes)
>>> print(round(ibm4.translation_table['buch']['book'], 3))
>>> print(round(ibm4.translation_table['das']['book'], 3))
>>> print(round(ibm4.translation_table['ja'][None], 3))
>>> print(round(ibm4.head_distortion_table[1][0][1], 3))
>>> print(round(ibm4.head_distortion_table[2][0][1], 3))
>>> print(round(ibm4.non_head_distortion_table[3][6], 3))
>>> print(round(ibm4.fertility_table[2]['summarize'], 3))
>>> print(round(ibm4.fertility_table[1]['book'], 3))
>>> print(round(ibm4.p1, 3))
>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)])
__init__(sentence_aligned_corpus, iterations, source_word_classes, target_word_classes, probability_tables=None)

Train on sentence_aligned_corpus and create a lexical translation model, distortion models, a fertility model, and a model for generating NULL-aligned words.

Translation direction is from AlignedSent.mots to AlignedSent.words.

  • sentence_aligned_corpus (list(AlignedSent)) – Sentence-aligned parallel corpus

  • iterations (int) – Number of iterations to run training algorithm

  • source_word_classes (dict[str]: int) – Lookup table that maps a source word to its word class, the latter represented by an integer id

  • target_word_classes (dict[str]: int) – Lookup table that maps a target word to its word class, the latter represented by an integer id

  • probability_tables (dict[str]: object) – Optional. Use this to pass in custom probability values. If not specified, probabilities will be set to a uniform distribution, or some other sensible value. If specified, all the following entries must be present: translation_table, alignment_table, fertility_table, p1, head_distortion_table, non_head_distortion_table. See IBMModel and IBMModel4 for the type and purpose of these tables.

static model4_prob_t_a_given_s(alignment_info, ibm_model)

Probability of target sentence and an alignment given the source sentence


set_uniform_probabilities(sentence_aligned_corpus)

Set distortion probabilities uniformly to 1 / cardinality of displacement values.
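One way to picture the uniform initial value is shown below. This is a sketch under an assumption, not a statement of what the library does: it counts the displacement values as the nonzero integers from -(m - 1) to m - 1, where m is the longest target sentence length in the corpus. The function name is hypothetical.

```python
def uniform_distortion_prob(max_target_length):
    """Return 1 / (number of assumed displacement values), or None.

    Assumption: displacements range over -(m - 1)..-1 and 1..(m - 1),
    giving 2 * (m - 1) values for a longest target sentence of length m.
    """
    if max_target_length <= 1:
        return None  # no displacement is possible in a one-word sentence
    num_values = 2 * (max_target_length - 1)
    return 1.0 / num_values


print(uniform_distortion_prob(5))  # 8 assumed values, so 0.125
```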

class nltk.translate.ibm4.Model4Counts

Bases: Counts

Data object to store counts of various parameters during training. Includes counts for distortion.

update_distortion(count, alignment_info, j, src_classes, trg_classes)