nltk.translate.ibm2 module

Lexical translation model that considers word order.

IBM Model 2 improves on Model 1 by accounting for word order. An alignment probability is introduced, a(i | j,l,m), which predicts a source word position, given its aligned target word’s position.

The EM algorithm used in Model 2 is:

E step:

In the training data, collect counts, weighted by prior probabilities.

    1. count how many times a source language word is translated into a target language word

    1. count how many times a particular position in the source sentence is aligned to a particular position in the target sentence

M step:

Estimate new probabilities based on the counts from the E step

Notations

i:

Position in the source sentence Valid values are 0 (for NULL), 1, 2, …, length of source sentence

j:

Position in the target sentence Valid values are 1, 2, …, length of target sentence

l:

Number of words in the source sentence, excluding NULL

m:

Number of words in the target sentence

s:

A word in the source language

t:

A word in the target language

References

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.

Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.

class nltk.translate.ibm2.IBMModel2[source]

Bases: IBMModel

Lexical translation model that considers word order

>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))
>>> ibm2 = IBMModel2(bitext, 5)
>>> print(round(ibm2.translation_table['buch']['book'], 3))
1.0
>>> print(round(ibm2.translation_table['das']['book'], 3))
0.0
>>> print(round(ibm2.translation_table['buch'][None], 3))
0.0
>>> print(round(ibm2.translation_table['ja'][None], 3))
0.0
>>> print(round(ibm2.alignment_table[1][1][2][2], 3))
0.939
>>> print(round(ibm2.alignment_table[1][2][2][2], 3))
0.0
>>> print(round(ibm2.alignment_table[2][2][4][5], 3))
1.0
>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)])
__init__(sentence_aligned_corpus, iterations, probability_tables=None)[source]

Train on sentence_aligned_corpus and create a lexical translation model and an alignment model.

Translation direction is from AlignedSent.mots to AlignedSent.words.

Parameters:
  • sentence_aligned_corpus (list(AlignedSent)) – Sentence-aligned parallel corpus

  • iterations (int) – Number of iterations to run training algorithm

  • probability_tables (dict[str]: object) – Optional. Use this to pass in custom probability values. If not specified, probabilities will be set to a uniform distribution, or some other sensible value. If specified, all the following entries must be present: translation_table, alignment_table. See IBMModel for the type and purpose of these tables.

align(sentence_pair)[source]

Determines the best word alignment for one sentence pair from the corpus that the model was trained on.

The best alignment will be set in sentence_pair when the method returns. In contrast with the internal implementation of IBM models, the word indices in the Alignment are zero- indexed, not one-indexed.

Parameters:

sentence_pair (AlignedSent) – A sentence in the source language and its counterpart sentence in the target language

align_all(parallel_corpus)[source]
maximize_alignment_probabilities(counts)[source]
prob_alignment_point(i, j, src_sentence, trg_sentence)[source]

Probability that position j in trg_sentence is aligned to position i in the src_sentence

prob_all_alignments(src_sentence, trg_sentence)[source]

Computes the probability of all possible word alignments, expressed as a marginal distribution over target words t

Each entry in the return value represents the contribution to the total alignment probability by the target word t.

To obtain probability(alignment | src_sentence, trg_sentence), simply sum the entries in the return value.

Returns:

Probability of t for all s in src_sentence

Return type:

dict(str): float

prob_t_a_given_s(alignment_info)[source]

Probability of target sentence and an alignment given the source sentence

set_uniform_probabilities(sentence_aligned_corpus)[source]

Initialize probability tables to a uniform distribution

Derived classes should implement this accordingly.

train(parallel_corpus)[source]
class nltk.translate.ibm2.Model2Counts[source]

Bases: Counts

Data object to store counts of various parameters during training. Includes counts for alignment.

__init__()[source]
update_alignment(count, i, j, l, m)[source]
update_lexical_translation(count, s, t)[source]