nltk.translate.ibm_model module

Common methods and classes for all IBM models. See IBMModel1, IBMModel2, IBMModel3, IBMModel4, and IBMModel5 for specific implementations.

The IBM models are a series of generative models that learn lexical translation probabilities, p(target language word|source language word), given a sentence-aligned parallel corpus.

The models increase in sophistication from model 1 to 5. Typically, the output of lower models is used to seed the higher models. All models use the Expectation-Maximization (EM) algorithm to learn various probability tables.

Words in a sentence are one-indexed. The first word of a sentence has position 1, not 0. Index 0 is reserved in the source sentence for the NULL token. The concept of position does not apply to NULL, but it is indexed at 0 by convention.

Each target word is aligned to exactly one source word or the NULL token.
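
For orientation, here is a minimal training sketch using IBM Model 1; the toy corpus and the probability lookup are illustrative only, and the exact learned values depend on the data and the number of EM iterations.

    # Train IBM Model 1 on a toy sentence-aligned corpus and look up
    # a learned lexical translation probability.
    from nltk.translate import AlignedSent, IBMModel1

    # AlignedSent(words, mots): words is the target language sentence,
    # mots is the source language sentence.
    bitext = [
        AlignedSent(['klein', 'ist', 'das', 'haus'],
                    ['the', 'house', 'is', 'small']),
        AlignedSent(['das', 'haus', 'ist', 'gross'],
                    ['the', 'house', 'is', 'big']),
        AlignedSent(['das', 'haus'], ['the', 'house']),
    ]

    ibm1 = IBMModel1(bitext, 5)  # 5 EM iterations

    # translation_table[t][s] holds p(t|s).
    print(ibm1.translation_table['haus']['house'])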

References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.

Peter E. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), 263-311.

class nltk.translate.ibm_model.AlignmentInfo[source]

Bases: object

Helper data object for training IBM Models 3 and up

Read-only. For a source sentence and its counterpart in the target language, this class holds information about the sentence pair’s alignment, cepts, and fertility.

Warning: Alignments are one-indexed here, in contrast to nltk.translate.Alignment and AlignedSent, which are zero-indexed. This class is not meant to be used outside of IBM models.

__init__(alignment, src_sentence, trg_sentence, cepts)[source]
alignment

tuple(int): Alignment function. alignment[j] is the position in the source sentence that is aligned to position j in the target sentence.

center_of_cept(i)[source]
Returns

The ceiling of the average of the positions of the words in the tablet of cept i, or 0 if i is None

cepts

list(list(int)): The positions of the target words, in ascending order, aligned to a source word position. For example, cepts[4] = (2, 3, 7) means that words in positions 2, 3 and 7 of the target sentence are aligned to the word in position 4 of the source sentence
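
To make these conventions concrete, here is a hand-built example; in practice AlignmentInfo objects are constructed internally during training, and the sentence pair below is purely illustrative.

    from nltk.translate.ibm_model import AlignmentInfo

    src_sentence = (None, 'le', 'chien', 'noir')     # NULL token at index 0
    trg_sentence = ('DUMMY', 'the', 'black', 'dog')  # dummy element at index 0

    # alignment[j] = i: target word j is aligned to source word i.
    # Here 'the' -> 'le', 'black' -> 'noir', 'dog' -> 'chien'.
    alignment = (0, 1, 3, 2)

    # cepts[i] lists the target positions aligned to source position i;
    # cepts[0] collects words aligned to NULL (none in this example).
    cepts = [[], [1], [3], [2]]

    a_info = AlignmentInfo(alignment, src_sentence, trg_sentence, cepts)
    print(a_info.fertility_of_i(2))         # 1: only 'dog' aligns to 'chien'
    print(a_info.center_of_cept(3))         # ceil(2 / 1) = 2
    print(a_info.zero_indexed_alignment())  # e.g. [(0, 0), (1, 2), (2, 1)]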

fertility_of_i(i)[source]

Fertility of word in position i of the source sentence

is_head_word(j)[source]
Returns

Whether the word in position j of the target sentence is a head word

previous_cept(j)[source]
Returns

The previous cept of j, or None if j belongs to the first cept

previous_in_tablet(j)[source]
Returns

The position of the previous word that is in the same tablet as j, or None if j is the first word of the tablet

score

float: Optional. Probability of alignment, as defined by the IBM model that assesses this alignment

src_sentence

tuple(str): Source sentence referred to by this object. Should include NULL token (None) in index 0.

trg_sentence

tuple(str): Target sentence referred to by this object. Should have a dummy element in index 0 so that the first word starts from index 1.

zero_indexed_alignment()[source]
Returns

Zero-indexed alignment, suitable for use in external nltk.translate modules like nltk.translate.Alignment

Return type

list(tuple)

class nltk.translate.ibm_model.Counts[source]

Bases: object

Data object to store counts of various parameters during training

__init__()[source]
update_fertility(count, alignment_info)[source]
update_lexical_translation(count, alignment_info, j)[source]
update_null_generation(count, alignment_info)[source]
class nltk.translate.ibm_model.IBMModel[source]

Bases: object

Abstract base class for all IBM models

MIN_PROB = 1e-12
__init__(sentence_aligned_corpus)[source]
best_model2_alignment(sentence_pair, j_pegged=None, i_pegged=0)[source]

Finds the best alignment according to IBM Model 2

Used as a starting point for hill climbing in Models 3 and above, because it is easier to compute than the best alignments in higher models

Parameters
  • sentence_pair (AlignedSent) – Source and target language sentence pair to be word-aligned

  • j_pegged (int) – If specified, the alignment point of j_pegged will be fixed to i_pegged

  • i_pegged (int) – Source sentence position to which j_pegged will be aligned

hillclimb(alignment_info, j_pegged=None)[source]

Starting from the alignment in alignment_info, look at neighboring alignments iteratively for the best one

There is no guarantee that the best alignment in the alignment space will be found, because the algorithm might get stuck in a local maximum.

Parameters

j_pegged (int) – If specified, the search will be constrained to alignments where j_pegged remains unchanged

Returns

The best alignment found from hill climbing

Return type

AlignmentInfo
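
The loop below is a simplified sketch of this search, consistent with the behavior documented here but not a copy of the actual implementation; it assumes a model instance exposing prob_t_a_given_s() and neighboring() as described on this page.

    def hillclimb_sketch(model, alignment_info, j_pegged=None):
        # Greedy local search: repeatedly move to the most probable
        # neighboring alignment until no neighbor improves the score.
        alignment = alignment_info
        max_probability = model.prob_t_a_given_s(alignment)
        while True:
            old_alignment = alignment
            for neighbor in model.neighboring(alignment, j_pegged):
                neighbor_probability = model.prob_t_a_given_s(neighbor)
                if neighbor_probability > max_probability:
                    alignment = neighbor
                    max_probability = neighbor_probability
            if alignment == old_alignment:
                break  # local maximum reached
        # Record the probability on the returned alignment.
        alignment.score = max_probability
        return alignment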

init_vocab(sentence_aligned_corpus)[source]
maximize_fertility_probabilities(counts)[source]
maximize_lexical_translation_probabilities(counts)[source]
maximize_null_generation_probabilities(counts)[source]
neighboring(alignment_info, j_pegged=None)[source]

Determine the neighbors of alignment_info, obtained by moving or swapping one alignment point

Parameters

j_pegged (int) – If specified, neighbors that have a different alignment point from j_pegged will not be considered

Returns

A set of neighboring alignments represented by their AlignmentInfo

Return type

set(AlignmentInfo)
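
A sketch of how such move and swap neighbors can be enumerated; make_alignment_info is a hypothetical helper that rebuilds an AlignmentInfo (including its cepts) from a raw alignment tuple, something the actual implementation does inline.

    def neighboring_sketch(alignment_info, make_alignment_info, j_pegged=None):
        neighbors = set()
        alignment = alignment_info.alignment
        l = len(alignment_info.src_sentence) - 1  # source length, excluding NULL
        m = len(alignment_info.trg_sentence) - 1  # target length, excluding dummy

        # Moves: re-align one target position j to a different source position.
        for j in range(1, m + 1):
            if j == j_pegged:
                continue  # the pegged alignment point must stay fixed
            for i in range(0, l + 1):  # position 0 is the NULL token
                new_alignment = list(alignment)
                new_alignment[j] = i
                neighbors.add(make_alignment_info(tuple(new_alignment)))

        # Swaps: exchange the alignment points of two target positions.
        for j1 in range(1, m + 1):
            for j2 in range(j1 + 1, m + 1):
                if j_pegged in (j1, j2):
                    continue
                new_alignment = list(alignment)
                new_alignment[j1], new_alignment[j2] = alignment[j2], alignment[j1]
                neighbors.add(make_alignment_info(tuple(new_alignment)))

        return neighbors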

prob_of_alignments(alignments)[source]
prob_t_a_given_s(alignment_info)[source]

Probability of the target sentence and an alignment given the source sentence

All required information is assumed to be in alignment_info and self.

Derived classes should override this method

reset_probabilities()[source]
sample(sentence_pair)[source]

Sample the most probable alignments from the entire alignment space

First, determine the best alignment according to IBM Model 2. With this initial alignment, use hill climbing to determine the best alignment according to a higher IBM Model. Add this alignment and its neighbors to the sample set. Repeat this process with other initial alignments obtained by pegging an alignment point.

Hill climbing may get stuck in a local maximum, hence the pegging and trying out of different initial alignments.

Parameters

sentence_pair (AlignedSent) – Source and target language sentence pair to generate a sample of alignments from

Returns

A set of best alignments represented by their AlignmentInfo and the best alignment of the set for convenience

Return type

set(AlignmentInfo), AlignmentInfo
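
Putting the pieces together, the procedure can be sketched as follows, assuming a model instance with best_model2_alignment(), hillclimb(), and neighboring() as documented above; details of the actual implementation may differ.

    def sample_sketch(model, sentence_pair):
        sampled_alignments = set()
        l = len(sentence_pair.mots)   # source sentence length
        m = len(sentence_pair.words)  # target sentence length

        # Unconstrained start: best Model 2 alignment, refined by hill climbing.
        initial = model.best_model2_alignment(sentence_pair)
        best_alignment = model.hillclimb(initial)
        sampled_alignments.update(model.neighboring(best_alignment))

        # Pegged starts: fix each alignment point (j, i) in turn and climb
        # again, to reach parts of the space the first climb may have missed.
        for j in range(1, m + 1):
            for i in range(0, l + 1):
                pegged = model.best_model2_alignment(sentence_pair, j, i)
                climbed = model.hillclimb(pegged, j)
                sampled_alignments.update(model.neighboring(climbed, j))

        return sampled_alignments, best_alignment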

set_uniform_probabilities(sentence_aligned_corpus)[source]

Initialize probability tables to a uniform distribution

Derived classes should implement this accordingly.
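
As an illustration of what a derived class might do, the sketch below follows the Model 1 style of uniform initialization, where every target word is equally likely given any source word; it assumes the src_vocab, trg_vocab, and translation_table attributes set up by this base class.

    def set_uniform_probabilities_sketch(model, sentence_aligned_corpus):
        # p(t|s) = 1 / |target vocabulary| for every (t, s) pair.
        initial_prob = 1 / len(model.trg_vocab)
        for t in model.trg_vocab:
            for s in model.src_vocab:
                model.translation_table[t][s] = initial_prob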

nltk.translate.ibm_model.longest_target_sentence_length(sentence_aligned_corpus)[source]
Parameters

sentence_aligned_corpus (list(AlignedSent)) – Parallel corpus under consideration

Returns

Number of words in the longest target language sentence of sentence_aligned_corpus
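
Behaviorally, this is equivalent to the following one-liner, assuming each AlignedSent stores its target words in the words attribute (and taking the maximum over an empty corpus to be 0).

    def longest_target_sentence_length_sketch(sentence_aligned_corpus):
        return max((len(pair.words) for pair in sentence_aligned_corpus),
                   default=0)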