nltk.align package¶
Submodules¶
nltk.align.api module¶
class nltk.align.api.AlignedSent(words=[], mots=[], alignment='', encoding='utf8')[source]¶
Bases: object

Return an aligned sentence object, which encapsulates two sentences along with an Alignment between them.

>>> from nltk.align import AlignedSent
>>> algnsent = AlignedSent(['klein', 'ist', 'das', 'Haus'],
...     ['the', 'house', 'is', 'small'], '0-2 1-3 2-1 3-0')
>>> algnsent.words
['klein', 'ist', 'das', 'Haus']
>>> algnsent.mots
['the', 'house', 'is', 'small']
>>> algnsent.alignment
Alignment([(0, 2), (1, 3), (2, 1), (3, 0)])
>>> algnsent.precision('0-2 1-3 2-1 3-3')
0.75
>>> from nltk.corpus import comtrans
>>> print(comtrans.aligned_sents()[54])
<AlignedSent: 'Weshalb also sollten...' -> 'So why should EU arm...'>
>>> print(comtrans.aligned_sents()[54].alignment)
0-0 0-1 1-0 2-2 3-4 3-5 4-7 5-8 6-3 7-9 8-9 9-10 9-11 10-12 11-6 12-6 13-13

Parameters:
- words (list(str)) – source language words
- mots (list(str)) – target language words
- alignment (Alignment) – the word-level alignments between the source and target language sentences
alignment¶
alignment_error_rate(reference, possible=None)[source]¶
Return the Alignment Error Rate (AER) of an aligned sentence with respect to a “gold standard” reference AlignedSent.

Return an error rate between 0.0 (perfect alignment) and 1.0 (no alignment).

>>> from nltk.align import AlignedSent
>>> s = AlignedSent(["the", "cat"], ["le", "chat"], [(0, 0), (1, 1)])
>>> s.alignment_error_rate(s)
0.0

Parameters:
- reference (AlignedSent or Alignment) – A “gold standard” reference aligned sentence.
- possible (AlignedSent or Alignment or None) – A “gold standard” reference of possible alignments (defaults to reference if None)
Return type: float or None
invert()[source]¶
Return the aligned sentence pair, reversing the directionality.

Return type: AlignedSent
mots¶
precision(reference)[source]¶
Return the precision of an aligned sentence with respect to a “gold standard” reference AlignedSent.

Parameters: reference (AlignedSent or Alignment) – A “gold standard” reference aligned sentence.
Return type: float or None
recall(reference)[source]¶
Return the recall of an aligned sentence with respect to a “gold standard” reference AlignedSent.

Parameters: reference (AlignedSent or Alignment) – A “gold standard” reference aligned sentence.
Return type: float or None
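Precision, recall, and the AER above all reduce to set arithmetic over alignment pairs. A self-contained sketch of that arithmetic, independent of the NLTK classes (`parse_pharaoh` is a hypothetical helper for the `'0-2 1-3'` string format), reproduces the 0.75 precision from the AlignedSent example:

```python
def parse_pharaoh(s):
    """Parse a Pharaoh-format alignment string into a set of (i, j) pairs."""
    return {tuple(int(n) for n in pair.split("-")) for pair in s.split()}

hypothesis = parse_pharaoh("0-2 1-3 2-1 3-0")  # alignment under test, A
sure = parse_pharaoh("0-2 1-3 2-1 3-3")        # "gold standard" sure set, S
possible = sure                                # possible set P defaults to S

precision = len(hypothesis & possible) / len(hypothesis)  # |A ∩ P| / |A|
recall = len(hypothesis & sure) / len(sure)               # |A ∩ S| / |S|
# AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
aer = 1 - (len(hypothesis & sure) + len(hypothesis & possible)) \
      / (len(hypothesis) + len(sure))

print(precision, recall, aer)  # 0.75 0.75 0.25
```

With `possible` equal to `sure`, the AER simplifies to `1 - 2|A ∩ S| / (|A| + |S|)`, which is why a perfect self-comparison gives 0.0.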
words¶
class nltk.align.api.Alignment[source]¶
Bases: frozenset

A storage class for representing alignment between two sequences, s1, s2. In general, an alignment is a set of tuples of the form (i, j, ...) representing an alignment between the i-th element of s1 and the j-th element of s2. Tuples are extensible (they might contain additional data, such as a boolean to indicate sure vs possible alignments).

>>> from nltk.align import Alignment
>>> a = Alignment([(0, 0), (0, 1), (1, 2), (2, 2)])
>>> a.invert()
Alignment([(0, 0), (1, 0), (2, 1), (2, 2)])
>>> print(a.invert())
0-0 1-0 2-1 2-2
>>> a[0]
[(0, 1), (0, 0)]
>>> a.invert()[2]
[(2, 1), (2, 2)]
>>> b = Alignment([(0, 0), (0, 1)])
>>> b.issubset(a)
True
>>> c = Alignment('0-0 0-1')
>>> b == c
True
range(positions=None)[source]¶
Work out the range of the mapping from the given positions. If no positions are specified, compute the range of the entire mapping.
unicode_repr()¶
Produce a Giza-formatted string representing the alignment.
-
nltk.align.bleu_score module¶
BLEU score implementation.
nltk.align.bleu_score.bleu(candidate, references, weights)[source]¶
Calculate BLEU score (Bilingual Evaluation Understudy).

Parameters:
- candidate (list(str)) – a candidate sentence
- references (list(list(str))) – reference sentences
- weights (list(float)) – weights for unigrams, bigrams, trigrams and so on

>>> weights = [0.25, 0.25, 0.25, 0.25]
>>> candidate1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...               'ensures', 'that', 'the', 'military', 'always',
...               'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> candidate2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
...               'forever', 'hearing', 'the', 'activity', 'guidebook',
...               'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will', 'forever',
...               'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> bleu(candidate1, [reference1, reference2, reference3], weights)
0.504...
>>> bleu(candidate2, [reference1, reference2, reference3], weights)
0
Papineni, Kishore, et al. “BLEU: A method for automatic evaluation of machine translation.” Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002. http://www.aclweb.org/anthology/P02-1040.pdf
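The core of the score is the clipped ("modified") n-gram precision introduced in Papineni et al.: each candidate n-gram is credited at most as many times as it occurs in any single reference. A minimal sketch of that step, independent of the NLTK implementation:

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against multiple references."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand_counts = ngrams(candidate)
    # For each n-gram, the maximum count observed in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for ngram, count in ngrams(ref).items():
            max_ref_counts[ngram] = max(max_ref_counts[ngram], count)
    # Clip each candidate count at its reference maximum.
    clipped = sum(min(count, max_ref_counts[ngram])
                  for ngram, count in cand_counts.items())
    return clipped / max(1, sum(cand_counts.values()))

# The classic clipping example from the BLEU paper: a candidate of seven
# 'the's scores 2/7, not 7/7, because 'the' occurs at most twice in the
# reference sentence.
p1 = modified_precision(['the'] * 7,
                        [['the', 'cat', 'is', 'on', 'the', 'mat']], 1)
```

bleu then combines the per-order precisions as a weighted geometric mean (using the weights parameter) and multiplies by a brevity penalty for candidates shorter than the references.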
nltk.align.gale_church module¶
A port of the Gale-Church Aligner.
Gale & Church (1993), A Program for Aligning Sentences in Bilingual Corpora. http://aclweb.org/anthology/J93-1004.pdf
class nltk.align.gale_church.LanguageIndependent[source]¶
Bases: object

AVERAGE_CHARACTERS = 1¶

PRIORS = {(0, 1): 0.0099, (1, 2): 0.089, (2, 2): 0.011, (1, 0): 0.0099, (1, 1): 0.89, (2, 1): 0.089}¶

VARIANCE_CHARACTERS = 6.8¶
nltk.align.gale_church.align_blocks(source_sents, target_sents, params=<class 'nltk.align.gale_church.LanguageIndependent'>)[source]¶
Return the sentence alignment of two text blocks (usually paragraphs).

>>> align_blocks([5,5,5], [7,7,7])
[(0, 0), (1, 1), (2, 2)]
>>> align_blocks([10,5,5], [12,20])
[(0, 0), (1, 1), (2, 1)]
>>> align_blocks([12,20], [10,5,5])
[(0, 0), (1, 1), (1, 2)]
>>> align_blocks([10,2,10,10,2,10], [12,3,20,3,12])
[(0, 0), (1, 1), (2, 2), (3, 2), (4, 3), (5, 4)]

@param source_sents: The list of source sentence lengths.
@param target_sents: The list of target sentence lengths.
@param params: The sentence alignment parameters.
@return: The sentence alignments, a list of index pairs.
nltk.align.gale_church.align_log_prob(i, j, source_sents, target_sents, alignment, params)[source]¶
Return the log probability of the two sentences source_sents[i] and target_sents[j] being aligned with a specific alignment.

@param i: The offset of the source sentence.
@param j: The offset of the target sentence.
@param source_sents: The list of source sentence lengths.
@param target_sents: The list of target sentence lengths.
@param alignment: The alignment type, a tuple of two integers.
@param params: The sentence alignment parameters.
@return: The log probability of a specific alignment between the two sentences, given the parameters.
nltk.align.gale_church.align_texts(source_blocks, target_blocks, params=<class 'nltk.align.gale_church.LanguageIndependent'>)[source]¶
Create the sentence alignment of two texts.

Texts can consist of several blocks. Block boundaries cannot be crossed by sentence alignment links.

Each block consists of a list that contains the lengths (in characters) of the sentences in this block.

@param source_blocks: The list of blocks in the source text.
@param target_blocks: The list of blocks in the target text.
@param params: The sentence alignment parameters.
@return: A list of sentence alignment lists.
nltk.align.gale_church.norm_cdf(x)[source]¶
Return the area under the normal distribution from -∞ to x.
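The standard normal CDF has a closed form in terms of the error function; a one-line equivalent (a sketch of the mathematics, not necessarily the module's actual implementation):

```python
import math

def norm_cdf(x):
    # Area under the standard normal density from -infinity to x,
    # via the identity Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    return (1.0 + math.erf(x / math.sqrt(2.0))) / 2.0

phi_zero = norm_cdf(0.0)   # 0.5 by symmetry
phi_196 = norm_cdf(1.96)   # ~0.975, the familiar 95% two-sided bound
```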
nltk.align.gale_church.parse_token_stream(stream, soft_delimiter, hard_delimiter)[source]¶
Parse a stream of tokens and split it into sentences (using soft_delimiter tokens) and blocks (using hard_delimiter tokens) for use with the align_texts function.
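A sketch of the behaviour described above (an illustration, not the module's actual implementation): soft delimiters end sentences, hard delimiters end blocks, and each sentence is reduced to its length in characters — the shape of input align_texts expects.

```python
def parse_token_stream(stream, soft_delimiter, hard_delimiter):
    """Split a token stream into blocks of sentence lengths (in characters)."""
    blocks, block, sentence = [], [], []
    for token in stream:
        if token == soft_delimiter:        # end of sentence
            block.append(sum(len(t) for t in sentence))
            sentence = []
        elif token == hard_delimiter:      # end of block
            if sentence:
                block.append(sum(len(t) for t in sentence))
                sentence = []
            blocks.append(block)
            block = []
        else:
            sentence.append(token)
    if sentence:                           # flush any trailing sentence
        block.append(sum(len(t) for t in sentence))
    if block:                              # and any trailing block
        blocks.append(block)
    return blocks

result = parse_token_stream(["ab", "c", ".", "de", ".", "#", "f", "."],
                            soft_delimiter=".", hard_delimiter="#")
```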
nltk.align.gdfa module¶
nltk.align.gdfa.grow_diag_final_and(srclen, trglen, e2f, f2e)[source]¶
Symmetrize the source-to-target and target-to-source word alignment outputs into a single alignment, using the grow-diag-final-and (GDFA) algorithm (Koehn, 2005).

Step 1: Find the intersection of the bidirectional alignments.

Step 2: Search for additional neighbor alignment points to be added, given these criteria: (i) neighbor alignment points are not in the intersection and (ii) neighbor alignments are in the union.

Step 3: Add all other alignment points that are not in the intersection and not among the neighboring alignments that met the criteria, but are in the original forward/backward alignment outputs.

>>> forw = ('0-0 2-1 9-2 21-3 10-4 7-5 11-6 9-7 12-8 1-9 3-10 '
...         '4-11 17-12 17-13 25-14 13-15 24-16 11-17 28-18')
>>> back = ('0-0 1-9 2-9 3-10 4-11 5-12 6-6 7-5 8-6 9-7 10-4 '
...         '11-6 12-8 13-12 15-12 17-13 18-13 19-12 20-13 '
...         '21-3 22-12 23-14 24-17 25-15 26-17 27-18 28-18')
>>> srctext = ("この よう な ハロー 白色 わい 星 の L 関数 "
...            "は L と 共 に 不連続 に 増加 する こと が "
...            "期待 さ れる こと を 示し た 。")
>>> trgtext = ("Therefore , we expect that the luminosity function "
...            "of such halo white dwarfs increases discontinuously "
...            "with the luminosity .")
>>> srclen = len(srctext.split())
>>> trglen = len(trgtext.split())
>>>
>>> gdfa = grow_diag_final_and(srclen, trglen, forw, back)
>>> gdfa == set([(28, 18), (6, 6), (24, 17), (2, 1), (15, 12), (13, 12),
...     (2, 9), (3, 10), (26, 17), (25, 15), (8, 6), (9, 7), (20, 13),
...     (18, 13), (0, 0), (10, 4), (13, 15), (23, 14), (7, 5),
...     (25, 14), (1, 9), (17, 13), (4, 11), (11, 17), (9, 2),
...     (22, 12), (27, 18), (24, 16), (21, 3), (19, 12), (17, 12),
...     (5, 12), (11, 6), (12, 8)])
True
References: Koehn, P., A. Axelrod, A. Birch, C. Callison, M. Osborne, and D. Talbot. 2005. Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation. In MT Eval Workshop.
Parameters: - srclen (int) – the number of tokens in the source language
- trglen (int) – the number of tokens in the target language
- e2f (str) – the forward word alignment outputs from source-to-target language (in pharaoh output format)
- f2e (str) – the backward word alignment outputs from target-to-source language (in pharaoh output format)
Return type: set(tuple(int))
Returns: the symmetrized alignment points from the GDFA algorithm
nltk.align.ibm1 module¶
Lexical translation model that ignores word order.
In IBM Model 1, word order is ignored for simplicity. Thus, the following two alignments are equally likely.
Source:    je mange du jambon
Target:    i eat some ham
Alignment: (1,1) (2,2) (3,3) (4,4)

Source:    je mange du jambon
Target:    some ham eat i
Alignment: (1,4) (2,3) (3,2) (4,1)
The EM algorithm used in Model 1 is:

E step - In the training data, count how many times a source language word is translated into a target language word, weighted by the prior probability of the translation.

M step - Estimate the new probability of translation based on the counts from the Expectation step.
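The two steps above can be sketched as a minimal EM loop. This is a standalone illustration with simplified conventions (the NULL word and the model's probability-table classes are omitted), not the nltk.align.ibm1 implementation:

```python
from collections import defaultdict

def train_model1(bitext, iterations=5):
    """Minimal IBM Model 1 EM. bitext: (source_tokens, target_tokens) pairs.
    t[(s, tw)] approximates the translation probability of source word s
    given target word tw."""
    src_vocab = {s for src, _ in bitext for s in src}
    t = defaultdict(lambda: 1.0 / len(src_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                  # expected counts c(s, tw)
        total = defaultdict(float)                  # expected counts c(tw)
        for src, trg in bitext:                     # E step
            for s in src:
                norm = sum(t[(s, tw)] for tw in trg)
                for tw in trg:
                    delta = t[(s, tw)] / norm       # fractional count
                    count[(s, tw)] += delta
                    total[tw] += delta
        for (s, tw) in count:                       # M step: re-estimate
            t[(s, tw)] = count[(s, tw)] / total[tw]
    return t

t = train_model1([('das haus'.split(), 'the house'.split()),
                  ('das buch'.split(), 'the book'.split()),
                  ('ein buch'.split(), 'a book'.split())])
```

After a few iterations the co-occurrence counts pull the probability mass toward the consistent pairs, so t[('buch', 'book')] dominates t[('das', 'book')] even though both words co-occur with 'book'.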
Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
s: A word in the source language
t: A word in the target language
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.
class nltk.align.ibm1.IBMModel1(sentence_aligned_corpus, iterations)[source]¶
Bases: nltk.align.ibm_model.IBMModel

Lexical translation model that ignores word order.

>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))

>>> ibm1 = IBMModel1(bitext, 5)

>>> print('{0:.3f}'.format(ibm1.translation_table['buch']['book']))
0.889
>>> print('{0:.3f}'.format(ibm1.translation_table['das']['book']))
0.062
>>> print('{0:.3f}'.format(ibm1.translation_table['buch'][None]))
0.113
>>> print('{0:.3f}'.format(ibm1.translation_table['ja'][None]))
0.073

>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)])
nltk.align.ibm2 module¶
Lexical translation model that considers word order.
IBM Model 2 improves on Model 1 by accounting for word order. An alignment probability is introduced, a(i | j,l,m), which predicts a source word position, given its aligned target word’s position.
The EM algorithm used in Model 2 is:

E step - In the training data, collect counts, weighted by prior probabilities.
(a) count how many times a source language word is translated into a target language word
(b) count how many times a particular position in the source sentence is aligned to a particular position in the target sentence

M step - Estimate new probabilities based on the counts from the E step.

Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
l: Number of words in the source sentence, excluding NULL
m: Number of words in the target sentence
s: A word in the source language
t: A word in the target language
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.
class nltk.align.ibm2.IBMModel2(sentence_aligned_corpus, iterations)[source]¶
Bases: nltk.align.ibm_model.IBMModel

Lexical translation model that considers word order.

>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))

>>> ibm2 = IBMModel2(bitext, 5)

>>> print('{0:.3f}'.format(ibm2.translation_table['buch']['book']))
1.000
>>> print('{0:.3f}'.format(ibm2.translation_table['das']['book']))
0.000
>>> print('{0:.3f}'.format(ibm2.translation_table['buch'][None]))
0.000
>>> print('{0:.3f}'.format(ibm2.translation_table['ja'][None]))
0.000

>>> print('{0:.3f}'.format(ibm2.alignment_table[1][1][2][2]))
0.939
>>> print('{0:.3f}'.format(ibm2.alignment_table[1][2][2][2]))
0.000
>>> print('{0:.3f}'.format(ibm2.alignment_table[2][2][4][5]))
1.000

>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)])
nltk.align.ibm3 module¶
Translation model that considers how a word can be aligned to multiple words in another language.
IBM Model 3 improves on Model 2 by directly modeling the phenomenon where a word in one language may be translated into zero or more words in another. This is expressed by the fertility probability, n(phi | source word).
If a source word translates into more than one word, it is possible to generate sentences that have the same alignment in multiple ways. This is modeled by a distortion step. The distortion probability, d(j|i,l,m), predicts a target word position, given its aligned source word’s position. The distortion probability replaces the alignment probability of Model 2.
The fertility probability is not applicable for NULL. Target words that align to NULL are assumed to be distributed uniformly in the target sentence. The existence of these words is modeled by p1, the probability that a target word produced by a real source word requires another target word that is produced by NULL.
The EM algorithm used in Model 3 is:

E step - In the training data, collect counts, weighted by prior probabilities.
(a) count how many times a source language word is translated into a target language word
(b) count how many times a particular position in the target sentence is aligned to a particular position in the source sentence
(c) count how many times a source word is aligned to phi number of target words
(d) count how many times NULL is aligned to a target word

M step - Estimate new probabilities based on the counts from the E step.
Because there are too many possible alignments, only the most probable ones are considered. First, the best alignment is determined using prior probabilities. Then, a hill climbing approach is used to find other good candidates.
Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
l: Number of words in the source sentence, excluding NULL
m: Number of words in the target sentence
s: A word in the source language
t: A word in the target language
phi: Fertility, the number of target words produced by a source word
p1: Probability that a target word produced by a source word is accompanied by another target word that is aligned to NULL
p0: 1 - p1
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.
class nltk.align.ibm3.IBMModel3(sentence_aligned_corpus, iterations)[source]¶
Bases: nltk.align.ibm_model.IBMModel

Translation model that considers how a word can be aligned to multiple words in another language.

>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))
>>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book']))
>>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize']))

>>> ibm3 = IBMModel3(bitext, 5)

>>> print('{0:.3f}'.format(ibm3.translation_table['buch']['book']))
1.000
>>> print('{0:.3f}'.format(ibm3.translation_table['das']['book']))
0.000
>>> print('{0:.3f}'.format(ibm3.translation_table['ja'][None]))
1.000

>>> print('{0:.3f}'.format(ibm3.distortion_table[1][1][2][2]))
1.000
>>> print('{0:.3f}'.format(ibm3.distortion_table[1][2][2][2]))
0.000
>>> print('{0:.3f}'.format(ibm3.distortion_table[2][2][4][5]))
0.750

>>> print('{0:.3f}'.format(ibm3.fertility_table[2]['summarize']))
1.000
>>> print('{0:.3f}'.format(ibm3.fertility_table[1]['book']))
1.000

>>> print('{0:.3f}'.format(ibm3.p1))
0.026

>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)])

distortion_table = None¶
dict[int][int][int][int]: float. Probability(j | i,l,m). Values accessed as distortion_table[j][i][l][m].
-
nltk.align.ibm4 module¶
Translation model that reorders output words based on their type and distance from other related words in the output sentence.
IBM Model 4 improves the distortion model of Model 3, motivated by the observation that certain words tend to be re-ordered in a predictable way relative to one another. For example, <adjective><noun> in English usually has its order flipped as <noun><adjective> in French.
Model 4 requires words in the source and target vocabularies to be categorized into classes. This can be linguistically driven, like parts of speech (adjective, nouns, prepositions, etc). Word classes can also be obtained by statistical methods. The original IBM Model 4 uses an information theoretic approach to group words into 50 classes for each vocabulary.
Terminology:

Cept: A source word with non-zero fertility, i.e. aligned to one or more target words.

Tablet: The set of target word(s) aligned to a cept.

Head of cept: The first word of the tablet of that cept.

Center of cept: The average position of the words in that cept’s tablet. If the value is not an integer, the ceiling is taken. For example, for a tablet with words in positions 2, 5, 6 in the target sentence, the center of the corresponding cept is ceil((2 + 5 + 6) / 3) = 5.

Displacement: For a head word, defined as (position of head word - position of previous cept’s center). Can be positive or negative. For a non-head word, defined as (position of non-head word - position of previous word in the same tablet). Always positive, because successive words in a tablet are assumed to appear to the right of the previous word.

In contrast to Model 3, which reorders words in a tablet independently of other words, Model 4 distinguishes between three cases:
(1) Words generated by NULL are distributed uniformly.
(2) For a head word t, its position is modeled by the probability d_head(displacement | word_class_s(s), word_class_t(t)), where s is the previous cept, and word_class_s and word_class_t map s and t to a source and target language word class respectively.
(3) For a non-head word t, its position is modeled by the probability d_non_head(displacement | word_class_t(t)).
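The center-of-cept and head-displacement definitions above reduce to simple arithmetic. A sketch using the worked example from the terminology list (the helper names are hypothetical, not part of the nltk.align.ibm4 API):

```python
import math

def cept_center(tablet_positions):
    """Ceiling of the average target position of a cept's tablet."""
    return math.ceil(sum(tablet_positions) / len(tablet_positions))

def head_displacement(head_position, previous_center):
    """Position of the head word minus the previous cept's center.
    Unlike non-head displacements, this may be negative."""
    return head_position - previous_center

center = cept_center([2, 5, 6])   # ceil(13 / 3) = 5, as in the example above
```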
The EM algorithm used in Model 4 is:

E step - In the training data, collect counts, weighted by prior probabilities.
(a) count how many times a source language word is translated into a target language word
(b) for a particular word class, count how many times a head word is located at a particular displacement from the previous cept’s center
(c) for a particular word class, count how many times a non-head word is located at a particular displacement from the previous target word
(d) count how many times a source word is aligned to phi number of target words
(e) count how many times NULL is aligned to a target word

M step - Estimate new probabilities based on the counts from the E step.
Like Model 3, there are too many possible alignments to consider. Thus, a hill climbing approach is used to sample good candidates.
Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
l: Number of words in the source sentence, excluding NULL
m: Number of words in the target sentence
s: A word in the source language
t: A word in the target language
phi: Fertility, the number of target words produced by a source word
p1: Probability that a target word produced by a source word is accompanied by another target word that is aligned to NULL
p0: 1 - p1
dj: Displacement, Δj
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.
class nltk.align.ibm4.IBMModel4(sentence_aligned_corpus, iterations, source_word_classes, target_word_classes, probability_tables=None)[source]¶
Bases: nltk.align.ibm_model.IBMModel

Translation model that reorders output words based on their type and their distance from other related words in the output sentence.

>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))
>>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book']))
>>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize']))
>>> src_classes = {'the': 0, 'a': 0, 'small': 1, 'big': 1, 'house': 2, 'book': 2, 'is': 3, 'i': 4, 'summarize': 5 }
>>> trg_classes = {'das': 0, 'ein': 0, 'haus': 1, 'buch': 1, 'klein': 2, 'groß': 2, 'ist': 3, 'ja': 4, 'ich': 5, 'fasse': 6, 'zusammen': 6 }

>>> ibm4 = IBMModel4(bitext, 5, src_classes, trg_classes)

>>> print('{0:.3f}'.format(ibm4.translation_table['buch']['book']))
1.000
>>> print('{0:.3f}'.format(ibm4.translation_table['das']['book']))
0.000
>>> print('{0:.3f}'.format(ibm4.translation_table['ja'][None]))
1.000

>>> print('{0:.3f}'.format(ibm4.head_distortion_table[1][0][1]))
1.000
>>> print('{0:.3f}'.format(ibm4.head_distortion_table[2][0][1]))
0.000
>>> print('{0:.3f}'.format(ibm4.non_head_distortion_table[3][6]))
0.500

>>> print('{0:.3f}'.format(ibm4.fertility_table[2]['summarize']))
1.000
>>> print('{0:.3f}'.format(ibm4.fertility_table[1]['book']))
1.000

>>> print('{0:.3f}'.format(ibm4.p1))
0.033

>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)])

prob_t_a_given_s(alignment_info)[source]¶
Probability of target sentence and an alignment given the source sentence.
class nltk.align.ibm4.Model4Counts[source]¶
Bases: nltk.align.ibm_model.Counts

Data object to store counts of various parameters during training. Includes counts for distortion.
nltk.align.ibm5 module¶
Translation model that keeps track of vacant positions in the target sentence to decide where to place translated words.
Translation can be viewed as a process where each word in the source sentence is stepped through sequentially, generating translated words for each source word. The target sentence can be viewed as being made up of m empty slots initially, which gradually fill up as generated words are placed in them.
Models 3 and 4 use distortion probabilities to decide how to place translated words. For simplicity, these models ignore the history of which slots have already been occupied with translated words. Consider the placement of the last translated word: there is only one empty slot left in the target sentence, so the distortion probability should be 1.0 for that position and 0.0 everywhere else. However, the distortion probabilities for Models 3 and 4 are set up such that all positions are under consideration.
IBM Model 5 fixes this deficiency by accounting for occupied slots during translation. It introduces the vacancy function v(j), the number of vacancies up to, and including, position j in the target sentence.
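The vacancy function has a direct reading; a sketch (the helper name is hypothetical, not part of the nltk.align.ibm5 API):

```python
def vacancies(occupied_positions, j):
    """v(j): number of vacant target slots up to and including position j,
    given the set of already-occupied positions (1-based)."""
    return sum(1 for k in range(1, j + 1) if k not in occupied_positions)

# With positions 2 and 4 of a 5-word target sentence already filled,
# slots 1, 3 and 5 remain vacant:
v5 = vacancies({2, 4}, 5)   # 3
v3 = vacancies({2, 4}, 3)   # 2 (slots 1 and 3)
```

In the deficiency example above, once every slot but one is occupied, v(j) increases by 1 at exactly one position, which is what forces the last word into the remaining slot.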
Terminology:

Maximum vacancy: The number of valid slots that a word can be placed in. This is not necessarily the same as the number of vacant slots. For example, if a tablet contains more than one word, the head word cannot be placed at the last vacant slot because there will be no space for the other words in the tablet. The number of valid slots has to take into account the length of the tablet. Non-head words cannot be placed before the head word, so vacancies to the left of the head word are ignored.

Vacancy difference: For a head word: (v(j) - v(center of previous cept)). Can be positive or negative. For a non-head word: (v(j) - v(position of previously placed word)). Always positive, because successive words in a tablet are assumed to appear to the right of the previous word.

Positioning of target words falls under three cases:
(1) Words generated by NULL are distributed uniformly.
(2) For a head word t, its position is modeled by the probability v_head(dv | max_v, word_class_t(t)).
(3) For a non-head word t, its position is modeled by the probability v_non_head(dv | max_v, word_class_t(t)).

dv and max_v are defined differently for head and non-head words.
The EM algorithm used in Model 5 is:

E step - In the training data, collect counts, weighted by prior probabilities.
(a) count how many times a source language word is translated into a target language word
(b) for a particular word class and maximum vacancy, count how many times a head word and the previous cept’s center have a particular difference in number of vacancies
(c) for a particular word class and maximum vacancy, count how many times a non-head word and the previous target word have a particular difference in number of vacancies
(d) count how many times a source word is aligned to phi number of target words
(e) count how many times NULL is aligned to a target word

M step - Estimate new probabilities based on the counts from the E step.
Like Model 4, there are too many possible alignments to consider. Thus, a hill climbing approach is used to sample good candidates. In addition, pruning is used to weed out unlikely alignments based on Model 4 scores.
Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
l: Number of words in the source sentence, excluding NULL
m: Number of words in the target sentence
s: A word in the source language
t: A word in the target language
phi: Fertility, the number of target words produced by a source word
p1: Probability that a target word produced by a source word is accompanied by another target word that is aligned to NULL
p0: 1 - p1
max_v: Maximum vacancy
dv: Vacancy difference, Δv
The definition of v_head here differs from GIZA++, section 4.7 of [Brown et al., 1993], and [Koehn, 2010]. In the latter cases, v_head is v_head(v(j) | v(center of previous cept),max_v,word_class(t)).
Here, we follow appendix B of [Brown et al., 1993] and combine v(j) with v(center of previous cept) to obtain dv: v_head(v(j) - v(center of previous cept) | max_v,word_class(t)).
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.
-
class
nltk.align.ibm5.
IBMModel5
(sentence_aligned_corpus, iterations, source_word_classes, target_word_classes, probability_tables=None)[source]¶ Bases:
nltk.align.ibm_model.IBMModel
Translation model that keeps track of vacant positions in the target sentence to decide where to place translated words
>>> bitext = [] >>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small'])) >>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big'])) >>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small'])) >>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small'])) >>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house'])) >>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book'])) >>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book'])) >>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book'])) >>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize'])) >>> src_classes = {'the': 0, 'a': 0, 'small': 1, 'big': 1, 'house': 2, 'book': 2, 'is': 3, 'i': 4, 'summarize': 5 } >>> trg_classes = {'das': 0, 'ein': 0, 'haus': 1, 'buch': 1, 'klein': 2, 'groß': 2, 'ist': 3, 'ja': 4, 'ich': 5, 'fasse': 6, 'zusammen': 6 }
>>> ibm5 = IBMModel5(bitext, 5, src_classes, trg_classes)
>>> print('{0:.3f}'.format(ibm5.head_vacancy_table[1][1][1]))
1.000
>>> print('{0:.3f}'.format(ibm5.head_vacancy_table[2][1][1]))
0.000
>>> print('{0:.3f}'.format(ibm5.non_head_vacancy_table[3][3][6]))
1.000
>>> print('{0:.3f}'.format(ibm5.fertility_table[2]['summarize']))
1.000
>>> print('{0:.3f}'.format(ibm5.fertility_table[1]['book']))
1.000
>>> print('{0:.3f}'.format(ibm5.p1)) 0.033
>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)])
-
MIN_SCORE_FACTOR
= 0.2¶ Alignments with scores below this factor are pruned during sampling
-
hillclimb
(alignment_info, j_pegged=None)[source]¶ Starting from the alignment in
alignment_info
, look at neighboring alignments iteratively for the best one, according to Model 4.

Note that Model 4 scoring is used instead of Model 5 because the latter is too expensive to compute.
There is no guarantee that the best alignment in the alignment space will be found, because the algorithm might be stuck in a local maximum.
Parameters: j_pegged (int) – If specified, the search will be constrained to alignments where j_pegged
remains unchanged. Returns: The best alignment found from hill climbing. Return type: AlignmentInfo
-
prob_t_a_given_s
(alignment_info)[source]¶ Probability of target sentence and an alignment given the source sentence
-
prune
(alignment_infos)[source]¶ Removes alignments from
alignment_infos
that have substantially lower Model 4 scores than the best alignment. Returns: Pruned alignments. Return type: set(AlignmentInfo)
-
sample
(sentence_pair)[source]¶ Sample the most probable alignments from the entire alignment space according to Model 4
Note that Model 4 scoring is used instead of Model 5 because the latter is too expensive to compute.
First, determine the best alignment according to IBM Model 2. With this initial alignment, use hill climbing to determine the best alignment according to IBM Model 4. Add this alignment and its neighbors to the sample set. Repeat this process with other initial alignments obtained by pegging an alignment point. Finally, prune alignments that have substantially lower Model 4 scores than the best alignment.
Parameters: sentence_pair (AlignedSent) – Source and target language sentence pair to generate a sample of alignments from Returns: A set of best alignments represented by their AlignmentInfo
and the best alignment of the set for convenience. Return type: set(AlignmentInfo), AlignmentInfo
-
-
class
nltk.align.ibm5.
Model5Counts
[source]¶ Bases:
nltk.align.ibm_model.Counts
Data object to store counts of various parameters during training. Includes counts for vacancies.
-
update_vacancy
(count, alignment_info, i, trg_classes, slots)[source]¶ Parameters: - count – Value to add to the vacancy counts
- alignment_info – Alignment under consideration
- i – Source word position under consideration
- trg_classes – Target word classes
- slots – Vacancy states of the slots in the target sentence. Output parameter that will be modified as new words are placed in the target sentence.
-
nltk.align.ibm_model module¶
Common methods and classes for all IBM models. See IBMModel1
,
IBMModel2
, IBMModel3
, IBMModel4
, and IBMModel5
for specific implementations.
The IBM models are a series of generative models that learn lexical translation probabilities, p(target language word|source language word), given a sentence-aligned parallel corpus.
The models increase in sophistication from model 1 to 5. Typically, the output of lower models is used to seed the higher models. All models use the Expectation-Maximization (EM) algorithm to learn various probability tables.
Words in a sentence are one-indexed. The first word of a sentence has position 1, not 0. Index 0 is reserved in the source sentence for the NULL token. The concept of position does not apply to NULL, but it is indexed at 0 by convention.
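As a concrete illustration of this indexing convention, the sketch below builds a sentence pair by hand. This is standalone Python with hypothetical variable names, not the nltk data structures:

```python
# Indexing convention used by the IBM models (standalone illustration):
# source index 0 is reserved for the NULL token, and words are one-indexed.
src_sentence = (None, 'das', 'Haus')       # None stands in for NULL at index 0
trg_sentence = ('DUMMY', 'the', 'house')   # dummy element so words start at index 1

# alignment[j] is the source position aligned to target position j;
# a target word generated by NULL would map to source position 0.
alignment = (0, 1, 2)                      # alignment[0] is unused (dummy slot)

for j in range(1, len(trg_sentence)):
    i = alignment[j]
    print(trg_sentence[j], '->', src_sentence[i] if i != 0 else 'NULL')
```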
Each target word is aligned to exactly one source word or the NULL token.
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.
-
class
nltk.align.ibm_model.
AlignmentInfo
(alignment, src_sentence, trg_sentence, cepts)[source]¶ Bases:
object
Helper data object for training IBM Models 3 and up
Read-only. For a source sentence and its counterpart in the target language, this class holds information about the sentence pair’s alignment, cepts, and fertility.
Warning: Alignments are one-indexed here, in contrast to nltk.align.Alignment and nltk.align.AlignedSent, which are zero-indexed. This class is not meant to be used outside of IBM models.
-
alignment
= None¶ tuple(int): Alignment function.
alignment[j]
is the position in the source sentence that is aligned to the position j in the target sentence.
-
center_of_cept
(i)[source]¶ Returns: The ceiling of the average positions of the words in the tablet of cept i
, or 0 ifi
is None
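The return value described above (the ceiling of the average tablet positions, or 0 for NULL) can be sketched as a standalone helper. The function name and the list-of-positions argument are assumptions for illustration, not the nltk method:

```python
import math

def center_of_cept(tablet):
    """Ceiling of the average of the target positions in a cept's tablet.

    tablet: list of one-indexed target positions aligned to a source word,
    or None for the NULL cept, which has no position.
    """
    if tablet is None:
        return 0
    return int(math.ceil(sum(tablet) / len(tablet)))

# e.g. a tablet covering target positions 2, 3 and 7:
print(center_of_cept([2, 3, 7]))  # ceil(12 / 3) = 4
```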
-
cepts
= None¶ list(list(int)): The positions of the target words, in ascending order, aligned to a source word position. For example, cepts[4] = [2, 3, 7] means that words in positions 2, 3 and 7 of the target sentence are aligned to the word in position 4 of the source sentence
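A sketch of how such a cepts table could be derived from the alignment function described above. The helper name is hypothetical and the code is independent of the nltk classes:

```python
def compute_cepts(alignment, src_length):
    """Group target positions by the source position they are aligned to.

    alignment: tuple where alignment[j] is the source position aligned to
    target position j (one-indexed; index 0 is a dummy slot).
    Returns a list where entry i holds, in ascending order, the target
    positions aligned to source position i (entry 0 collects NULL alignments).
    """
    cepts = [[] for _ in range(src_length + 1)]
    for j in range(1, len(alignment)):
        cepts[alignment[j]].append(j)  # j increases, so positions stay sorted
    return cepts

# Target positions 2, 3 and 7 aligned to source position 4, rest to NULL:
alignment = (0, 0, 4, 4, 0, 0, 0, 4)
print(compute_cepts(alignment, 4)[4])  # [2, 3, 7]
```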
-
is_head_word
(j)[source]¶ Returns: Whether the word in position j
of the target sentence is a head word
-
previous_in_tablet
(j)[source]¶ Returns: The position of the previous word that is in the same tablet as j
, or None ifj
is the first word of the tablet
-
score
= None¶ float: Optional. Probability of alignment, as defined by the IBM model that assesses this alignment
-
src_sentence
= None¶ tuple(str): Source sentence referred to by this object. Should include the NULL token (None) at index 0.
-
trg_sentence
= None¶ tuple(str): Target sentence referred to by this object. Should have a dummy element at index 0 so that the first word starts at index 1.
-
-
class
nltk.align.ibm_model.
Counts
[source]¶ Bases:
object
Data object to store counts of various parameters during training
-
class
nltk.align.ibm_model.
IBMModel
(sentence_aligned_corpus)[source]¶ Bases:
object
Abstract base class for all IBM models
-
MIN_PROB
= 1e-12¶
-
best_model2_alignment
(sentence_pair, j_pegged=None, i_pegged=0)[source]¶ Finds the best alignment according to IBM Model 2
Used as a starting point for hill climbing in Models 3 and above, because it is easier to compute than the best alignments in higher models
Parameters: - sentence_pair (AlignedSent) – Source and target language sentence pair to be word-aligned
- j_pegged (int) – If specified, the alignment point of j_pegged will be fixed to i_pegged
- i_pegged (int) – Alignment point to j_pegged
-
hillclimb
(alignment_info, j_pegged=None)[source]¶ Starting from the alignment in
alignment_info
, look at neighboring alignments iteratively for the best oneThere is no guarantee that the best alignment in the alignment space will be found, because the algorithm might be stuck in a local maximum.
Parameters: j_pegged (int) – If specified, the search will be constrained to alignments where j_pegged
remains unchanged. Returns: The best alignment found from hill climbing. Return type: AlignmentInfo
-
neighboring
(alignment_info, j_pegged=None)[source]¶ Determine the neighbors of
alignment_info
, obtained by moving or swapping one alignment point.

Parameters: j_pegged (int) – If specified, neighbors that have a different alignment point from j_pegged will not be considered Returns: A set of neighboring alignments represented by their AlignmentInfo
Return type: set(AlignmentInfo)
-
prob_t_a_given_s
(alignment_info)[source]¶ Probability of target sentence and an alignment given the source sentence
All required information is assumed to be in
alignment_info
and self.

Derived classes should override this method.
-
sample
(sentence_pair)[source]¶ Sample the most probable alignments from the entire alignment space
First, determine the best alignment according to IBM Model 2. With this initial alignment, use hill climbing to determine the best alignment according to a higher IBM Model. Add this alignment and its neighbors to the sample set. Repeat this process with other initial alignments obtained by pegging an alignment point.
Hill climbing may get stuck in a local maximum; hence the pegging of alignment points and the trying out of different initial alignments.
Parameters: sentence_pair (AlignedSent) – Source and target language sentence pair to generate a sample of alignments from Returns: A set of best alignments represented by their AlignmentInfo
and the best alignment of the set for convenience. Return type: set(AlignmentInfo), AlignmentInfo
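The move/swap neighborhood and the greedy climb described above can be sketched as follows. This is a simplified, standalone illustration with a toy scoring function; it ignores NULL handling, fertility, and the dummy index 0, and all names are hypothetical:

```python
import itertools

def neighbors(alignment, src_length, j_pegged=None):
    """Alignments reachable by moving or swapping one alignment point.

    alignment: tuple where alignment[j] is the source position for
    target position j. If j_pegged is given, that point never changes.
    """
    result = set()
    positions = [j for j in range(len(alignment)) if j != j_pegged]
    for j in positions:                                   # moves
        for i in range(src_length + 1):
            moved = list(alignment)
            moved[j] = i
            result.add(tuple(moved))
    for j1, j2 in itertools.combinations(positions, 2):   # swaps
        swapped = list(alignment)
        swapped[j1], swapped[j2] = swapped[j2], swapped[j1]
        result.add(tuple(swapped))
    return result

def hillclimb(alignment, score, src_length, j_pegged=None):
    """Greedy ascent: repeat until no neighbor scores higher."""
    best = alignment
    while True:
        improved = max(neighbors(best, src_length, j_pegged), key=score)
        if score(improved) <= score(best):
            return best
        best = improved

# Toy score that prefers the diagonal alignment (position j aligned to j):
score = lambda a: -sum(abs(a[j] - j) for j in range(len(a)))
print(hillclimb((0, 0, 0), score, src_length=2))  # climbs to (0, 1, 2)
```

In the real models the score would be prob_t_a_given_s and the climb would start from the best Model 2 alignment.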
-
nltk.align.phrase_based module¶
-
nltk.align.phrase_based.
extract
(f_start, f_end, e_start, e_end, alignment, e_aligned, f_aligned, srctext, trgtext, srclen, trglen)[source]¶ This function checks a candidate phrase pair for alignment point consistency and extracts the consistent phrase pairs.
A phrase pair (ē, f̄) is consistent with an alignment A if and only if:

No English words in the phrase pair are aligned to words outside it:
∀ e_i ∈ ē: (e_i, f_j) ∈ A ⇒ f_j ∈ f̄

No Foreign words in the phrase pair are aligned to words outside it:
∀ f_j ∈ f̄: (e_i, f_j) ∈ A ⇒ e_i ∈ ē

The phrase pair contains at least one alignment point:
∃ e_i ∈ ē, f_j ∈ f̄ s.t. (e_i, f_j) ∈ A
Parameters: - f_start (int) – Starting index of the possible foreign language phrases
- f_end (int) – Ending index of the possible foreign language phrases
- e_start (int) – Starting index of the possible source language phrases
- e_end (int) – Ending index of the possible source language phrases
- srctext (list) – The source language tokens, a list of strings.
- trgtext (list) – The target language tokens, a list of strings.
- srclen (int) – The number of tokens in the source language sentence.
- trglen (int) – The number of tokens in the target language sentence.
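The three consistency conditions can be sketched as a standalone check over zero-indexed (e, f) alignment points. The helper name and the half-open index ranges are assumptions for illustration, not the signature of extract:

```python
def is_consistent(e_start, e_end, f_start, f_end, alignment):
    """Check whether the phrase pair ([e_start, e_end), [f_start, f_end))
    is consistent with the word alignment, a list of zero-indexed
    (e_index, f_index) pairs."""
    contains_point = False
    for e_i, f_j in alignment:
        e_in = e_start <= e_i < e_end
        f_in = f_start <= f_j < f_end
        if e_in != f_in:         # an alignment point crosses the boundary
            return False
        if e_in and f_in:
            contains_point = True
    return contains_point        # at least one alignment point required

alignment = [(0, 0), (1, 1), (1, 2)]
print(is_consistent(0, 1, 0, 1, alignment))  # True: (0, 0) is inside
print(is_consistent(0, 2, 0, 2, alignment))  # False: (1, 2) leaks out
```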
-
nltk.align.phrase_based.
phrase_extraction
(srctext, trgtext, alignment)[source]¶ Phrase extraction algorithm extracts all consistent phrase pairs from a word-aligned sentence pair.
The idea is to loop over all possible source language (e) phrases and find the minimal foreign phrase (f) that matches each of them. Matching is done by identifying all alignment points for the source phrase and finding the shortest foreign phrase that includes all the foreign counterparts for the source words.
In short, a phrase alignment has to (a) contain all alignment points for all covered words and (b) contain at least one alignment point.
>>> srctext = "michael assumes that he will stay in the house"
>>> trgtext = "michael geht davon aus , dass er im haus bleibt"
>>> alignment = [(0,0), (1,1), (1,2), (1,3), (2,5), (3,6), (4,9),
...     (5,9), (6,7), (7,7), (8,8)]
>>> phrases = phrase_extraction(srctext, trgtext, alignment)
>>> for i in sorted(phrases):
...     print(i)
...
((0, 1), (0, 1), 'michael', 'michael')
((0, 2), (0, 4), 'michael assumes', 'michael geht davon aus')
((0, 2), (0, 4), 'michael assumes', 'michael geht davon aus ,')
((0, 3), (0, 6), 'michael assumes that', 'michael geht davon aus , dass')
((0, 4), (0, 7), 'michael assumes that he', 'michael geht davon aus , dass er')
((0, 9), (0, 10), 'michael assumes that he will stay in the house', 'michael geht davon aus , dass er im haus bleibt')
((1, 2), (1, 4), 'assumes', 'geht davon aus')
((1, 2), (1, 4), 'assumes', 'geht davon aus ,')
((1, 3), (1, 6), 'assumes that', 'geht davon aus , dass')
((1, 4), (1, 7), 'assumes that he', 'geht davon aus , dass er')
((1, 9), (1, 10), 'assumes that he will stay in the house', 'geht davon aus , dass er im haus bleibt')
((2, 3), (5, 6), 'that', ', dass')
((2, 3), (5, 6), 'that', 'dass')
((2, 4), (5, 7), 'that he', ', dass er')
((2, 4), (5, 7), 'that he', 'dass er')
((2, 9), (5, 10), 'that he will stay in the house', ', dass er im haus bleibt')
((2, 9), (5, 10), 'that he will stay in the house', 'dass er im haus bleibt')
((3, 4), (6, 7), 'he', 'er')
((3, 9), (6, 10), 'he will stay in the house', 'er im haus bleibt')
((4, 6), (9, 10), 'will stay', 'bleibt')
((4, 9), (7, 10), 'will stay in the house', 'im haus bleibt')
((6, 8), (7, 8), 'in the', 'im')
((6, 9), (7, 9), 'in the house', 'im haus')
((8, 9), (8, 9), 'house', 'haus')
Parameters: alignment – The word-level alignments as a list of tuples, where the first element of each tuple is the index of a source word and the second element is the index of the target word it is aligned to. This is also the output format of nltk/align/ibm1.py
Return type: list(tuple) Returns: A list of tuples, where each element is a phrase and each phrase is a tuple made up of (i) its source location, (ii) its target location, (iii) the source phrase, and (iv) the target phrase. The list represents all the possible phrases extracted from the word alignments.
nltk.align.util module¶
Module contents¶
Experimental functionality for bitext alignment. These interfaces are prone to change.