nltk.align package¶
Submodules¶
nltk.align.api module¶
class nltk.align.api.AlignedSent(words=[], mots=[], alignment='', encoding='utf8')[source]¶
Bases: object

Return an aligned sentence object, which encapsulates two sentences along with an Alignment between them.

>>> from nltk.align import AlignedSent
>>> algnsent = AlignedSent(['klein', 'ist', 'das', 'Haus'],
...     ['the', 'house', 'is', 'small'], '0-2 1-3 2-1 3-0')
>>> algnsent.words
['klein', 'ist', 'das', 'Haus']
>>> algnsent.mots
['the', 'house', 'is', 'small']
>>> algnsent.alignment
Alignment([(0, 2), (1, 3), (2, 1), (3, 0)])
>>> algnsent.precision('0-2 1-3 2-1 3-3')
0.75
>>> from nltk.corpus import comtrans
>>> print(comtrans.aligned_sents()[54])
<AlignedSent: 'Weshalb also sollten...' -> 'So why should EU arm...'>
>>> print(comtrans.aligned_sents()[54].alignment)
0-0 0-1 1-0 2-2 3-4 3-5 4-7 5-8 6-3 7-9 8-9 9-10 9-11 10-12 11-6 12-6 13-13

Parameters:
- words (list(str)) – source language words
- mots (list(str)) – target language words
- alignment (Alignment) – the word-level alignments between the source and target language sentences
alignment¶
alignment_error_rate(reference, possible=None)[source]¶
Return the Alignment Error Rate (AER) of an aligned sentence with respect to a “gold standard” reference AlignedSent.

Return an error rate between 0.0 (perfect alignment) and 1.0 (no alignment).

>>> from nltk.align import AlignedSent
>>> s = AlignedSent(["the", "cat"], ["le", "chat"], [(0, 0), (1, 1)])
>>> s.alignment_error_rate(s)
0.0

Parameters:
- reference (AlignedSent or Alignment) – A “gold standard” reference aligned sentence.
- possible (AlignedSent or Alignment or None) – A “gold standard” reference of possible alignments (defaults to reference if None)
Return type: float or None
invert()[source]¶
Return the aligned sentence pair, reversing the directionality.

Return type: AlignedSent
mots¶
precision(reference)[source]¶
Return the precision of an aligned sentence with respect to a “gold standard” reference AlignedSent.

Parameters: reference (AlignedSent or Alignment) – A “gold standard” reference aligned sentence.
Return type: float or None
recall(reference)[source]¶
Return the recall of an aligned sentence with respect to a “gold standard” reference AlignedSent.

Parameters: reference (AlignedSent or Alignment) – A “gold standard” reference aligned sentence.
Return type: float or None
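Precision, recall, and the AER above all reduce to set arithmetic over alignment pairs. A self-contained sketch of that arithmetic, independent of the NLTK classes (`parse_pharaoh` is a hypothetical helper for the `'0-2 1-3'` string format), reproduces the 0.75 precision from the AlignedSent example:

```python
def parse_pharaoh(s):
    """Parse a Pharaoh-format alignment string into a set of (i, j) pairs."""
    return {tuple(int(n) for n in pair.split("-")) for pair in s.split()}

hypothesis = parse_pharaoh("0-2 1-3 2-1 3-0")  # alignment under test, A
sure = parse_pharaoh("0-2 1-3 2-1 3-3")        # "gold standard" sure set, S
possible = sure                                # possible set P defaults to S

precision = len(hypothesis & possible) / len(hypothesis)  # |A ∩ P| / |A|
recall = len(hypothesis & sure) / len(sure)               # |A ∩ S| / |S|
# AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
aer = 1 - (len(hypothesis & sure) + len(hypothesis & possible)) \
      / (len(hypothesis) + len(sure))

print(precision, recall, aer)  # 0.75 0.75 0.25
```

With `possible` equal to `sure`, the AER simplifies to `1 - 2|A ∩ S| / (|A| + |S|)`, which is why a perfect self-comparison gives 0.0.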
words¶
class nltk.align.api.Alignment[source]¶
Bases: frozenset

A storage class for representing alignment between two sequences, s1, s2. In general, an alignment is a set of tuples of the form (i, j, ...) representing an alignment between the i-th element of s1 and the j-th element of s2. Tuples are extensible (they might contain additional data, such as a boolean to indicate sure vs possible alignments).

>>> from nltk.align import Alignment
>>> a = Alignment([(0, 0), (0, 1), (1, 2), (2, 2)])
>>> a.invert()
Alignment([(0, 0), (1, 0), (2, 1), (2, 2)])
>>> print(a.invert())
0-0 1-0 2-1 2-2
>>> a[0]
[(0, 1), (0, 0)]
>>> a.invert()[2]
[(2, 1), (2, 2)]
>>> b = Alignment([(0, 0), (0, 1)])
>>> b.issubset(a)
True
>>> c = Alignment('0-0 0-1')
>>> b == c
True
range(positions=None)[source]¶
Work out the range of the mapping from the given positions. If no positions are specified, compute the range of the entire mapping.
unicode_repr()¶
Produce a Giza-formatted string representing the alignment.
-
nltk.align.bleu_score module¶
BLEU score implementation.
nltk.align.bleu_score.bleu(candidate, references, weights)[source]¶
Calculate BLEU score (Bilingual Evaluation Understudy).

Parameters:
- candidate (list(str)) – a candidate sentence
- references (list(list(str))) – reference sentences
- weights (list(float)) – weights for unigrams, bigrams, trigrams and so on

>>> weights = [0.25, 0.25, 0.25, 0.25]
>>> candidate1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...               'ensures', 'that', 'the', 'military', 'always',
...               'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> candidate2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
...               'forever', 'hearing', 'the', 'activity', 'guidebook',
...               'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will', 'forever',
...               'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> bleu(candidate1, [reference1, reference2, reference3], weights)
0.504...
>>> bleu(candidate2, [reference1, reference2, reference3], weights)
0
Papineni, Kishore, et al. “BLEU: A method for automatic evaluation of machine translation.” Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002. http://www.aclweb.org/anthology/P02-1040.pdf
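The core of the score is the clipped ("modified") n-gram precision introduced in Papineni et al.: each candidate n-gram is credited at most as many times as it occurs in any single reference. A minimal sketch of that step, independent of the NLTK implementation:

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against multiple references."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand_counts = ngrams(candidate)
    # For each n-gram, the maximum count observed in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for ngram, count in ngrams(ref).items():
            max_ref_counts[ngram] = max(max_ref_counts[ngram], count)
    # Clip each candidate count at its reference maximum.
    clipped = sum(min(count, max_ref_counts[ngram])
                  for ngram, count in cand_counts.items())
    return clipped / max(1, sum(cand_counts.values()))

# The classic clipping example from the BLEU paper: a candidate of seven
# 'the's scores 2/7, not 7/7, because 'the' occurs at most twice in the
# reference sentence.
p1 = modified_precision(['the'] * 7,
                        [['the', 'cat', 'is', 'on', 'the', 'mat']], 1)
```

bleu then combines the per-order precisions as a weighted geometric mean (using the weights parameter) and multiplies by a brevity penalty for candidates shorter than the references.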
nltk.align.gale_church module¶
A port of the Gale-Church Aligner.
Gale & Church (1993), A Program for Aligning Sentences in Bilingual Corpora. http://aclweb.org/anthology/J93-1004.pdf
class nltk.align.gale_church.LanguageIndependent[source]¶
Bases: object

AVERAGE_CHARACTERS = 1¶

PRIORS = {(0, 1): 0.0099, (1, 2): 0.089, (2, 2): 0.011, (1, 0): 0.0099, (1, 1): 0.89, (2, 1): 0.089}¶

VARIANCE_CHARACTERS = 6.8¶
nltk.align.gale_church.align_blocks(source_sents, target_sents, params=<class 'nltk.align.gale_church.LanguageIndependent'>)[source]¶
Return the sentence alignment of two text blocks (usually paragraphs).

>>> align_blocks([5,5,5], [7,7,7])
[(0, 0), (1, 1), (2, 2)]
>>> align_blocks([10,5,5], [12,20])
[(0, 0), (1, 1), (2, 1)]
>>> align_blocks([12,20], [10,5,5])
[(0, 0), (1, 1), (1, 2)]
>>> align_blocks([10,2,10,10,2,10], [12,3,20,3,12])
[(0, 0), (1, 1), (2, 2), (3, 2), (4, 3), (5, 4)]

@param source_sents: The list of source sentence lengths.
@param target_sents: The list of target sentence lengths.
@param params: The sentence alignment parameters.
@return: The sentence alignments, a list of index pairs.
nltk.align.gale_church.align_log_prob(i, j, source_sents, target_sents, alignment, params)[source]¶
Return the log probability of the two sentences source_sents[i] and target_sents[j] being aligned with a specific alignment.

@param i: The offset of the source sentence.
@param j: The offset of the target sentence.
@param source_sents: The list of source sentence lengths.
@param target_sents: The list of target sentence lengths.
@param alignment: The alignment type, a tuple of two integers.
@param params: The sentence alignment parameters.
@return: The log probability of a specific alignment between the two sentences, given the parameters.
nltk.align.gale_church.align_texts(source_blocks, target_blocks, params=<class 'nltk.align.gale_church.LanguageIndependent'>)[source]¶
Create the sentence alignment of two texts.

Texts can consist of several blocks. Block boundaries cannot be crossed by sentence alignment links.

Each block consists of a list that contains the lengths (in characters) of the sentences in this block.

@param source_blocks: The list of blocks in the source text.
@param target_blocks: The list of blocks in the target text.
@param params: The sentence alignment parameters.
@return: A list of sentence alignment lists.
nltk.align.gale_church.norm_cdf(x)[source]¶
Return the area under the normal distribution from -∞ to x.
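The standard normal CDF has a closed form in terms of the error function; a one-line equivalent (a sketch of the mathematics, not necessarily the module's actual implementation):

```python
import math

def norm_cdf(x):
    # Area under the standard normal density from -infinity to x,
    # via the identity Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    return (1.0 + math.erf(x / math.sqrt(2.0))) / 2.0

phi_zero = norm_cdf(0.0)   # 0.5 by symmetry
phi_196 = norm_cdf(1.96)   # ~0.975, the familiar 95% two-sided bound
```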
nltk.align.gale_church.parse_token_stream(stream, soft_delimiter, hard_delimiter)[source]¶
Parse a stream of tokens and split it into sentences (using soft_delimiter tokens) and blocks (using hard_delimiter tokens) for use with the align_texts function.
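A sketch of the behaviour described above (an illustration, not the module's actual implementation): soft delimiters end sentences, hard delimiters end blocks, and each sentence is reduced to its length in characters — the shape of input align_texts expects.

```python
def parse_token_stream(stream, soft_delimiter, hard_delimiter):
    """Split a token stream into blocks of sentence lengths (in characters)."""
    blocks, block, sentence = [], [], []
    for token in stream:
        if token == soft_delimiter:        # end of sentence
            block.append(sum(len(t) for t in sentence))
            sentence = []
        elif token == hard_delimiter:      # end of block
            if sentence:
                block.append(sum(len(t) for t in sentence))
                sentence = []
            blocks.append(block)
            block = []
        else:
            sentence.append(token)
    if sentence:                           # flush any trailing sentence
        block.append(sum(len(t) for t in sentence))
    if block:                              # and any trailing block
        blocks.append(block)
    return blocks

result = parse_token_stream(["ab", "c", ".", "de", ".", "#", "f", "."],
                            soft_delimiter=".", hard_delimiter="#")
```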
nltk.align.gdfa module¶
nltk.align.gdfa.grow_diag_final_and(srclen, trglen, e2f, f2e)[source]¶
Symmetrize the source-to-target and target-to-source word alignment outputs into a single alignment, using the grow-diag-final-and (GDFA) algorithm (Koehn, 2005).

Step 1: Find the intersection of the bidirectional alignments.

Step 2: Search for additional neighbor alignment points to be added, given these criteria: (i) neighbor alignment points are not in the intersection and (ii) neighbor alignments are in the union.

Step 3: Add all other alignment points that are not in the intersection and not among the neighboring alignments that met the criteria, but are in the original forward/backward alignment outputs.

>>> forw = ('0-0 2-1 9-2 21-3 10-4 7-5 11-6 9-7 12-8 1-9 3-10 '
...         '4-11 17-12 17-13 25-14 13-15 24-16 11-17 28-18')
>>> back = ('0-0 1-9 2-9 3-10 4-11 5-12 6-6 7-5 8-6 9-7 10-4 '
...         '11-6 12-8 13-12 15-12 17-13 18-13 19-12 20-13 '
...         '21-3 22-12 23-14 24-17 25-15 26-17 27-18 28-18')
>>> srctext = ("この よう な ハロー 白色 わい 星 の L 関数 "
...            "は L と 共 に 不連続 に 増加 する こと が "
...            "期待 さ れる こと を 示し た 。")
>>> trgtext = ("Therefore , we expect that the luminosity function "
...            "of such halo white dwarfs increases discontinuously "
...            "with the luminosity .")
>>> srclen = len(srctext.split())
>>> trglen = len(trgtext.split())
>>>
>>> gdfa = grow_diag_final_and(srclen, trglen, forw, back)
>>> gdfa == set([(28, 18), (6, 6), (24, 17), (2, 1), (15, 12), (13, 12),
...     (2, 9), (3, 10), (26, 17), (25, 15), (8, 6), (9, 7), (20, 13),
...     (18, 13), (0, 0), (10, 4), (13, 15), (23, 14), (7, 5),
...     (25, 14), (1, 9), (17, 13), (4, 11), (11, 17), (9, 2),
...     (22, 12), (27, 18), (24, 16), (21, 3), (19, 12), (17, 12),
...     (5, 12), (11, 6), (12, 8)])
True
References: Koehn, P., A. Axelrod, A. Birch, C. Callison, M. Osborne, and D. Talbot. 2005. Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation. In MT Eval Workshop.
Parameters: - srclen (int) – the number of tokens in the source language
- trglen (int) – the number of tokens in the target language
- e2f (str) – the forward word alignment outputs from source-to-target language (in pharaoh output format)
- f2e (str) – the backward word alignment outputs from target-to-source language (in pharaoh output format)
Return type: set(tuple(int))
Returns: the symmetrized alignment points from the GDFA algorithm
nltk.align.ibm1 module¶
Lexical translation model that ignores word order.
In IBM Model 1, word order is ignored for simplicity. Thus, the following two alignments are equally likely.
Source:    je mange du jambon
Target:    i eat some ham
Alignment: (1,1) (2,2) (3,3) (4,4)

Source:    je mange du jambon
Target:    some ham eat i
Alignment: (1,4) (2,3) (3,2) (4,1)
The EM algorithm used in Model 1 is:

E step - In the training data, count how many times a source language word is translated into a target language word, weighted by the prior probability of the translation.

M step - Estimate the new probability of translation based on the counts from the Expectation step.
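The two steps above can be sketched as a minimal EM loop. This is a standalone illustration with simplified conventions (the NULL word and the model's probability-table classes are omitted), not the nltk.align.ibm1 implementation:

```python
from collections import defaultdict

def train_model1(bitext, iterations=5):
    """Minimal IBM Model 1 EM. bitext: (source_tokens, target_tokens) pairs.
    t[(s, tw)] approximates the translation probability of source word s
    given target word tw."""
    src_vocab = {s for src, _ in bitext for s in src}
    t = defaultdict(lambda: 1.0 / len(src_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                  # expected counts c(s, tw)
        total = defaultdict(float)                  # expected counts c(tw)
        for src, trg in bitext:                     # E step
            for s in src:
                norm = sum(t[(s, tw)] for tw in trg)
                for tw in trg:
                    delta = t[(s, tw)] / norm       # fractional count
                    count[(s, tw)] += delta
                    total[tw] += delta
        for (s, tw) in count:                       # M step: re-estimate
            t[(s, tw)] = count[(s, tw)] / total[tw]
    return t

t = train_model1([('das haus'.split(), 'the house'.split()),
                  ('das buch'.split(), 'the book'.split()),
                  ('ein buch'.split(), 'a book'.split())])
```

After a few iterations the co-occurrence counts pull the probability mass toward the consistent pairs, so t[('buch', 'book')] dominates t[('das', 'book')] even though both words co-occur with 'book'.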
Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
s: A word in the source language
t: A word in the target language
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.
class nltk.align.ibm1.IBMModel1(sentence_aligned_corpus, iterations)[source]¶
Bases: nltk.align.ibm_model.IBMModel

Lexical translation model that ignores word order.

>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))

>>> ibm1 = IBMModel1(bitext, 5)

>>> print('{0:.3f}'.format(ibm1.translation_table['buch']['book']))
0.889
>>> print('{0:.3f}'.format(ibm1.translation_table['das']['book']))
0.062
>>> print('{0:.3f}'.format(ibm1.translation_table['buch'][None]))
0.113
>>> print('{0:.3f}'.format(ibm1.translation_table['ja'][None]))
0.073

>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)])
nltk.align.ibm2 module¶
Lexical translation model that considers word order.
IBM Model 2 improves on Model 1 by accounting for word order. An alignment probability is introduced, a(i | j,l,m), which predicts a source word position, given its aligned target word’s position.
The EM algorithm used in Model 2 is:

E step - In the training data, collect counts, weighted by prior probabilities.
(a) count how many times a source language word is translated into a target language word
(b) count how many times a particular position in the source sentence is aligned to a particular position in the target sentence

M step - Estimate new probabilities based on the counts from the E step.

Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
l: Number of words in the source sentence, excluding NULL
m: Number of words in the target sentence
s: A word in the source language
t: A word in the target language
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.
class nltk.align.ibm2.IBMModel2(sentence_aligned_corpus, iterations)[source]¶
Bases: nltk.align.ibm_model.IBMModel

Lexical translation model that considers word order.

>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))

>>> ibm2 = IBMModel2(bitext, 5)

>>> print('{0:.3f}'.format(ibm2.translation_table['buch']['book']))
1.000
>>> print('{0:.3f}'.format(ibm2.translation_table['das']['book']))
0.000
>>> print('{0:.3f}'.format(ibm2.translation_table['buch'][None]))
0.000
>>> print('{0:.3f}'.format(ibm2.translation_table['ja'][None]))
0.000

>>> print('{0:.3f}'.format(ibm2.alignment_table[1][1][2][2]))
0.939
>>> print('{0:.3f}'.format(ibm2.alignment_table[1][2][2][2]))
0.000
>>> print('{0:.3f}'.format(ibm2.alignment_table[2][2][4][5]))
1.000

>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)])
nltk.align.ibm3 module¶
Translation model that considers how a word can be aligned to multiple words in another language.
IBM Model 3 improves on Model 2 by directly modeling the phenomenon where a word in one language may be translated into zero or more words in another. This is expressed by the fertility probability, n(phi | source word).
If a source word translates into more than one word, it is possible to generate sentences that have the same alignment in multiple ways. This is modeled by a distortion step. The distortion probability, d(j|i,l,m), predicts a target word position, given its aligned source word’s position. The distortion probability replaces the alignment probability of Model 2.
The fertility probability is not applicable for NULL. Target words that align to NULL are assumed to be distributed uniformly in the target sentence. The existence of these words is modeled by p1, the probability that a target word produced by a real source word requires another target word that is produced by NULL.
The EM algorithm used in Model 3 is:

E step - In the training data, collect counts, weighted by prior probabilities.
(a) count how many times a source language word is translated into a target language word
(b) count how many times a particular position in the target sentence is aligned to a particular position in the source sentence
(c) count how many times a source word is aligned to phi number of target words
(d) count how many times NULL is aligned to a target word

M step - Estimate new probabilities based on the counts from the E step.
Because there are too many possible alignments, only the most probable ones are considered. First, the best alignment is determined using prior probabilities. Then, a hill climbing approach is used to find other good candidates.
Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
l: Number of words in the source sentence, excluding NULL
m: Number of words in the target sentence
s: A word in the source language
t: A word in the target language
phi: Fertility, the number of target words produced by a source word
p1: Probability that a target word produced by a source word is accompanied by another target word that is aligned to NULL
p0: 1 - p1
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.
class nltk.align.ibm3.IBMModel3(sentence_aligned_corpus, iterations)[source]¶
Bases: nltk.align.ibm_model.IBMModel

Translation model that considers how a word can be aligned to multiple words in another language.

>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))
>>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book']))
>>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize']))

>>> ibm3 = IBMModel3(bitext, 5)

>>> print('{0:.3f}'.format(ibm3.translation_table['buch']['book']))
1.000
>>> print('{0:.3f}'.format(ibm3.translation_table['das']['book']))
0.000
>>> print('{0:.3f}'.format(ibm3.translation_table['ja'][None]))
1.000

>>> print('{0:.3f}'.format(ibm3.distortion_table[1][1][2][2]))
1.000
>>> print('{0:.3f}'.format(ibm3.distortion_table[1][2][2][2]))
0.000
>>> print('{0:.3f}'.format(ibm3.distortion_table[2][2][4][5]))
0.750

>>> print('{0:.3f}'.format(ibm3.fertility_table[2]['summarize']))
1.000
>>> print('{0:.3f}'.format(ibm3.fertility_table[1]['book']))
1.000

>>> print('{0:.3f}'.format(ibm3.p1))
0.026

>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)])

distortion_table = None¶
dict[int][int][int][int]: float. Probability(j | i,l,m). Values accessed as distortion_table[j][i][l][m].
-
nltk.align.ibm4 module¶
Translation model that reorders output words based on their type and distance from other related words in the output sentence.
IBM Model 4 improves the distortion model of Model 3, motivated by the observation that certain words tend to be re-ordered in a predictable way relative to one another. For example, <adjective><noun> in English usually has its order flipped as <noun><adjective> in French.
Model 4 requires words in the source and target vocabularies to be categorized into classes. This can be linguistically driven, like parts of speech (adjective, nouns, prepositions, etc). Word classes can also be obtained by statistical methods. The original IBM Model 4 uses an information theoretic approach to group words into 50 classes for each vocabulary.
Terminology:

Cept: A source word with non-zero fertility, i.e. aligned to one or more target words.

Tablet: The set of target word(s) aligned to a cept.

Head of cept: The first word of the tablet of that cept.

Center of cept: The average position of the words in that cept’s tablet. If the value is not an integer, the ceiling is taken. For example, for a tablet with words in positions 2, 5, 6 in the target sentence, the center of the corresponding cept is ceil((2 + 5 + 6) / 3) = 5.

Displacement: For a head word, defined as (position of head word - position of previous cept’s center). Can be positive or negative. For a non-head word, defined as (position of non-head word - position of previous word in the same tablet). Always positive, because successive words in a tablet are assumed to appear to the right of the previous word.

In contrast to Model 3, which reorders words in a tablet independently of other words, Model 4 distinguishes between three cases:
(1) Words generated by NULL are distributed uniformly.
(2) For a head word t, its position is modeled by the probability d_head(displacement | word_class_s(s), word_class_t(t)), where s is the previous cept, and word_class_s and word_class_t map s and t to a source and target language word class respectively.
(3) For a non-head word t, its position is modeled by the probability d_non_head(displacement | word_class_t(t)).
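The center-of-cept and head-displacement definitions above reduce to simple arithmetic. A sketch using the worked example from the terminology list (the helper names are hypothetical, not part of the nltk.align.ibm4 API):

```python
import math

def cept_center(tablet_positions):
    """Ceiling of the average target position of a cept's tablet."""
    return math.ceil(sum(tablet_positions) / len(tablet_positions))

def head_displacement(head_position, previous_center):
    """Position of the head word minus the previous cept's center.
    Unlike non-head displacements, this may be negative."""
    return head_position - previous_center

center = cept_center([2, 5, 6])   # ceil(13 / 3) = 5, as in the example above
```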
The EM algorithm used in Model 4 is:

E step - In the training data, collect counts, weighted by prior probabilities.
(a) count how many times a source language word is translated into a target language word
(b) for a particular word class, count how many times a head word is located at a particular displacement from the previous cept’s center
(c) for a particular word class, count how many times a non-head word is located at a particular displacement from the previous target word
(d) count how many times a source word is aligned to phi number of target words
(e) count how many times NULL is aligned to a target word

M step - Estimate new probabilities based on the counts from the E step.
Like Model 3, there are too many possible alignments to consider. Thus, a hill climbing approach is used to sample good candidates.
Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
l: Number of words in the source sentence, excluding NULL
m: Number of words in the target sentence
s: A word in the source language
t: A word in the target language
phi: Fertility, the number of target words produced by a source word
p1: Probability that a target word produced by a source word is accompanied by another target word that is aligned to NULL
p0: 1 - p1
dj: Displacement, Δj
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.
class nltk.align.ibm4.IBMModel4(sentence_aligned_corpus, iterations, source_word_classes, target_word_classes, probability_tables=None)[source]¶
Bases: nltk.align.ibm_model.IBMModel

Translation model that reorders output words based on their type and their distance from other related words in the output sentence.

>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))
>>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book']))
>>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize']))
>>> src_classes = {'the': 0, 'a': 0, 'small': 1, 'big': 1, 'house': 2, 'book': 2, 'is': 3, 'i': 4, 'summarize': 5 }
>>> trg_classes = {'das': 0, 'ein': 0, 'haus': 1, 'buch': 1, 'klein': 2, 'groß': 2, 'ist': 3, 'ja': 4, 'ich': 5, 'fasse': 6, 'zusammen': 6 }

>>> ibm4 = IBMModel4(bitext, 5, src_classes, trg_classes)

>>> print('{0:.3f}'.format(ibm4.translation_table['buch']['book']))
1.000
>>> print('{0:.3f}'.format(ibm4.translation_table['das']['book']))
0.000
>>> print('{0:.3f}'.format(ibm4.translation_table['ja'][None]))
1.000

>>> print('{0:.3f}'.format(ibm4.head_distortion_table[1][0][1]))
1.000
>>> print('{0:.3f}'.format(ibm4.head_distortion_table[2][0][1]))
0.000
>>> print('{0:.3f}'.format(ibm4.non_head_distortion_table[3][6]))
0.500

>>> print('{0:.3f}'.format(ibm4.fertility_table[2]['summarize']))
1.000
>>> print('{0:.3f}'.format(ibm4.fertility_table[1]['book']))
1.000

>>> print('{0:.3f}'.format(ibm4.p1))
0.033

>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)])

prob_t_a_given_s(alignment_info)[source]¶
Probability of target sentence and an alignment given the source sentence.
class nltk.align.ibm4.Model4Counts[source]¶
Bases: nltk.align.ibm_model.Counts

Data object to store counts of various parameters during training. Includes counts for distortion.
nltk.align.ibm5 module¶
Translation model that keeps track of vacant positions in the target sentence to decide where to place translated words.
Translation can be viewed as a process where each word in the source sentence is stepped through sequentially, generating translated words for each source word. The target sentence can be viewed as being made up of m empty slots initially, which gradually fill up as generated words are placed in them.
Models 3 and 4 use distortion probabilities to decide how to place translated words. For simplicity, these models ignore the history of which slots have already been occupied with translated words. Consider the placement of the last translated word: there is only one empty slot left in the target sentence, so the distortion probability should be 1.0 for that position and 0.0 everywhere else. However, the distortion probabilities for Models 3 and 4 are set up such that all positions are under consideration.
IBM Model 5 fixes this deficiency by accounting for occupied slots during translation. It introduces the vacancy function v(j), the number of vacancies up to, and including, position j in the target sentence.
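The vacancy function has a direct reading; a sketch (the helper name is hypothetical, not part of the nltk.align.ibm5 API):

```python
def vacancies(occupied_positions, j):
    """v(j): number of vacant target slots up to and including position j,
    given the set of already-occupied positions (1-based)."""
    return sum(1 for k in range(1, j + 1) if k not in occupied_positions)

# With positions 2 and 4 of a 5-word target sentence already filled,
# slots 1, 3 and 5 remain vacant:
v5 = vacancies({2, 4}, 5)   # 3
v3 = vacancies({2, 4}, 3)   # 2 (slots 1 and 3)
```

In the deficiency example above, once every slot but one is occupied, v(j) increases by 1 at exactly one position, which is what forces the last word into the remaining slot.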
Terminology:

Maximum vacancy: The number of valid slots that a word can be placed in. This is not necessarily the same as the number of vacant slots. For example, if a tablet contains more than one word, the head word cannot be placed at the last vacant slot because there will be no space for the other words in the tablet. The number of valid slots has to take into account the length of the tablet. Non-head words cannot be placed before the head word, so vacancies to the left of the head word are ignored.

Vacancy difference: For a head word: (v(j) - v(center of previous cept)). Can be positive or negative. For a non-head word: (v(j) - v(position of previously placed word)). Always positive, because successive words in a tablet are assumed to appear to the right of the previous word.

Positioning of target words falls under three cases:
(1) Words generated by NULL are distributed uniformly.
(2) For a head word t, its position is modeled by the probability v_head(dv | max_v, word_class_t(t)).
(3) For a non-head word t, its position is modeled by the probability v_non_head(dv | max_v, word_class_t(t)).

dv and max_v are defined differently for head and non-head words.
The EM algorithm used in Model 5 is:

E step - In the training data, collect counts, weighted by prior probabilities.
(a) count how many times a source language word is translated into a target language word
(b) for a particular word class and maximum vacancy, count how many times a head word and the previous cept’s center have a particular difference in number of vacancies
(c) for a particular word class and maximum vacancy, count how many times a non-head word and the previous target word have a particular difference in number of vacancies
(d) count how many times a source word is aligned to phi number of target words
(e) count how many times NULL is aligned to a target word

M step - Estimate new probabilities based on the counts from the E step.
Like Model 4, there are too many possible alignments to consider. Thus, a hill climbing approach is used to sample good candidates. In addition, pruning is used to weed out unlikely alignments based on Model 4 scores.
Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
l: Number of words in the source sentence, excluding NULL
m: Number of words in the target sentence
s: A word in the source language
t: A word in the target language
phi: Fertility, the number of target words produced by a source word
p1: Probability that a target word produced by a source word is accompanied by another target word that is aligned to NULL
p0: 1 - p1
max_v: Maximum vacancy
dv: Vacancy difference, Δv
The definition of v_head here differs from GIZA++, section 4.7 of [Brown et al., 1993], and [Koehn, 2010]. In the latter cases, v_head is v_head(v(j) | v(center of previous cept),max_v,word_class(t)).
Here, we follow appendix B of [Brown et al., 1993] and combine v(j) with v(center of previous cept) to obtain dv: v_head(v(j) - v(center of previous cept) | max_v,word_class(t)).
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.
-
class
nltk.align.ibm5.
IBMModel5
(sentence_aligned_corpus, iterations, source_word_classes, target_word_classes, probability_tables=None)[source]¶ Bases:
nltk.align.ibm_model.IBMModel
Translation model that keeps track of vacant positions in the target sentence to decide where to place translated words
>>> bitext = [] >>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small'])) >>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big'])) >>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small'])) >>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small'])) >>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house'])) >>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book'])) >>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book'])) >>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book'])) >>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize'])) >>> src_classes = {'the': 0, 'a': 0, 'small': 1, 'big': 1, 'house': 2, 'book': 2, 'is': 3, 'i': 4, 'summarize': 5 } >>> trg_classes = {'das': 0, 'ein': 0, 'haus': 1, 'buch': 1, 'klein': 2, 'groß': 2, 'ist': 3, 'ja': 4, 'ich': 5, 'fasse': 6, 'zusammen': 6 }
>>> ibm5 = IBMModel5(bitext, 5, src_classes, trg_classes)
>>> print('{0:.3f}'.format(ibm5.head_vacancy_table[1][1][1]))
1.000
>>> print('{0:.3f}'.format(ibm5.head_vacancy_table[2][1][1]))
0.000
>>> print('{0:.3f}'.format(ibm5.non_head_vacancy_table[3][3][6]))
1.000
>>> print('{0:.3f}'.format(ibm5.fertility_table[2]['summarize']))
1.000
>>> print('{0:.3f}'.format(ibm5.fertility_table[1]['book']))
1.000
>>> print('{0:.3f}'.format(ibm5.p1)) 0.033
>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)])
-
MIN_SCORE_FACTOR
= 0.2¶ Alignments with scores below this factor are pruned during sampling
-
hillclimb
(alignment_info, j_pegged=None)[source]¶ Starting from the alignment in
alignment_info
, look at neighboring alignments iteratively for the best one, according to Model 4.

Note that Model 4 scoring is used instead of Model 5 because the latter is too expensive to compute.
There is no guarantee that the best alignment in the alignment space will be found, because the algorithm might be stuck in a local maximum.
Parameters: j_pegged (int) – If specified, the search will be constrained to alignments where j_pegged
remains unchanged. Returns: The best alignment found from hill climbing. Return type: AlignmentInfo
-
prob_t_a_given_s
(alignment_info)[source]¶ Probability of target sentence and an alignment given the source sentence
-
prune
(alignment_infos)[source]¶ Removes alignments from
alignment_infos
that have substantially lower Model 4 scores than the best alignment. Returns: Pruned alignments. Return type: set(AlignmentInfo)
-
sample
(sentence_pair)[source]¶ Sample the most probable alignments from the entire alignment space according to Model 4
Note that Model 4 scoring is used instead of Model 5 because the latter is too expensive to compute.
First, determine the best alignment according to IBM Model 2. With this initial alignment, use hill climbing to determine the best alignment according to IBM Model 4. Add this alignment and its neighbors to the sample set. Repeat this process with other initial alignments obtained by pegging an alignment point. Finally, prune alignments that have substantially lower Model 4 scores than the best alignment.
Parameters: sentence_pair (AlignedSent) – Source and target language sentence pair to generate a sample of alignments from Returns: A set of best alignments represented by their AlignmentInfo
and the best alignment of the set for convenience. Return type: set(AlignmentInfo), AlignmentInfo
-
-
class
nltk.align.ibm5.
Model5Counts
[source]¶ Bases:
nltk.align.ibm_model.Counts
Data object to store counts of various parameters during training. Includes counts for vacancies.
-
update_vacancy
(count, alignment_info, i, trg_classes, slots)[source]¶ Parameters: - count – Value to add to the vacancy counts
- alignment_info – Alignment under consideration
- i – Source word position under consideration
- trg_classes – Target word classes
- slots – Vacancy states of the slots in the target sentence. Output parameter that will be modified as new words are placed in the target sentence.
-
nltk.align.ibm_model module¶
Common methods and classes for all IBM models. See IBMModel1
,
IBMModel2
, IBMModel3
, IBMModel4
, and IBMModel5
for specific implementations.
The IBM models are a series of generative models that learn lexical translation probabilities, p(target language word|source language word), given a sentence-aligned parallel corpus.
The models increase in sophistication from model 1 to 5. Typically, the output of lower models is used to seed the higher models. All models use the Expectation-Maximization (EM) algorithm to learn various probability tables.
Words in a sentence are one-indexed. The first word of a sentence has position 1, not 0. Index 0 is reserved in the source sentence for the NULL token. The concept of position does not apply to NULL, but it is indexed at 0 by convention.
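As a concrete illustration of this indexing convention, the sketch below builds a sentence pair by hand. This is standalone Python with hypothetical variable names, not the nltk data structures:

```python
# Indexing convention used by the IBM models (standalone illustration):
# source index 0 is reserved for the NULL token, and words are one-indexed.
src_sentence = (None, 'das', 'Haus')       # None stands in for NULL at index 0
trg_sentence = ('DUMMY', 'the', 'house')   # dummy element so words start at index 1

# alignment[j] is the source position aligned to target position j;
# a target word generated by NULL would map to source position 0.
alignment = (0, 1, 2)                      # alignment[0] is unused (dummy slot)

for j in range(1, len(trg_sentence)):
    i = alignment[j]
    print(trg_sentence[j], '->', src_sentence[i] if i != 0 else 'NULL')
```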
Each target word is aligned to exactly one source word or the NULL token.
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311.
-
class
nltk.align.ibm_model.
AlignmentInfo
(alignment, src_sentence, trg_sentence, cepts)[source]¶ Bases:
object
Helper data object for training IBM Models 3 and up
Read-only. For a source sentence and its counterpart in the target language, this class holds information about the sentence pair’s alignment, cepts, and fertility.
Warning: Alignments are one-indexed here, in contrast to nltk.align.Alignment and nltk.align.AlignedSent, which are zero-indexed. This class is not meant to be used outside of IBM models.
-
alignment
= None¶ tuple(int): Alignment function.
alignment[j]
is the position in the source sentence that is aligned to the position j in the target sentence.
-
center_of_cept
(i)[source]¶ Returns: The ceiling of the average positions of the words in the tablet of cept i
, or 0 ifi
is None
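The return value described above (the ceiling of the average tablet positions, or 0 for NULL) can be sketched as a standalone helper. The function name and the list-of-positions argument are assumptions for illustration, not the nltk method:

```python
import math

def center_of_cept(tablet):
    """Ceiling of the average of the target positions in a cept's tablet.

    tablet: list of one-indexed target positions aligned to a source word,
    or None for the NULL cept, which has no position.
    """
    if tablet is None:
        return 0
    return int(math.ceil(sum(tablet) / len(tablet)))

# e.g. a tablet covering target positions 2, 3 and 7:
print(center_of_cept([2, 3, 7]))  # ceil(12 / 3) = 4
```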
-
cepts
= None¶ list(list(int)): The positions of the target words, in ascending order, aligned to a source word position. For example, cepts[4] = [2, 3, 7] means that words in positions 2, 3 and 7 of the target sentence are aligned to the word in position 4 of the source sentence
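A sketch of how such a cepts table could be derived from the alignment function described above. The helper name is hypothetical and the code is independent of the nltk classes:

```python
def compute_cepts(alignment, src_length):
    """Group target positions by the source position they are aligned to.

    alignment: tuple where alignment[j] is the source position aligned to
    target position j (one-indexed; index 0 is a dummy slot).
    Returns a list where entry i holds, in ascending order, the target
    positions aligned to source position i (entry 0 collects NULL alignments).
    """
    cepts = [[] for _ in range(src_length + 1)]
    for j in range(1, len(alignment)):
        cepts[alignment[j]].append(j)  # j increases, so positions stay sorted
    return cepts

# Target positions 2, 3 and 7 aligned to source position 4, rest to NULL:
alignment = (0, 0, 4, 4, 0, 0, 0, 4)
print(compute_cepts(alignment, 4)[4])  # [2, 3, 7]
```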
-
is_head_word
(j)[source]¶ Returns: Whether the word in position j
of the target sentence is a head word
-
previous_in_tablet
(j)[source]¶ Returns: The position of the previous word that is in the same tablet as j
, or None ifj
is the first word of the tablet
-
score
= None¶ float: Optional. Probability of alignment, as defined by the IBM model that assesses this alignment
-
src_sentence
= None¶ tuple(str): Source sentence referred to by this object. Should include the NULL token (None) at index 0.
-
trg_sentence
= None¶ tuple(str): Target sentence referred to by this object. Should have a dummy element at index 0 so that the first word starts at index 1.
-
-
class
nltk.align.ibm_model.
Counts
[source]¶ Bases:
object
Data object to store counts of various parameters during training
-
class
nltk.align.ibm_model.
IBMModel
(sentence_aligned_corpus)[source]¶ Bases:
object
Abstract base class for all IBM models
-
MIN_PROB
= 1e-12¶
-
best_model2_alignment
(sentence_pair, j_pegged=None, i_pegged=0)[source]¶ Finds the best alignment according to IBM Model 2
Used as a starting point for hill climbing in Models 3 and above, because it is easier to compute than the best alignments in higher models
Parameters: - sentence_pair (AlignedSent) – Source and target language sentence pair to be word-aligned
- j_pegged (int) – If specified, the alignment point of j_pegged will be fixed to i_pegged
- i_pegged (int) – Alignment point to j_pegged
-
hillclimb
(alignment_info, j_pegged=None)[source]¶ Starting from the alignment in
alignment_info
, look at neighboring alignments iteratively for the best oneThere is no guarantee that the best alignment in the alignment space will be found, because the algorithm might be stuck in a local maximum.
Parameters: j_pegged (int) – If specified, the search will be constrained to alignments where j_pegged
remains unchanged. Returns: The best alignment found from hill climbing. Return type: AlignmentInfo
-
neighboring
(alignment_info, j_pegged=None)[source]¶ Determine the neighbors of
alignment_info
, obtained by moving or swapping one alignment point.

Parameters: j_pegged (int) – If specified, neighbors that have a different alignment point from j_pegged will not be considered Returns: A set of neighboring alignments represented by their AlignmentInfo
Return type: set(AlignmentInfo)
-
prob_t_a_given_s
(alignment_info)[source]¶ Probability of target sentence and an alignment given the source sentence
All required information is assumed to be in
alignment_info
and self.

Derived classes should override this method.
-
sample
(sentence_pair)[source]¶ Sample the most probable alignments from the entire alignment space
First, determine the best alignment according to IBM Model 2. With this initial alignment, use hill climbing to determine the best alignment according to a higher IBM Model. Add this alignment and its neighbors to the sample set. Repeat this process with other initial alignments obtained by pegging an alignment point.
Hill climbing may get stuck in a local maximum; hence the pegging of alignment points and the trying out of different initial alignments.
Parameters: sentence_pair (AlignedSent) – Source and target language sentence pair to generate a sample of alignments from Returns: A set of best alignments represented by their AlignmentInfo
and the best alignment of the set for convenience. Return type: set(AlignmentInfo), AlignmentInfo
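The move/swap neighborhood and the greedy climb described above can be sketched as follows. This is a simplified, standalone illustration with a toy scoring function; it ignores NULL handling, fertility, and the dummy index 0, and all names are hypothetical:

```python
import itertools

def neighbors(alignment, src_length, j_pegged=None):
    """Alignments reachable by moving or swapping one alignment point.

    alignment: tuple where alignment[j] is the source position for
    target position j. If j_pegged is given, that point never changes.
    """
    result = set()
    positions = [j for j in range(len(alignment)) if j != j_pegged]
    for j in positions:                                   # moves
        for i in range(src_length + 1):
            moved = list(alignment)
            moved[j] = i
            result.add(tuple(moved))
    for j1, j2 in itertools.combinations(positions, 2):   # swaps
        swapped = list(alignment)
        swapped[j1], swapped[j2] = swapped[j2], swapped[j1]
        result.add(tuple(swapped))
    return result

def hillclimb(alignment, score, src_length, j_pegged=None):
    """Greedy ascent: repeat until no neighbor scores higher."""
    best = alignment
    while True:
        improved = max(neighbors(best, src_length, j_pegged), key=score)
        if score(improved) <= score(best):
            return best
        best = improved

# Toy score that prefers the diagonal alignment (position j aligned to j):
score = lambda a: -sum(abs(a[j] - j) for j in range(len(a)))
print(hillclimb((0, 0, 0), score, src_length=2))  # climbs to (0, 1, 2)
```

In the real models the score would be prob_t_a_given_s and the climb would start from the best Model 2 alignment.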
-
nltk.align.phrase_based module¶
-
nltk.align.phrase_based.
extract
(f_start, f_end, e_start, e_end, alignment, e_aligned, f_aligned, srctext, trgtext, srclen, trglen)[source]¶ This function checks a candidate phrase pair for alignment point consistency and extracts the consistent phrase pairs.
A phrase pair (ē, f̄) is consistent with an alignment A if and only if:

No English words in the phrase pair are aligned to words outside it:
∀ e_i ∈ ē: (e_i, f_j) ∈ A ⇒ f_j ∈ f̄

No Foreign words in the phrase pair are aligned to words outside it:
∀ f_j ∈ f̄: (e_i, f_j) ∈ A ⇒ e_i ∈ ē

The phrase pair contains at least one alignment point:
∃ e_i ∈ ē, f_j ∈ f̄ s.t. (e_i, f_j) ∈ A
Parameters: - f_start (int) – Starting index of the possible foreign language phrases
- f_end (int) – Ending index of the possible foreign language phrases
- e_start (int) – Starting index of the possible source language phrases
- e_end (int) – Ending index of the possible source language phrases
- srctext (list) – The source language tokens, a list of strings.
- trgtext (list) – The target language tokens, a list of strings.
- srclen (int) – The number of tokens in the source language sentence.
- trglen (int) – The number of tokens in the target language sentence.
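The three consistency conditions can be sketched as a standalone check over zero-indexed (e, f) alignment points. The helper name and the half-open index ranges are assumptions for illustration, not the signature of extract:

```python
def is_consistent(e_start, e_end, f_start, f_end, alignment):
    """Check whether the phrase pair ([e_start, e_end), [f_start, f_end))
    is consistent with the word alignment, a list of zero-indexed
    (e_index, f_index) pairs."""
    contains_point = False
    for e_i, f_j in alignment:
        e_in = e_start <= e_i < e_end
        f_in = f_start <= f_j < f_end
        if e_in != f_in:         # an alignment point crosses the boundary
            return False
        if e_in and f_in:
            contains_point = True
    return contains_point        # at least one alignment point required

alignment = [(0, 0), (1, 1), (1, 2)]
print(is_consistent(0, 1, 0, 1, alignment))  # True: (0, 0) is inside
print(is_consistent(0, 2, 0, 2, alignment))  # False: (1, 2) leaks out
```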
-
nltk.align.phrase_based.
phrase_extraction
(srctext, trgtext, alignment)[source]¶ Phrase extraction algorithm extracts all consistent phrase pairs from a word-aligned sentence pair.
The idea is to loop over all possible source language (e) phrases and find the minimal foreign phrase (f) that matches each of them. Matching is done by identifying all alignment points for the source phrase and finding the shortest foreign phrase that includes all the foreign counterparts for the source words.
In short, a phrase alignment has to (a) contain all alignment points for all covered words and (b) contain at least one alignment point.
>>> srctext = "michael assumes that he will stay in the house"
>>> trgtext = "michael geht davon aus , dass er im haus bleibt"
>>> alignment = [(0,0), (1,1), (1,2), (1,3), (2,5), (3,6), (4,9),
...     (5,9), (6,7), (7,7), (8,8)]
>>> phrases = phrase_extraction(srctext, trgtext, alignment)
>>> for i in sorted(phrases):
...     print(i)
...
((0, 1), (0, 1), 'michael', 'michael')
((0, 2), (0, 4), 'michael assumes', 'michael geht davon aus')
((0, 2), (0, 4), 'michael assumes', 'michael geht davon aus ,')
((0, 3), (0, 6), 'michael assumes that', 'michael geht davon aus , dass')
((0, 4), (0, 7), 'michael assumes that he', 'michael geht davon aus , dass er')
((0, 9), (0, 10), 'michael assumes that he will stay in the house', 'michael geht davon aus , dass er im haus bleibt')
((1, 2), (1, 4), 'assumes', 'geht davon aus')
((1, 2), (1, 4), 'assumes', 'geht davon aus ,')
((1, 3), (1, 6), 'assumes that', 'geht davon aus , dass')
((1, 4), (1, 7), 'assumes that he', 'geht davon aus , dass er')
((1, 9), (1, 10), 'assumes that he will stay in the house', 'geht davon aus , dass er im haus bleibt')
((2, 3), (5, 6), 'that', ', dass')
((2, 3), (5, 6), 'that', 'dass')
((2, 4), (5, 7), 'that he', ', dass er')
((2, 4), (5, 7), 'that he', 'dass er')
((2, 9), (5, 10), 'that he will stay in the house', ', dass er im haus bleibt')
((2, 9), (5, 10), 'that he will stay in the house', 'dass er im haus bleibt')
((3, 4), (6, 7), 'he', 'er')
((3, 9), (6, 10), 'he will stay in the house', 'er im haus bleibt')
((4, 6), (9, 10), 'will stay', 'bleibt')
((4, 9), (7, 10), 'will stay in the house', 'im haus bleibt')
((6, 8), (7, 8), 'in the', 'im')
((6, 9), (7, 9), 'in the house', 'im haus')
((8, 9), (8, 9), 'house', 'haus')
Parameters: alignment – The word-level alignments as a list of tuples, where the first element of each tuple is the index of a source word and the second element is the index of the target word it is aligned to. This is also the output format of nltk/align/ibm1.py
Return type: list(tuple) Returns: A list of tuples, where each element is a phrase and each phrase is a tuple made up of (i) its source location, (ii) its target location, (iii) the source phrase, and (iv) the target phrase. The list represents all the possible phrases extracted from the word alignments.
nltk.align.util module¶
Module contents¶
Experimental functionality for bitext alignment. These interfaces are prone to change.