nltk.translate.stack_decoder module

A decoder that uses stacks to implement phrase-based translation.

In phrase-based translation, the source sentence is segmented into phrases of one or more words, and translations for those phrases are used to build the target sentence.

Hypothesis data structures are used to keep track of the source words translated so far and the partial output. A hypothesis can be expanded by selecting an untranslated phrase, looking up its translation in a phrase table, and appending that translation to the partial output. Translation is complete when a hypothesis covers all source words.
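
The bookkeeping can be pictured with a toy hypothesis structure (illustrative only; the decoder's internal _Hypothesis class keeps similar but richer state):

>>> from collections import namedtuple
>>> # A hypothesis records which source positions are covered, the partial
>>> # output produced so far, and an accumulated log score.
>>> Hypothesis = namedtuple('Hypothesis', 'covered output score')
>>> def expand(hyp, src_span, trg_phrase, added_log_score):
...     start, end = src_span
...     return Hypothesis(hyp.covered | set(range(start, end)),
...                       hyp.output + list(trg_phrase),
...                       hyp.score + added_log_score)
>>> empty = Hypothesis(set(), [], 0.0)
>>> h = expand(expand(empty, (0, 1), ('nobody',), -0.2), (1, 2), ('expects',), -0.3)
>>> sorted(h.covered), h.output
([0, 1], ['nobody', 'expects'])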

The search space is huge because the source sentence can be segmented in different ways, the source phrases can be selected in any order, and there could be multiple translations for the same source phrase in the phrase table. To make decoding tractable, stacks are used to limit the number of candidate hypotheses by doing histogram and/or threshold pruning.

Hypotheses with the same number of words translated are placed in the same stack. In histogram pruning, each stack has a size limit, and the hypothesis with the lowest score is removed when the stack is full. In threshold pruning, hypotheses that score below a certain threshold of the best hypothesis in that stack are removed.
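
Both pruning strategies can be sketched on a plain list of (log score, hypothesis) pairs (a simplification of the decoder's internal stack; because scores are log probabilities, a threshold that is a factor of the best probability becomes an additive offset of log(beam_threshold)):

>>> from math import log
>>> def prune(stack, stack_size, beam_threshold):
...     # Histogram pruning: keep only the stack_size best-scoring hypotheses.
...     kept = sorted(stack, reverse=True)[:stack_size]
...     # Threshold pruning: drop hypotheses scoring below beam_threshold
...     # times the probability of the best remaining hypothesis.
...     cutoff = kept[0][0] + log(beam_threshold)
...     return [h for h in kept if h[0] >= cutoff]
>>> stack = [(-1.2, 'h1'), (-3.5, 'h2'), (-9.0, 'h3'), (-1.4, 'h4')]
>>> prune(stack, stack_size=3, beam_threshold=0.2)
[(-1.2, 'h1'), (-1.4, 'h4')]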

Hypothesis scoring can include various factors such as phrase translation probability, language model probability, length of translation, cost of remaining words to be translated, and so on.
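
Because these factors are kept as log probabilities (or log-domain penalties), a combined hypothesis score is simply their sum. For instance (the factor names and values below are purely illustrative, not the decoder's exact formula):

>>> from math import log
>>> phrase_log_prob = log(0.8)     # phrase translation probability
>>> lm_log_prob = log(0.5)         # language model probability of the new words
>>> distortion_penalty = -0.5      # cost of reordering the source phrases
>>> future_cost = -2.0             # estimated cost of the untranslated words
>>> round(phrase_log_prob + lm_log_prob + distortion_penalty + future_cost, 4)
-3.4163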

References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.

class nltk.translate.stack_decoder.StackDecoder[source]

Bases: object

Phrase-based stack decoder for machine translation

>>> from math import log
>>> from nltk.translate import PhraseTable, StackDecoder
>>> phrase_table = PhraseTable()
>>> phrase_table.add(('niemand',), ('nobody',), log(0.8))
>>> phrase_table.add(('niemand',), ('no', 'one'), log(0.2))
>>> phrase_table.add(('erwartet',), ('expects',), log(0.8))
>>> phrase_table.add(('erwartet',), ('expecting',), log(0.2))
>>> phrase_table.add(('niemand', 'erwartet'), ('one', 'does', 'not', 'expect'), log(0.1))
>>> phrase_table.add(('die', 'spanische', 'inquisition'), ('the', 'spanish', 'inquisition'), log(0.8))
>>> phrase_table.add(('!',), ('!',), log(0.8))
>>> #  nltk.model should be used here once it is implemented
>>> from collections import defaultdict
>>> language_prob = defaultdict(lambda: -999.0)
>>> language_prob[('nobody',)] = log(0.5)
>>> language_prob[('expects',)] = log(0.4)
>>> language_prob[('the', 'spanish', 'inquisition')] = log(0.2)
>>> language_prob[('!',)] = log(0.1)
>>> language_model = type('', (object,), {
...     'probability_change': lambda self, context, phrase: language_prob[phrase],
...     'probability': lambda self, phrase: language_prob[phrase]})()
>>> stack_decoder = StackDecoder(phrase_table, language_model)
>>> stack_decoder.translate(['niemand', 'erwartet', 'die', 'spanische', 'inquisition', '!'])
['nobody', 'expects', 'the', 'spanish', 'inquisition', '!']
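
The pruning and scoring behaviour can be tuned through the attributes documented below; a brief usage sketch (the values here are illustrative):

>>> stack_decoder.stack_size = 50         # histogram pruning limit
>>> stack_decoder.beam_threshold = 0.001  # threshold pruning factor
>>> stack_decoder.word_penalty = 0.0      # no length penalty
>>> stack_decoder.distortion_factor       # default amount of reordering
0.5
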
__init__(phrase_table, language_model)[source]
Parameters:
  • phrase_table (PhraseTable) – Table of translations for source language phrases and the log probabilities for those translations.

  • language_model (object) – Target language model. Must define a probability_change method that calculates the change in log probability of a sentence if a given string is appended to it (see the sketch below). This interface is experimental and will likely be replaced with nltk.model once it is implemented.
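
The same interface can also be provided by an ordinary class rather than the type() one-liner used in the doctest above. A minimal sketch follows; StubLanguageModel and its probabilities are illustrative and not part of NLTK (a probability method is included as well, since the doctest's stub defines one):

>>> from math import log
>>> from collections import defaultdict
>>> class StubLanguageModel:
...     """Illustrative stand-in for a target language model."""
...     def __init__(self, log_probs, default=-999.0):
...         self.log_probs = defaultdict(lambda: default, log_probs)
...     def probability_change(self, context, phrase):
...         # Change in log probability when phrase is appended to the
...         # sentence; a real model would condition on context.
...         return self.log_probs[phrase]
...     def probability(self, phrase):
...         return self.log_probs[phrase]
>>> lm = StubLanguageModel({('nobody',): log(0.5), ('expects',): log(0.4)})
>>> round(lm.probability_change(None, ('nobody',)), 4)
-0.6931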

beam_threshold
float: Hypotheses that score below this factor of the best hypothesis in a stack are dropped from consideration. Value between 0.0 and 1.0.

compute_future_scores(src_sentence)[source]

Determines the approximate scores for translating every subsequence in src_sentence

Future scores can be used as a look-ahead to determine the difficulty of translating the remaining parts of src_sentence.

Returns:

Scores of subsequences referenced by their start and end positions. For example, result[2][5] is the score of the subsequence covering positions 2, 3, and 4.

Return type:

dict(int: dict(int: float))
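
For example, assuming the stack_decoder built in the class doctest above, the returned table could be read as follows (a usage sketch; the actual values depend on the phrase table and language model, so none are shown):

>>> sentence = ('niemand', 'erwartet', 'die', 'spanische', 'inquisition', '!')
>>> scores = stack_decoder.compute_future_scores(sentence)
>>> die_spanische_inquisition = scores[2][5]  # covers positions 2, 3 and 4
>>> niemand_erwartet = scores[0][2]           # covers positions 0 and 1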

property distortion_factor
float: Amount of reordering of source phrases.

Lower values favour monotone translation, suitable when word order is similar for both source and target languages. Value between 0.0 and 1.0. Default 0.5.

distortion_score(hypothesis, next_src_phrase_span)[source]
expansion_score(hypothesis, translation_option, src_phrase_span)[source]

Calculate the score of expanding hypothesis with translation_option

Parameters:
  • hypothesis (_Hypothesis) – Hypothesis being expanded

  • translation_option (PhraseTableEntry) – Information about the proposed expansion

  • src_phrase_span (tuple(int, int)) – Word position span of the source phrase

find_all_src_phrases(src_sentence)[source]

Finds all subsequences in src_sentence that have a phrase translation in the translation table

Returns:

Subsequences that have a phrase translation, represented as a table of lists of end positions. For example, if result[2] is [5, 6, 9], then there are three phrases starting from position 2 in src_sentence, ending at positions 5, 6, and 9 exclusive. The list of ending positions is in ascending order.

Return type:

list(list(int))
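
Continuing the class doctest above, with that phrase table the result would look like this (a sketch based on the documented return format):

>>> sentence = ('niemand', 'erwartet', 'die', 'spanische', 'inquisition', '!')
>>> spans = stack_decoder.find_all_src_phrases(sentence)
>>> spans[0]  # 'niemand' ends at position 1, 'niemand erwartet' at position 2
[1, 2]
>>> spans[2]  # only 'die spanische inquisition', ending at position 5
[5]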

future_score(hypothesis, future_score_table, sentence_length)[source]

Determines the approximate score for translating the untranslated words in hypothesis

stack_size
int: Maximum number of hypotheses to consider in a stack.

Higher values increase the likelihood of a good translation, but increase processing time.

translate(src_sentence)[source]
Parameters:

src_sentence (list(str)) – Sentence to be translated

Returns:

Translated sentence

Return type:

list(str)

static valid_phrases(all_phrases_from, hypothesis)[source]

Extract phrases from all_phrases_from that contain words that have not been translated by hypothesis

Parameters:

all_phrases_from (list(list(int))) – Phrases represented by their spans, in the same format as the return value of find_all_src_phrases

Returns:

A list of phrases, represented by their spans, that cover untranslated positions.

Return type:

list(tuple(int, int))

word_penalty
float: Influences the translation length exponentially.

If positive, shorter translations are preferred. If negative, longer translations are preferred. If zero, no penalty is applied. Because hypothesis scores are log probabilities, a fixed per-word penalty scales the probability by a constant factor for every output word, hence the exponential effect on translation length.