nltk.lm package

Module contents

NLTK Language Modeling Module.

Currently this module covers only ngram language models, but it should be easy to extend to neural models.

Preparing Data

Before we train our ngram models it is necessary to make sure the data we put in them is in the right format. Let’s say we have a text that is a list of sentences, where each sentence is a list of strings. For simplicity we just consider a text consisting of characters instead of words.

>>> text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

If we want to train a bigram model, we need to turn this text into bigrams. Here’s what the first sentence of our text would look like if we use a function from NLTK for this.

>>> from nltk.util import bigrams
>>> list(bigrams(text[0]))
[('a', 'b'), ('b', 'c')]

Notice how “b” occurs as both the first and second member of different bigrams, but “a” and “c” don’t? Wouldn’t it be nice to somehow indicate how often sentences start with “a” and end with “c”? A standard way to deal with this is to add special “padding” symbols to the sentence before splitting it into ngrams. Fortunately, NLTK also has a function for that; let’s see what it does to the first sentence.

>>> from nltk.util import pad_sequence
>>> list(pad_sequence(text[0],
... pad_left=True,
... left_pad_symbol="<s>",
... pad_right=True,
... right_pad_symbol="</s>",
... n=2))
['<s>', 'a', 'b', 'c', '</s>']

Note the n argument: it tells the function we need padding for bigrams. Now, passing all these parameters every time is tedious, and in most cases they can safely be left at their defaults anyway. Thus our module provides a convenience function that has all these arguments already set, while the other arguments remain the same as for pad_sequence.

>>> from nltk.lm.preprocessing import pad_both_ends
>>> list(pad_both_ends(text[0], n=2))
['<s>', 'a', 'b', 'c', '</s>']

Combining the two parts discussed so far we get the following preparation steps for one sentence.

>>> list(bigrams(pad_both_ends(text[0], n=2)))
[('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>')]

To make our model more robust we could also train it on unigrams (single words) as well as bigrams, its main source of information. NLTK once again helpfully provides a function called everygrams. While not the most efficient, it is conceptually simple.

>>> from nltk.util import everygrams
>>> padded_bigrams = list(pad_both_ends(text[0], n=2))
>>> list(everygrams(padded_bigrams, max_len=2))
[('<s>',), ('<s>', 'a'), ('a',), ('a', 'b'), ('b',), ('b', 'c'), ('c',), ('c', '</s>'), ('</s>',)]

We are almost ready to start counting ngrams; just one more step is left. During training and evaluation our model will rely on a vocabulary that defines which words are “known” to the model. To create this vocabulary we need to pad our sentences (just like for counting ngrams) and then combine the sentences into one flat stream of words.

>>> from nltk.lm.preprocessing import flatten
>>> list(flatten(pad_both_ends(sent, n=2) for sent in text))
['<s>', 'a', 'b', 'c', '</s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>']

In most cases we want to use the same text as the source for both vocabulary and ngram counts. Now that we understand what this means for our preprocessing, we can simply import a function that does everything for us.

>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train, vocab = padded_everygram_pipeline(2, text)

So as to avoid re-creating the text in memory, both train and vocab are lazy iterators. They are evaluated on demand at training time.
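
For instance, since the iterators can only be consumed once, we can build a throwaway copy of the pipeline to peek at the flattened vocabulary stream (the names peek_train and peek_vocab are only for illustration).

>>> peek_train, peek_vocab = padded_everygram_pipeline(2, text)
>>> list(peek_vocab)
['<s>', 'a', 'b', 'c', '</s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>']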

Training

Having prepared our data we are ready to start training a model. As a simple example, let us train a Maximum Likelihood Estimator (MLE). We only need to specify the highest ngram order to instantiate it.

>>> from nltk.lm import MLE
>>> lm = MLE(2)

This automatically creates an empty vocabulary…

>>> len(lm.vocab)
0

… which gets filled as we fit the model.

>>> lm.fit(train, vocab)
>>> print(lm.vocab)
<Vocabulary with cutoff=1 unk_label='<UNK>' and 9 items>
>>> len(lm.vocab)
9

The vocabulary helps us handle words that have not occurred during training.

>>> lm.vocab.lookup(text[0])
('a', 'b', 'c')
>>> lm.vocab.lookup(["aliens", "from", "Mars"])
('<UNK>', '<UNK>', '<UNK>')

Moreover, in some cases we want to ignore words that we did see during training but that didn’t occur frequently enough to provide us with useful information. You can tell the vocabulary to ignore such words, as in the sketch below. To find out how that works, check out the docs for the Vocabulary class.
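
For example, we can rebuild the lazy pipeline (the previous one has already been consumed) and pass a Vocabulary with unk_cutoff=2, so that words seen only once are treated as unknown. This is just a sketch; the names train2, words2 and lm2 are only for illustration.

>>> from nltk.lm import Vocabulary
>>> train2, words2 = padded_everygram_pipeline(2, text)
>>> lm2 = MLE(2, vocabulary=Vocabulary(unk_cutoff=2))
>>> lm2.fit(train2, words2)
>>> lm2.vocab.lookup(text[0])
('a', '<UNK>', 'c')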

Using a Trained Model

When it comes to ngram models the training boils down to counting up the ngrams from the training corpus.

>>> print(lm.counts)
<NgramCounter with 2 ngram orders and 24 ngrams>

This provides a convenient interface to access counts for unigrams…

>>> lm.counts['a']
2

…and bigrams (in this case “a b”)

>>> lm.counts[['a']]['b']
1

And so on. However, the real purpose of training a language model is to have it score how probable words are in certain contexts. This being MLE, the model returns the item’s relative frequency as its score.

>>> lm.score("a")
0.15384615384615385
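
We can check this by hand: “a” occurs 2 times among the 13 padded unigram tokens, and 2/13 is exactly the score above.

>>> lm.counts['a'] / lm.counts[1].N()  # 2 occurrences of "a" out of 13 unigram tokens
0.15384615384615385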

Items that are not seen during training are mapped to the vocabulary’s “unknown label” token. This is “<UNK>” by default.

>>> lm.score("<UNK>") == lm.score("aliens")
True

Here’s how you get the score for a word given some preceding context. For example, we want to know how likely “b” is when the preceding word is “a”.

>>> lm.score("b", ["a"])
0.5
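
Again this is a relative frequency: “a” occurs twice as a bigram context (followed once by “b” and once by “c”), so the count 1 is divided by 2.

>>> lm.counts[['a']]['b'] / sum(lm.counts[['a']].values())
0.5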

To avoid underflow when working with many small score values it makes sense to take their logarithm. For convenience this can be done with the logscore method.

>>> lm.logscore("a")
-2.700439718141092
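
The logarithm is taken base 2, so logscore is simply the base-2 log of score; a quick sanity check:

>>> from math import log2
>>> abs(lm.logscore("a") - log2(lm.score("a"))) < 1e-9
True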

Building on this method, we can also evaluate our model’s cross-entropy and perplexity with respect to sequences of ngrams.

>>> test = [('a', 'b'), ('c', 'd')]
>>> lm.entropy(test)
1.292481250360578
>>> lm.perplexity(test)
2.449489742783178
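
The cross-entropy is the average negative base-2 logscore of the test ngrams, and perplexity is 2 raised to that entropy, so the two values above are consistent:

>>> lm.perplexity(test) == 2 ** lm.entropy(test)
True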

It is advisable to preprocess your test text exactly the same way as you did the training text.
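
For instance, a held-out test text (the sentences below are made up for illustration) would be padded and split into bigrams exactly like the training data before being passed to entropy or perplexity.

>>> test_sents = [['a', 'c'], ['c', 'd']]
>>> test_data = [list(bigrams(pad_both_ends(sent, n=2))) for sent in test_sents]
>>> test_data[0]
[('<s>', 'a'), ('a', 'c'), ('c', '</s>')]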

One cool feature of ngram models is that they can be used to generate text.

>>> lm.generate(1, random_seed=3)
'<s>'
>>> lm.generate(5, random_seed=3)
['<s>', 'a', 'b', 'c', 'd']

Provide random_seed if you want to consistently reproduce the same text, all other things being equal. Here we are using it to test the examples.

You can also condition your generation on some preceding text with the text_seed argument.

>>> lm.generate(5, text_seed=['c'], random_seed=3)
['</s>', 'c', 'd', 'c', 'd']

Note that an ngram model is restricted in how much preceding context it can take into account. For example, a trigram model can only condition its output on 2 preceding words. If you pass in a 4-word context, the first two words will be ignored.
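
Since lm is a bigram model, only the last word of the seed matters here; everything before it is discarded, as a quick check shows.

>>> lm.generate(2, text_seed=['a', 'c'], random_seed=3) == lm.generate(2, text_seed=['c'], random_seed=3)
True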

class nltk.lm.AbsoluteDiscountingInterpolated[source]

Bases: InterpolatedLanguageModel

Interpolated version of smoothing with absolute discount.

__init__(order, discount=0.75, **kwargs)[source]

Creates new LanguageModel.

Parameters
  • vocabulary (nltk.lm.Vocabulary or None) – If provided, this vocabulary will be used instead of creating a new one when training.

  • counter (nltk.lm.NgramCounter or None) – If provided, use this object to count ngrams.

  • ngrams_fn (function or None) – If given, defines how sentences in training text are turned to ngram sequences.

  • pad_fn (function or None) – If given, defines how sentences in training text are padded.

class nltk.lm.KneserNeyInterpolated[source]

Bases: InterpolatedLanguageModel

Interpolated version of Kneser-Ney smoothing.
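
A minimal usage sketch (the tiny corpus and variable names below are made up for illustration; any iterable of tokenized sentences works):

>>> from nltk.lm import KneserNeyInterpolated
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train, vocab = padded_everygram_pipeline(3, [['a', 'b', 'c'], ['a', 'c', 'b']])
>>> kn = KneserNeyInterpolated(order=3, discount=0.1)
>>> kn.fit(train, vocab)
>>> len(kn.vocab)  # 5 distinct tokens (including padding symbols) plus <UNK>
6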

__init__(order, discount=0.1, **kwargs)[source]

Creates new LanguageModel.

Parameters
  • vocabulary (nltk.lm.Vocabulary or None) – If provided, this vocabulary will be used instead of creating a new one when training.

  • counter (nltk.lm.NgramCounter or None) – If provided, use this object to count ngrams.

  • ngrams_fn (function or None) – If given, defines how sentences in training text are turned to ngram sequences.

  • pad_fn (function or None) – If given, defines how sentences in training text are padded.

class nltk.lm.Laplace[source]

Bases: Lidstone

Implements Laplace (add one) smoothing.

Initialization identical to BaseNgramModel because gamma is always 1.
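
A quick check of this (the gamma attribute is described under Lidstone below):

>>> from nltk.lm import Laplace
>>> Laplace(order=2).gamma
1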

__init__(*args, **kwargs)[source]

Creates new LanguageModel.

Parameters
  • vocabulary (nltk.lm.Vocabulary or None) – If provided, this vocabulary will be used instead of creating a new one when training.

  • counter (nltk.lm.NgramCounter or None) – If provided, use this object to count ngrams.

  • ngrams_fn (function or None) – If given, defines how sentences in training text are turned to ngram sequences.

  • pad_fn (function or None) – If given, defines how sentences in training text are padded.

class nltk.lm.Lidstone[source]

Bases: LanguageModel

Provides Lidstone-smoothed scores.

In addition to the initialization arguments from BaseNgramModel, this also requires a number by which to increase the counts: gamma.
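
In other words, each count is increased by gamma and the normalizer grows by gamma times the vocabulary size. A small worked sketch (toy corpus for illustration; the fitted vocabulary has 6 items, counting the padding symbols and <UNK>):

>>> from nltk.lm import Lidstone
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train, vocab = padded_everygram_pipeline(2, [['a', 'b', 'c']])
>>> lid = Lidstone(0.5, 2)
>>> lid.fit(train, vocab)
>>> lid.score("a")  # (1 + 0.5) / (5 + 0.5 * 6)
0.1875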

__init__(gamma, *args, **kwargs)[source]

Creates new LanguageModel.

Parameters
  • vocabulary (nltk.lm.Vocabulary or None) – If provided, this vocabulary will be used instead of creating a new one when training.

  • counter (nltk.lm.NgramCounter or None) – If provided, use this object to count ngrams.

  • ngrams_fn (function or None) – If given, defines how sentences in training text are turned to ngram sequences.

  • pad_fn (function or None) – If given, defines how sentences in training text are padded.

unmasked_score(word, context=None)[source]

Add-one smoothing: Lidstone or Laplace.

To see what kind, look at gamma attribute on the class.

class nltk.lm.MLE[source]

Bases: LanguageModel

Class for providing MLE ngram model scores.

Inherits initialization from BaseNgramModel.

unmasked_score(word, context=None)[source]

Returns the MLE score for a word given a context.

Args:
  • word is expected to be a string

  • context is expected to be something reasonably convertible to a tuple

class nltk.lm.NgramCounter[source]

Bases: object

Class for counting ngrams.

Will count any ngram sequence you give it ;)

First we need to make sure we are feeding the counter sentences of ngrams.

>>> text = [["a", "b", "c", "d"], ["a", "c", "d", "c"]]
>>> from nltk.util import ngrams
>>> text_bigrams = [ngrams(sent, 2) for sent in text]
>>> text_unigrams = [ngrams(sent, 1) for sent in text]

The counting itself is very simple.

>>> from nltk.lm import NgramCounter
>>> ngram_counts = NgramCounter(text_bigrams + text_unigrams)

You can conveniently access ngram counts using standard python dictionary notation. String keys will give you unigram counts.

>>> ngram_counts['a']
2
>>> ngram_counts['aliens']
0

If you want to access counts for higher order ngrams, use a list or a tuple. These are treated as “context” keys, so what you get is a frequency distribution over all continuations after the given context.

>>> sorted(ngram_counts[['a']].items())
[('b', 1), ('c', 1)]
>>> sorted(ngram_counts[('a',)].items())
[('b', 1), ('c', 1)]

This is equivalent to specifying explicitly the order of the ngram (in this case 2 for bigram) and indexing on the context.

>>> ngram_counts[2][('a',)] is ngram_counts[['a']]
True

Note that the keys in ConditionalFreqDist cannot be lists, only tuples! It is generally advisable to use the less verbose and more flexible square bracket notation.

To get the count of the full ngram “a b”, do this:

>>> ngram_counts[['a']]['b']
1

Specifying the ngram order as a number can be useful for accessing all ngrams in that order.

>>> ngram_counts[2]
<ConditionalFreqDist with 4 conditions>

The keys of this ConditionalFreqDist are the contexts we discussed earlier. Unigrams can also be accessed with a human-friendly alias.

>>> ngram_counts.unigrams is ngram_counts[1]
True

Similarly to collections.Counter, you can update counts after initialization.

>>> ngram_counts['e']
0
>>> ngram_counts.update([ngrams(["d", "e", "f"], 1)])
>>> ngram_counts['e']
1

N()[source]

Returns grand total number of ngrams stored.

This includes ngrams from all orders, so some duplication is expected.

Return type

int

>>> from nltk.lm import NgramCounter
>>> counts = NgramCounter([[("a", "b"), ("c",), ("d", "e")]])
>>> counts.N()
3

__init__(ngram_text=None)[source]

Creates a new NgramCounter.

If ngram_text is specified, counts ngrams from it; otherwise waits for the update method to be called explicitly.

Parameters

ngram_text (Iterable(Iterable(tuple(str))) or None) – Optional text containing sentences of ngrams, as for update method.

update(ngram_text)[source]

Updates ngram counts from ngram_text.

Expects ngram_text to be a sequence of sentences (sequences). Each sentence consists of ngrams as tuples of strings.

Parameters

ngram_text (Iterable(Iterable(tuple(str)))) – Text containing sentences of ngrams.

Raises

TypeError – if the ngrams are not tuples.

class nltk.lm.StupidBackoff[source]

Bases: LanguageModel

Provides StupidBackoff scores.

In addition to the initialization arguments from BaseNgramModel, this also requires a parameter alpha with which we scale the lower order probabilities. Note that this is not a true probability distribution, as scores for ngrams of the same order do not sum up to unity.
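
A small sketch of this behaviour (toy corpus for illustration): a seen bigram is scored by its plain relative frequency, while an unseen one backs off to alpha times the unigram frequency.

>>> from nltk.lm import StupidBackoff
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train, vocab = padded_everygram_pipeline(2, [['a', 'b', 'c'], ['a', 'c', 'b']])
>>> sb = StupidBackoff(alpha=0.4, order=2)
>>> sb.fit(train, vocab)
>>> sb.score("c", ["b"])  # "c" follows "b" once out of the 2 times "b" occurs as a context
0.5
>>> round(sb.score("b", ["</s>"]), 2)  # unseen bigram: 0.4 * unigram frequency of "b"
0.08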

__init__(alpha=0.4, *args, **kwargs)[source]

Creates new LanguageModel.

Parameters
  • vocabulary (nltk.lm.Vocabulary or None) – If provided, this vocabulary will be used instead of creating a new one when training.

  • counter (nltk.lm.NgramCounter or None) – If provided, use this object to count ngrams.

  • ngrams_fn (function or None) – If given, defines how sentences in training text are turned to ngram sequences.

  • pad_fn (function or None) – If given, defines how sentences in training text are padded.

unmasked_score(word, context=None)[source]

Score a word given some optional context.

Concrete models are expected to provide an implementation. Note that this method does not mask its arguments with the OOV label. Use the score method for that.

Parameters
  • word (str) – Word for which we want the score

  • context (tuple(str) or None) – Context the word is in. If None, compute unigram score.

Return type

float

class nltk.lm.Vocabulary[source]

Bases: object

Stores language model vocabulary.

Satisfies two common language modeling requirements for a vocabulary:

  • When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.

  • Adds a special “unknown” token which unseen words are mapped to.

>>> words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']
>>> from nltk.lm import Vocabulary
>>> vocab = Vocabulary(words, unk_cutoff=2)

Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary.

>>> vocab['c']
3
>>> 'c' in vocab
True
>>> vocab['d']
2
>>> 'd' in vocab
True

Tokens with frequency counts less than the cutoff value will be considered not part of the vocabulary even though their entries in the count dictionary are preserved.

>>> vocab['b']
1
>>> 'b' in vocab
False
>>> vocab['aliens']
0
>>> 'aliens' in vocab
False

Keeping the count entries for seen words allows us to change the cutoff value without having to recalculate the counts.

>>> vocab2 = Vocabulary(vocab.counts, unk_cutoff=1)
>>> "b" in vocab2
True

The cutoff value influences not only membership checking but also the result of getting the size of the vocabulary using the built-in len. Note that while the number of keys in the vocabulary’s counter stays the same, the items in the vocabulary differ depending on the cutoff. We use sorted to demonstrate because it keeps the order consistent.

>>> sorted(vocab2.counts)
['-', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab2)
['-', '<UNK>', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab.counts)
['-', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab)
['<UNK>', 'a', 'c', 'd']

In addition to items it gets populated with, the vocabulary stores a special token that stands in for so-called “unknown” items. By default it’s “<UNK>”.

>>> "<UNK>" in vocab
True

We can look up words in a vocabulary using its lookup method. “Unseen” words (with counts less than cutoff) are looked up as the unknown label. If given one word (a string) as an input, this method will return a string.

>>> vocab.lookup("a")
'a'
>>> vocab.lookup("aliens")
'<UNK>'

If given a sequence, it will return a tuple of the looked-up words.

>>> vocab.lookup(["p", 'a', 'r', 'd', 'b', 'c'])
('<UNK>', 'a', '<UNK>', 'd', '<UNK>', 'c')

It’s possible to update the counts after the vocabulary has been created. In general, the interface is the same as that of collections.Counter.

>>> vocab['b']
1
>>> vocab.update(["b", "b", "c"])
>>> vocab['b']
3

__init__(counts=None, unk_cutoff=1, unk_label='<UNK>')[source]

Create a new Vocabulary.

Parameters
  • counts – Optional iterable or collections.Counter instance to pre-seed the Vocabulary. If it is an iterable, counts are calculated from it.

  • unk_cutoff (int) – Words that occur less frequently than this value are not considered part of the vocabulary.

  • unk_label – Label for marking words not part of vocabulary.

property cutoff

Cutoff value.

Items with a count below this value are not considered part of the vocabulary.

lookup(words)[source]

Look up one or more words in the vocabulary.

If passed one word as a string, will return that word or self.unk_label. Otherwise will assume it was passed a sequence of words, look each of them up, and return a tuple of the looked-up words.

Parameters

words (Iterable(str) or str) – Word(s) to look up.

Return type

tuple(str) or str

Raises

TypeError for types other than strings or iterables

>>> from nltk.lm import Vocabulary
>>> vocab = Vocabulary(["a", "b", "c", "a", "b"], unk_cutoff=2)
>>> vocab.lookup("a")
'a'
>>> vocab.lookup("aliens")
'<UNK>'
>>> vocab.lookup(["a", "b", "c", ["x", "b"]])
('a', 'b', '<UNK>', ('<UNK>', 'b'))

update(*counter_args, **counter_kwargs)[source]

Update vocabulary counts.

Wraps collections.Counter.update method.

class nltk.lm.WittenBellInterpolated[source]

Bases: InterpolatedLanguageModel

Interpolated version of Witten-Bell smoothing.

__init__(order, **kwargs)[source]

Creates new LanguageModel.

Parameters
  • vocabulary (nltk.lm.Vocabulary or None) – If provided, this vocabulary will be used instead of creating a new one when training.

  • counter (nltk.lm.NgramCounter or None) – If provided, use this object to count ngrams.

  • ngrams_fn (function or None) – If given, defines how sentences in training text are turned to ngram sequences.

  • pad_fn (function or None) – If given, defines how sentences in training text are padded.