nltk.lm.api module

Language Model Interface.

class nltk.lm.api.LanguageModel

Bases: object

ABC for Language Models.

Cannot be directly instantiated itself.

__init__(order, vocabulary=None, counter=None)

Creates new LanguageModel.

Parameters:

order (int) – The n-gram order of the model, e.g. 2 for a bigram model.
vocabulary (nltk.lm.Vocabulary or None) – If provided, this vocabulary will be used instead of creating a new one when training.
counter (nltk.lm.NgramCounter or None) – If provided, use this object to count ngrams.
ngrams_fn (function or None) – If given, defines how sentences in training text are turned to ngram sequences.
pad_fn (function or None) – If given, defines how sentences in training text are padded.

context_counts(context)

Helper method for retrieving counts for a given context.

Assumes context has been checked and OOV words in it masked.

Parameters: context (tuple(str) or None)
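A minimal sketch of what such a lookup could look like, assuming counts are stored per n-gram order and keyed by context tuple. The storage layout and the `toy_context_counts` helper are hypothetical and only illustrate the idea; they are not NLTK's internals.

```python
from collections import Counter, defaultdict

# Hypothetical storage: order -> context tuple -> Counter of following words.
unigrams = Counter()
by_order = defaultdict(lambda: defaultdict(Counter))

sentence = ["a", "b", "a", "c"]
unigrams.update(sentence)
for prev, word in zip(sentence, sentence[1:]):
    by_order[2][(prev,)][word] += 1


def toy_context_counts(context):
    """Return counts of words seen after `context` (unigram counts if empty)."""
    if not context:
        return unigrams
    return by_order[len(context) + 1][tuple(context)]


print(toy_context_counts(("a",)))      # counts of words that followed "a"
print(toy_context_counts(None)["a"])   # unigram count of "a"
```

With an empty or missing context the lookup falls back to unigram counts, which matches the "context or None" parameter type.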

entropy(text_ngrams)

Calculate cross-entropy of model for given evaluation text.

This implementation is based on the Shannon-McMillan-Breiman theorem, as used and referenced by Dan Jurafsky and Jordan Boyd-Graber.

Parameters: text_ngrams (Iterable(tuple(str))) – A sequence of ngram tuples.
Return type: float
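The arithmetic behind this can be sketched as the average negative base-2 log probability that the model assigns to the last word of each ngram. The snippet below is a standalone illustration with made-up probabilities standing in for model scores; `toy_cross_entropy` is a hypothetical helper, not part of NLTK.

```python
import math


def toy_cross_entropy(probabilities):
    """Average negative log2 probability, in bits per ngram."""
    log_scores = [math.log2(p) for p in probabilities]
    return -sum(log_scores) / len(log_scores)


# Hypothetical model probabilities for the final word of each evaluated ngram.
probs = [0.5, 0.25, 0.25, 0.5]
print(toy_cross_entropy(probs))  # 1.5
```

Here the log scores are -1, -2, -2 and -1, so the mean negative log score is 6 / 4 = 1.5 bits.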

fit(text, vocabulary_text=None)

Trains the model on a text.

Parameters: text – Training text as a sequence of sentences.
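Each sentence in the training text is itself a sequence of ngram tuples, as in the generate example on this page. A rough standalone sketch of preparing such input from tokenized sentences, assuming `<s>` and `</s>` as padding symbols; `toy_padded_everygrams` is a hypothetical helper written for illustration (NLTK provides its own utilities in `nltk.lm.preprocessing`, whose output order may differ).

```python
def toy_padded_everygrams(sentence, order=2, left="<s>", right="</s>"):
    """Pad a tokenized sentence and return all ngrams of length 1..order."""
    padded = [left] * (order - 1) + list(sentence) + [right] * (order - 1)
    ngrams = []
    for n in range(1, order + 1):
        for i in range(len(padded) - n + 1):
            ngrams.append(tuple(padded[i:i + n]))
    return ngrams


corpus = [["a", "b"], ["b", "c"]]
training_text = [toy_padded_everygrams(sent) for sent in corpus]
print(training_text[0])
```

Including the lower-order ngrams alongside the bigrams gives a bigram model the unigram counts it needs when no context is available.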

generate(num_words=1, text_seed=None, random_seed=None)

Generate words from the model.

Parameters:

num_words (int) – How many words to generate. By default 1.
text_seed – Generation can be conditioned on preceding context.
random_seed – A random seed or an instance of random.Random. If provided, makes the random sampling part of generation reproducible.

Returns:

One (str) word or a list of words generated from model.

Examples:

>>> from nltk.lm import MLE
>>> lm = MLE(2)
>>> lm.fit([[("a", "b"), ("b", "c")]], vocabulary_text=['a', 'b', 'c'])
>>> lm.fit([[("a",), ("b",), ("c",)]])
>>> lm.generate(random_seed=3)
'a'
>>> lm.generate(text_seed=['a'])
'b'
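The role of random_seed can be illustrated with a standalone sketch of count-weighted sampling. It assumes, as the parameter description suggests, that either a seed value or a `random.Random` instance may be passed; `toy_generate` and the counts are hypothetical and do not reproduce NLTK's sampling code.

```python
import random


def toy_generate(word_counts, random_seed=None):
    """Sample one word in proportion to its count."""
    # Accept either a seed value or a ready-made random.Random instance.
    if isinstance(random_seed, random.Random):
        rng = random_seed
    else:
        rng = random.Random(random_seed)
    # Fixed word order, so a given seed always maps to the same word.
    words = sorted(word_counts)
    weights = [word_counts[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


counts_after_a = {"b": 3, "c": 1}  # hypothetical counts of words seen after "a"
first = toy_generate(counts_after_a, random_seed=3)
second = toy_generate(counts_after_a, random_seed=3)
print(first == second)  # True: same seed, same sample
```

Sorting the candidate words before sampling matters for reproducibility: without a fixed order, the same seed could select different words on different runs.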

logscore(word, context=None)

Evaluate the log score of this word in this context.

The arguments are the same as for score and unmasked_score.
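A minimal sketch of the relationship between a score and its log score, assuming base 2 (consistent with perplexity being 2 ** cross-entropy) and assuming a zero score maps to negative infinity. `toy_logscore` is a hypothetical helper for illustration.

```python
import math


def toy_logscore(score):
    """Base-2 log of a probability, mapping zero to negative infinity."""
    if score == 0.0:
        return float("-inf")  # math.log2(0) would raise ValueError
    return math.log2(score)


print(toy_logscore(0.25))  # -2.0
print(toy_logscore(0.0))   # -inf
```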

perplexity(text_ngrams)

Calculates the perplexity of the given text.

This is simply 2 ** cross-entropy for the text, so the arguments are the same as for entropy.
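As a worked example of that relationship, the sketch below computes perplexity from a list of made-up per-word probabilities. `toy_perplexity` is a hypothetical standalone helper, not NLTK code.

```python
import math


def toy_perplexity(probabilities):
    """2 ** cross-entropy, with cross-entropy as the mean negative log2 probability."""
    entropy = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2.0 ** entropy


# A model that assigns probability 1/4 to every word is as uncertain as a
# uniform choice among 4 words, so its perplexity is 4.
print(toy_perplexity([0.25, 0.25, 0.25]))  # 4.0
```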

score(word, context=None)

Masks out-of-vocabulary (OOV) words and computes their model score.

For model-specific logic of calculating scores, see the unmasked_score method.
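The division of labour between the two methods can be sketched as follows: masking happens first, then the model-specific scorer is called on the masked arguments. The snippet is a standalone illustration; the `<UNK>` label, the toy vocabulary, and the `mask` and `toy_score` helpers are assumptions made for this sketch.

```python
UNK = "<UNK>"  # assumed OOV label for this sketch
vocab = {"a", "b", "c", UNK}


def mask(words):
    """Replace every word outside the vocabulary with the OOV label."""
    return tuple(w if w in vocab else UNK for w in words)


def toy_score(word, context=None, unmasked_score=None):
    """Mask the word and context, then delegate to the model-specific scorer."""
    masked_word = mask([word])[0]
    masked_context = mask(context) if context else None
    return unmasked_score(masked_word, masked_context)


def echo(word, context=None):
    """Stand-in scorer that just reports the arguments it received."""
    return (word, context)


print(toy_score("zebra", ("a", "yak"), unmasked_score=echo))
# ('<UNK>', ('a', '<UNK>'))
```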

abstract unmasked_score(word, context=None)

Score a word given some optional context.

Concrete models are expected to provide an implementation. Note that this method does not mask its arguments with the OOV label. Use the score method for that.

Parameters:

word (str) – Word for which we want the score.
context (tuple(str) or None) – Context the word is in. If None, compute unigram score.

Return type:

float
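To illustrate what "concrete models are expected to provide an implementation" means in practice, here is a standalone sketch using Python's `abc` module. `ToyLanguageModel` and `ToyUnigramMLE` are hypothetical stand-ins and are deliberately much simpler than the real interface.

```python
from abc import ABC, abstractmethod
from collections import Counter


class ToyLanguageModel(ABC):
    """Stand-in for an abstract language-model interface (hypothetical)."""

    @abstractmethod
    def unmasked_score(self, word, context=None):
        """Return the score of `word` given `context`."""


class ToyUnigramMLE(ToyLanguageModel):
    """Concrete model: relative frequency of each word, ignoring context."""

    def __init__(self, words):
        self.counts = Counter(words)
        self.total = sum(self.counts.values())

    def unmasked_score(self, word, context=None):
        return self.counts[word] / self.total


model = ToyUnigramMLE(["a", "b", "a", "c"])
print(model.unmasked_score("a"))  # 0.5

try:
    ToyLanguageModel()  # the abstract class itself cannot be instantiated
except TypeError as err:
    print("TypeError:", err)
```

Note that the toy scorer does no OOV masking of its own: an unseen word simply gets a count of zero.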

class nltk.lm.api.Smoothing

Bases: object

Ngram Smoothing Interface

Implements Chen & Goodman 1995’s idea that all smoothing algorithms have certain features in common. This should ideally allow smoothing algorithms to work both with Backoff and Interpolation.

__init__(vocabulary, counter)

Parameters:

vocabulary (nltk.lm.vocab.Vocabulary) – The Ngram vocabulary object.
counter (nltk.lm.counter.NgramCounter) – The counts of the vocabulary items.

abstract alpha_gamma(word, context)

abstract unigram_score(word)
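One way to read this interface is that alpha_gamma returns the higher-order contribution (alpha) and the weight given to the lower-order estimate (gamma), while unigram_score supplies the base case. The standalone sketch below shows how an interpolated model might combine them, using fixed-weight linear interpolation (Jelinek-Mercer style) as a simple stand-in. The counts, the `LAMBDA` weight, and all function bodies are assumptions made for illustration, not one of NLTK's smoothing classes.

```python
from collections import Counter

LAMBDA = 0.3  # hypothetical fixed weight given to the lower-order estimate

unigrams = Counter(["a", "b", "a", "c"])
bigrams = {("a",): Counter({"b": 1, "c": 1}), ("b",): Counter({"a": 1})}


def unigram_score(word):
    """Relative frequency of the word."""
    return unigrams[word] / sum(unigrams.values())


def alpha_gamma(word, context):
    """alpha: down-weighted higher-order estimate; gamma: weight of the lower order."""
    following = bigrams[context]
    alpha = (1 - LAMBDA) * following[word] / sum(following.values())
    return alpha, LAMBDA


def interpolated_score(word, context=None):
    """Combine the two pieces: alpha + gamma * lower-order score."""
    if not context or context not in bigrams:
        return unigram_score(word)
    alpha, gamma = alpha_gamma(word, context)
    return alpha + gamma * unigram_score(word)


print(interpolated_score("b", ("a",)))
```

Because alpha sums to 1 - LAMBDA over the vocabulary and the unigram scores sum to 1, the interpolated scores for a seen context still form a probability distribution.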
