nltk.collocations module

Tools to identify collocations (words that often appear consecutively) within corpora. They may also be used to find other associations between word occurrences. See Manning and Schütze ch. 5 at https://nlp.stanford.edu/fsnlp/promo/colloc.pdf and the Text::NSP Perl package at http://ngram.sourceforge.net.

Finding collocations requires first calculating the frequencies of words and of their appearance in the context of other words. Often the collection of words will then require filtering to retain only useful content terms. Each ngram of words may then be scored according to some association measure, in order to determine the relative likelihood of each ngram being a collocation.

The BigramCollocationFinder and TrigramCollocationFinder classes provide this functionality, given a function that scores an ngram from the appropriate frequency counts. A number of standard association measures are provided in bigram_measures and trigram_measures.
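The count, filter, and score steps described above can be sketched as follows. This is a minimal illustration on a toy token list, not a recommended configuration: the sample sentence, the punctuation filter, and the frequency threshold of 2 are assumptions made for the example. The filtering and ranking methods used here (apply_word_filter, apply_freq_filter, nbest) are inherited from AbstractCollocationFinder, and the PMI measure is taken from nltk.metrics.BigramAssocMeasures.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# A toy token sequence (an assumption for this sketch); a real corpus is far larger.
tokens = (
    "new york is large . new york is busy . "
    "she likes new york . the city is large ."
).split()

# 1. Count words and adjacent bigrams.
finder = BigramCollocationFinder.from_words(tokens)

# 2. Filter: drop bigrams that contain punctuation, then bigrams seen fewer than twice.
finder.apply_word_filter(lambda w: w == ".")
finder.apply_freq_filter(2)

# 3. Score the remaining bigrams with PMI and keep the three best candidates.
top = finder.nbest(BigramAssocMeasures.pmi, 3)
print(top)
```

On so small a sample the ranking itself means little; the point is the order of operations, since filters only affect which candidates are scored, not the underlying counts.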

class nltk.collocations.BigramCollocationFinder[source]

Bases: AbstractCollocationFinder

A tool for the finding and ranking of bigram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly.

__init__(word_fd, bigram_fd, window_size=2)[source]

Construct a BigramCollocationFinder, given FreqDists for appearances of words and (possibly non-contiguous) bigrams.

default_ws = 2

classmethod from_words(words, window_size=2)[source]

Construct a BigramCollocationFinder for all bigrams in the given sequence. When window_size > 2, count non-contiguous bigrams, in the style of Church and Hanks’s (1990) association ratio.
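To make the effect of window_size concrete, here is a standard-library sketch of windowed pair counting. It is an illustration of the idea only, not NLTK's source code; the function name count_windowed_bigrams is hypothetical.

```python
from collections import Counter

def count_windowed_bigrams(words, window_size=2):
    """Count each word once and pair it with every later word inside its window."""
    if window_size < 2:
        raise ValueError("window_size must be at least 2")
    word_counts = Counter()
    bigram_counts = Counter()
    for i, w1 in enumerate(words):
        word_counts[w1] += 1
        # Pair w1 with each of the next (window_size - 1) words, where present.
        for w2 in words[i + 1 : i + window_size]:
            bigram_counts[(w1, w2)] += 1
    return word_counts, bigram_counts

word_counts, bigram_counts = count_windowed_bigrams(["a", "b", "c", "d"], window_size=3)
print(sorted(bigram_counts))
```

With window_size=2 only adjacent pairs are counted; with window_size=3 the pairs ("a", "c") and ("b", "d") are counted as well.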

score_ngram(score_fn, w1, w2)[source]

Returns the score for a given bigram using the given scoring function. Following Church and Hanks (1990), counts are scaled by a factor of 1/(window_size - 1).
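The scaling can be illustrated with a small standard-library sketch. Pointwise mutual information is used here purely as an example measure, and the function name scaled_pmi and its parameters are hypothetical; the only part taken from the description above is the 1/(window_size - 1) factor applied to the pair count.

```python
import math

def scaled_pmi(pair_count, w1_count, w2_count, total_words, window_size=2):
    """PMI of a word pair, with the pair count scaled by 1/(window_size - 1)."""
    n_ii = pair_count / (window_size - 1)
    if n_ii == 0:
        return None  # an unseen pair has no score
    return math.log2(n_ii * total_words / (w1_count * w2_count))

# The same raw counts give a lower score under a wider window,
# because a wider window offers more chances for the pair to co-occur.
print(scaled_pmi(4, 4, 4, 16, window_size=2))
print(scaled_pmi(4, 4, 4, 16, window_size=3))
```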

class nltk.collocations.QuadgramCollocationFinder[source]

Bases: AbstractCollocationFinder

A tool for the finding and ranking of quadgram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly.

__init__(word_fd, quadgram_fd, ii, iii, ixi, ixxi, iixi, ixii)[source]

Construct a QuadgramCollocationFinder, given FreqDists for appearances of words, quadgrams, bigrams (ii), trigrams (iii), pairs of words with one word (ixi) or two words (ixxi) between them, and triples of words with one word skipped, in both variations (iixi and ixii).
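The parameter names can be read as position patterns over a four-word window, where i marks a word that is kept and x a word that is skipped. The sketch below spells that reading out for a single window. It is an interpretation inferred from the parameter names, and the helper quadgram_patterns is hypothetical, not part of NLTK.

```python
def quadgram_patterns(w1, w2, w3, w4):
    """Map each count-table name to the sub-tuple it would record for one window."""
    return {
        "ii": (w1, w2),        # adjacent pair
        "iii": (w1, w2, w3),   # adjacent triple
        "ixi": (w1, w3),       # pair with one word between
        "ixxi": (w1, w4),      # pair with two words between
        "iixi": (w1, w2, w4),  # triple with the third word skipped
        "ixii": (w1, w3, w4),  # triple with the second word skipped
    }

for name, key in quadgram_patterns("a", "b", "c", "d").items():
    print(name, key)
```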

default_ws = 4

classmethod from_words(words, window_size=4)[source]

score_ngram(score_fn, w1, w2, w3, w4)[source]

class nltk.collocations.TrigramCollocationFinder[source]

Bases: AbstractCollocationFinder

A tool for the finding and ranking of trigram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly.

__init__(word_fd, bigram_fd, wildcard_fd, trigram_fd)[source]

Construct a TrigramCollocationFinder, given FreqDists for appearances of words, bigrams, two words with any word between them, and trigrams.

bigram_finder()[source]

Constructs a bigram collocation finder with the bigram and unigram data from this finder. Note that this does not include any filtering applied to this finder.

default_ws = 3

classmethod from_words(words, window_size=3)[source]

Construct a TrigramCollocationFinder for all trigrams in the given sequence.

score_ngram(score_fn, w1, w2, w3)[source]

Returns the score for a given trigram using the given scoring function.
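A short usage sketch for the trigram finder follows. The token list is a made-up example, and PMI from nltk.metrics.TrigramAssocMeasures is chosen only for illustration; any scoring function that accepts the trigram frequency counts could be passed instead.

```python
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

# A toy token sequence (an assumption for this sketch).
tokens = "the new york times . the new york post . a new york minute .".split()

finder = TrigramCollocationFinder.from_words(tokens)

# Score one specific trigram with a standard association measure.
score = finder.score_ngram(TrigramAssocMeasures.pmi, "the", "new", "york")

# Derive a bigram finder that reuses this finder's word and bigram counts.
bigram_finder = finder.bigram_finder()
pair_count = bigram_finder.ngram_fd[("new", "york")]

print(score, pair_count)
```

Because bigram_finder() does not carry over filtering, any filters applied to the trigram finder would need to be reapplied to the derived bigram finder.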