nltk.metrics.association module

Provides scoring functions for a number of association measures through a generic, abstract implementation in NgramAssocMeasures and the n-specific subclasses BigramAssocMeasures, TrigramAssocMeasures, and QuadgramAssocMeasures.

class nltk.metrics.association.BigramAssocMeasures[source]

Bases: NgramAssocMeasures

A collection of bigram association measures. Each association measure is provided as a function with three arguments:

bigram_score_fn(n_ii, (n_ix, n_xi), n_xx)

The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:

  • n_ii counts (w1, w2), i.e. the bigram being scored

  • n_ix counts (w1, *)

  • n_xi counts (*, w2)

  • n_xx counts (*, *), i.e. any bigram
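As a rough illustration of the four counts listed above, the following sketch tallies them from a toy token list. The helper name `bigram_marginals` and the sample sentence are hypothetical and not part of NLTK; the counts follow the definitions given here (bigram positions), which is an assumption about how the marginals are collected in practice.

```python
from collections import Counter


def bigram_marginals(tokens, w1, w2):
    """Return (n_ii, (n_ix, n_xi), n_xx) for the bigram (w1, w2).

    Hypothetical helper, not an NLTK function.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(bigrams)
    first_counts = Counter(a for a, _ in bigrams)   # counts of (w, *)
    second_counts = Counter(b for _, b in bigrams)  # counts of (*, w)
    n_ii = pair_counts[(w1, w2)]   # the bigram being scored
    n_ix = first_counts[w1]        # (w1, *)
    n_xi = second_counts[w2]       # (*, w2)
    n_xx = len(bigrams)            # any bigram
    return n_ii, (n_ix, n_xi), n_xx


tokens = "the cat sat on the mat the cat ran".split()
marginals = bigram_marginals(tokens, "the", "cat")
print(marginals)
```

The returned tuple has the same shape as the three arguments of bigram_score_fn, so it could be unpacked into a scoring function with `*marginals`.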

This may be shown with respect to a contingency table:

        w1    ~w1
     ------ ------
 w2 | n_ii | n_oi | = n_xi
     ------ ------
~w2 | n_io | n_oo |
     ------ ------
     = n_ix        TOTAL = n_xx

classmethod chi_sq(n_ii, n_ix_xi_tuple, n_xx)[source]

Scores bigrams using chi-square, i.e. phi-sq multiplied by the number of bigrams, as in Manning and Schutze 5.3.3.
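The relationship between the marginals, the contingency cells, phi-square and chi-square can be sketched as follows. This is a plain-Python illustration of the standard 2x2 formulas under the table layout shown above, not the library's source; the function names are hypothetical and the counts are illustrative.

```python
def bigram_contingency(n_ii, n_ix_xi, n_xx):
    """Derive the four contingency cells from bigram marginals."""
    n_ix, n_xi = n_ix_xi
    n_oi = n_xi - n_ii                # w2 present, w1 absent
    n_io = n_ix - n_ii                # w1 present, w2 absent
    n_oo = n_xx - n_ii - n_oi - n_io  # neither word present
    return n_ii, n_oi, n_io, n_oo


def phi_sq_sketch(n_ii, n_ix_xi, n_xx):
    """Phi-square: squared Pearson correlation of the 2x2 table."""
    n_ii, n_oi, n_io, n_oo = bigram_contingency(n_ii, n_ix_xi, n_xx)
    numerator = (n_ii * n_oo - n_io * n_oi) ** 2
    denominator = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    return numerator / denominator


def chi_sq_sketch(n_ii, n_ix_xi, n_xx):
    """Chi-square as phi-square multiplied by the number of bigrams."""
    return n_xx * phi_sq_sketch(n_ii, n_ix_xi, n_xx)


score = chi_sq_sketch(8, (15828, 4675), 14307668)
print(round(score, 2))  # approximately 1.55
```

A score this small is well below the usual 3.84 critical value for one degree of freedom at the 0.05 level, so these counts would not indicate a significant association.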

static dice(n_ii, n_ix_xi_tuple, n_xx)[source]

Scores bigrams using Dice’s coefficient.
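For reference, Dice's coefficient on bigram marginals is usually written as 2 * n_ii / (n_ix + n_xi). The sketch below assumes that standard form; `dice_sketch` is a hypothetical name, and n_xx is accepted only to mirror the three-argument layout.

```python
def dice_sketch(n_ii, n_ix_xi, n_xx):
    """Dice's coefficient from bigram marginals (n_xx is unused)."""
    n_ix, n_xi = n_ix_xi
    return 2 * n_ii / (n_ix + n_xi)


score = dice_sketch(2, (3, 2), 8)
print(score)  # 2 * 2 / (3 + 2) = 0.8
```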

classmethod fisher(*marginals)[source]

Scores bigrams using Fisher’s Exact Test (Pedersen 1996). The test is less sensitive to small counts than PMI or chi-square, but also more expensive to compute. Requires SciPy.

classmethod phi_sq(*marginals)[source]

Scores bigrams using phi-square, the square of the Pearson correlation coefficient.

class nltk.metrics.association.ContingencyMeasures[source]

Bases: object

Wraps NgramAssocMeasures classes such that the arguments of association measures are contingency table values rather than marginals.

__init__(measures)[source]

Constructs a ContingencyMeasures wrapper from an NgramAssocMeasures class.
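The idea behind such a wrapper can be sketched in plain Python: convert the contingency cells back to marginals, then call a marginal-based scoring function. All names below are hypothetical, and the cell order (n_ii, n_oi, n_io, n_oo) is an assumption made for this illustration.

```python
def marginals_from_contingency(n_ii, n_oi, n_io, n_oo):
    """Convert 2x2 contingency cells into bigram marginals."""
    n_ix = n_ii + n_io                 # all bigrams starting with w1
    n_xi = n_ii + n_oi                 # all bigrams ending with w2
    n_xx = n_ii + n_oi + n_io + n_oo   # all bigrams
    return n_ii, (n_ix, n_xi), n_xx


def wrap_for_contingency(score_fn):
    """Return a version of score_fn that accepts contingency cells."""
    def wrapped(*cells):
        return score_fn(*marginals_from_contingency(*cells))
    return wrapped


def raw_freq_sketch(n_ii, n_ix_xi, n_xx):
    """Relative frequency of the bigram, from marginals."""
    return n_ii / n_xx


raw_freq_from_cells = wrap_for_contingency(raw_freq_sketch)
print(marginals_from_contingency(2, 0, 1, 5))
print(raw_freq_from_cells(2, 0, 1, 5))
```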

nltk.metrics.association.NGRAM = 0

Marginals index for the ngram count.

class nltk.metrics.association.NgramAssocMeasures[source]

Bases: object

An abstract class defining a collection of generic association measures. Each public method returns a score, taking the following arguments:

score_fn(count_of_ngram,
         (count_of_n-1gram_1, ..., count_of_n-1gram_j),
         (count_of_n-2gram_1, ..., count_of_n-2gram_k),
         ...,
         (count_of_1gram_1, ..., count_of_1gram_n),
         count_of_total_words)

See BigramAssocMeasures and TrigramAssocMeasures for concrete argument layouts.

Inheriting classes should define a property _n and a method _contingency that calculates contingency values from marginals, so that all association measures defined here are usable.

classmethod chi_sq(*marginals)[source]

Scores ngrams using Pearson’s chi-square as in Manning and Schutze 5.3.3.

classmethod jaccard(*marginals)[source]

Scores ngrams using the Jaccard index.
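For the bigram case, the Jaccard index is commonly computed as the joint count divided by the count of bigrams containing w1 in first position or w2 in second position. The sketch below assumes that form; `jaccard_sketch` is a hypothetical name, not the library's code.

```python
def jaccard_sketch(n_ii, n_ix_xi, n_xx):
    """Jaccard index for a bigram, from its marginals (n_xx is unused)."""
    n_ix, n_xi = n_ix_xi
    n_io = n_ix - n_ii  # w1 without w2
    n_oi = n_xi - n_ii  # w2 without w1
    return n_ii / (n_ii + n_io + n_oi)


score = jaccard_sketch(2, (3, 2), 8)
print(score)  # 2 / (2 + 1 + 0)
```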

classmethod likelihood_ratio(*marginals)[source]

Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4.
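One common way to compute a likelihood-ratio score for a bigram is the G-squared statistic, 2 * sum(O * ln(O / E)), over the four contingency cells. The sketch below assumes that formulation and skips empty cells (treating 0 * ln 0 as 0); it is an illustration with a hypothetical function name and illustrative counts, and the library may handle zero counts differently.

```python
import math


def likelihood_ratio_sketch(n_ii, n_ix_xi, n_xx):
    """G-squared log-likelihood ratio for a bigram, from its marginals."""
    n_ix, n_xi = n_ix_xi
    observed = [
        n_ii,                       # w1 and w2
        n_xi - n_ii,                # w2 without w1
        n_ix - n_ii,                # w1 without w2
        n_xx - n_ix - n_xi + n_ii,  # neither
    ]
    # Expected counts if w1 and w2 occurred independently.
    expected = [
        n_ix * n_xi / n_xx,
        (n_xx - n_ix) * n_xi / n_xx,
        n_ix * (n_xx - n_xi) / n_xx,
        (n_xx - n_ix) * (n_xx - n_xi) / n_xx,
    ]
    return 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


score = likelihood_ratio_sketch(110, (2552, 221), 31777)
print(round(score, 2))  # approximately 270.72
```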

static mi_like(*marginals, **kwargs)[source]

Scores ngrams using a variant of mutual information. The keyword argument power sets an exponent (default 3) for the numerator. No logarithm of the result is calculated.
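Read literally for bigrams, the description above amounts to raising the joint count to `power` and dividing by the product of the unigram counts. The sketch below assumes exactly that reading; `mi_like_sketch` is a hypothetical name.

```python
def mi_like_sketch(n_ii, n_ix_xi, n_xx, power=3):
    """MI-like score: n_ii ** power over the product of unigram counts.

    No logarithm is taken, and n_xx is unused.
    """
    n_ix, n_xi = n_ix_xi
    return n_ii ** power / (n_ix * n_xi)


print(mi_like_sketch(2, (4, 5), 100))           # 8 / 20
print(mi_like_sketch(2, (4, 5), 100, power=2))  # 4 / 20
```

Raising the exponent rewards frequent ngrams more strongly, which counteracts the tendency of mutual information to favour rare pairs.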

classmethod pmi(*marginals)[source]

Scores ngrams by pointwise mutual information, as in Manning and Schutze 5.4.
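For bigrams, pointwise mutual information is log2(P(w1, w2) / (P(w1) * P(w2))), which in terms of counts is log2(n_ii * n_xx / (n_ix * n_xi)). The sketch below is a plain-Python illustration of that formula with a hypothetical function name, not the library's implementation.

```python
import math


def pmi_sketch(n_ii, n_ix_xi, n_xx):
    """Pointwise mutual information (base 2) for a bigram."""
    n_ix, n_xi = n_ix_xi
    return math.log2(n_ii * n_xx / (n_ix * n_xi))


print(pmi_sketch(2, (2, 2), 8))  # log2(16 / 4) = 2.0
print(pmi_sketch(1, (2, 2), 4))  # independent counts give 0.0
```

With NLTK installed, the corresponding call would take the same argument layout, for example `BigramAssocMeasures.pmi(2, (2, 2), 8)`.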

classmethod poisson_stirling(*marginals)[source]

Scores ngrams using the Poisson-Stirling measure.

static raw_freq(*marginals)[source]

Scores ngrams by their relative frequency, i.e. the ngram count divided by the total count.

classmethod student_t(*marginals)[source]

Scores ngrams using Student’s t test with independence hypothesis for unigrams, as in Manning and Schutze 5.3.1.
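Under the independence hypothesis, the expected bigram count is n_ix * n_xi / n_xx, and the t score compares the observed count with it, using the observed count as a variance estimate. The sketch below assumes that common simplification; the function name is hypothetical, the counts are illustrative, and the library may add a small constant to avoid division by zero.

```python
import math


def student_t_sketch(n_ii, n_ix_xi, n_xx):
    """Student's t score for a bigram against unigram independence."""
    n_ix, n_xi = n_ix_xi
    expected = n_ix * n_xi / n_xx  # count expected if w1, w2 independent
    return (n_ii - expected) / math.sqrt(n_ii)


score = student_t_sketch(8, (15828, 4675), 14307668)
print(round(score, 4))  # approximately 0.9999
```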

class nltk.metrics.association.QuadgramAssocMeasures[source]

Bases: NgramAssocMeasures

A collection of quadgram association measures. Each association measure is provided as a function with five arguments:

quadgram_score_fn(n_iiii,
                (n_iiix, n_iixi, n_ixii, n_xiii),
                (n_iixx, n_ixix, n_ixxi, n_xixi, n_xxii, n_xiix),
                (n_ixxx, n_xixx, n_xxix, n_xxxi),
                n_all)

The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:

  • n_iiii counts (w1, w2, w3, w4), i.e. the quadgram being scored

  • n_ixxi counts (w1, *, *, w4)

  • n_xxxx counts (*, *, *, *), i.e. any quadgram (passed as n_all in the signature above)

nltk.metrics.association.TOTAL = -1

Marginals index for the number of words in the data.

class nltk.metrics.association.TrigramAssocMeasures[source]

Bases: NgramAssocMeasures

A collection of trigram association measures. Each association measure is provided as a function with four arguments:

trigram_score_fn(n_iii,
                 (n_iix, n_ixi, n_xii),
                 (n_ixx, n_xix, n_xxi),
                 n_xxx)

The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:

  • n_iii counts (w1, w2, w3), i.e. the trigram being scored

  • n_ixx counts (w1, *, *)

  • n_xxx counts (*, *, *), i.e. any trigram
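The suffix notation can be read as a pattern over trigram positions: i means "must match the target word", x means "any word". The sketch below counts trigrams matching such a pattern in a toy token list. The helper `count_pattern` and the sample data are hypothetical and only illustrate the notation.

```python
def count_pattern(trigrams, target, pattern):
    """Count trigrams that agree with target wherever pattern has 'i'."""
    return sum(
        all(p == "x" or word == want
            for p, word, want in zip(pattern, trigram, target))
        for trigram in trigrams
    )


tokens = "a b c a b d a e c".split()
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
target = ("a", "b", "c")

n_iii = count_pattern(trigrams, target, "iii")  # (a, b, c)
n_iix = count_pattern(trigrams, target, "iix")  # (a, b, *)
n_ixx = count_pattern(trigrams, target, "ixx")  # (a, *, *)
n_xxi = count_pattern(trigrams, target, "xxi")  # (*, *, c)
n_xxx = count_pattern(trigrams, target, "xxx")  # (*, *, *)
print(n_iii, n_iix, n_ixx, n_xxi, n_xxx)
```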

nltk.metrics.association.UNIGRAMS = -2

Marginals index for a tuple of each unigram count.
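The three index constants documented in this module (NGRAM = 0, UNIGRAMS = -2, TOTAL = -1) address positions in a marginals tuple. Because the unigram and total positions are counted from the end, the same indices apply whatever the ngram order. The sketch below redefines the constants locally so that it runs without NLTK; the marginal values are illustrative.

```python
# Local copies of the documented constants, so the sketch is self-contained.
NGRAM = 0      # index of the ngram count
UNIGRAMS = -2  # index of the tuple of unigram counts
TOTAL = -1     # index of the total count

bigram_marginals = (8, (15828, 4675), 14307668)
trigram_marginals = (1, (2, 1, 1), (3, 2, 2), 7)

for marginals in (bigram_marginals, trigram_marginals):
    print(marginals[NGRAM], marginals[UNIGRAMS], marginals[TOTAL])
```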

nltk.metrics.association.fisher_exact(*_args, **_kwargs)[source]