nltk.metrics.association module
Provides scoring functions for a number of association measures through a generic, abstract implementation in NgramAssocMeasures, and n-specific implementations in BigramAssocMeasures, TrigramAssocMeasures and QuadgramAssocMeasures.
- nltk.metrics.association.NGRAM = 0
Marginals index for the ngram count
- nltk.metrics.association.UNIGRAMS = -2
Marginals index for a tuple of each unigram count
- nltk.metrics.association.TOTAL = -1
Marginals index for the number of words in the data
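To illustrate why UNIGRAMS and TOTAL are negative indices, the sketch below redefines the three constants locally with the values documented above and applies them to marginals tuples of different lengths. The counts are hypothetical, chosen only for illustration.

```python
# Local copies of the documented index values (not imported from nltk).
NGRAM = 0
UNIGRAMS = -2
TOTAL = -1

# Hypothetical counts, laid out as in the score_fn argument order:
# ngram count first, unigram counts second to last, total last.
bigram_marginals = (8, (120, 45), 10000)
trigram_marginals = (3, (8, 5, 6), (120, 45, 60), 10000)

for marginals in (bigram_marginals, trigram_marginals):
    # The same three indices work regardless of the ngram order,
    # because the unigram tuple and the total are counted from the end.
    print(marginals[NGRAM], marginals[UNIGRAMS], marginals[TOTAL])
```

Because the intermediate tuples sit between the first and the second-to-last position, a generic measure can read the ngram count, the unigram counts and the total without knowing n.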
- class nltk.metrics.association.NgramAssocMeasures
Bases: object
An abstract class defining a collection of generic association measures. Each public method returns a score, taking the following arguments:
score_fn(count_of_ngram, (count_of_n-1gram_1, ..., count_of_n-1gram_j), (count_of_n-2gram_1, ..., count_of_n-2gram_k), ..., (count_of_1gram_1, ..., count_of_1gram_n), count_of_total_words)
See BigramAssocMeasures and TrigramAssocMeasures.
Inheriting classes should define a property _n, and a method _contingency which calculates contingency values from marginals in order for all association measures defined here to be usable.
- classmethod student_t(*marginals)
Scores ngrams using Student’s t test with independence hypothesis for unigrams, as in Manning and Schutze 5.3.1.
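As a rough illustration of the t test under the independence hypothesis, the following sketch computes (observed - expected) / sqrt(observed), where the expected count is the product of the unigram counts divided by the total raised to n - 1. The function name student_t_sketch and the counts are hypothetical, and this is a textbook-style approximation rather than the library's own code; it assumes a nonzero ngram count.

```python
import math

def student_t_sketch(n_gram, unigram_counts, n_total):
    """Approximate t score for an ngram, assuming independent unigrams."""
    n = len(unigram_counts)
    # Expected ngram count if the words occurred independently.
    expected = math.prod(unigram_counts) / n_total ** (n - 1)
    # The sample variance is approximated by the observed count.
    return (n_gram - expected) / math.sqrt(n_gram)

# Illustrative bigram counts: n_ii, (n_ix, n_xi), n_xx.
t = student_t_sketch(8, (15828, 4675), 14307668)
print(round(t, 4))
```

A score this close to 1 is well below the usual critical values, so these counts would not support rejecting independence.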
- classmethod chi_sq(*marginals)
Scores ngrams using Pearson’s chi-square as in Manning and Schutze 5.3.3.
- static mi_like(*marginals, **kwargs)
Scores ngrams using a variant of mutual information. The keyword argument power sets an exponent (default 3) for the numerator. No logarithm of the result is calculated.
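The sketch below shows one plausible reading of this description: the ngram count raised to power, divided by the product of the unigram counts. The denominator is an assumption, since the text above only specifies the numerator's exponent, and mi_like_sketch is a hypothetical name.

```python
import math

def mi_like_sketch(n_gram, unigram_counts, power=3):
    """MI-like score: ngram count to a power over the unigram product.

    The denominator is an assumption made for this sketch.
    """
    return n_gram ** power / math.prod(unigram_counts)

# Hypothetical bigram counts: n_ii = 8, (n_ix, n_xi) = (120, 45).
default_score = mi_like_sketch(8, (120, 45))
squared_score = mi_like_sketch(8, (120, 45), power=2)
print(default_score, squared_score)
```

Raising the exponent rewards frequent ngrams more strongly, which offsets the tendency of plain mutual information to favour rare events.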
- classmethod pmi(*marginals)
Scores ngrams by pointwise mutual information, as in Manning and Schutze 5.4.
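Pointwise mutual information compares the observed ngram probability with the probability expected under independence. A minimal sketch from raw counts, using the hypothetical helper pmi_sketch and illustrative counts, might look like this:

```python
import math

def pmi_sketch(n_gram, unigram_counts, n_total):
    """log2 of P(ngram) over the product of the unigram probabilities."""
    n = len(unigram_counts)
    # P(ngram) / prod(P(w_i)) simplifies to
    # n_gram * n_total**(n - 1) / prod(unigram_counts).
    return math.log2(n_gram * n_total ** (n - 1)) - math.log2(math.prod(unigram_counts))

# Illustrative counts: the bigram occurs every time its second word does.
score = pmi_sketch(20, (42, 20), 14307668)
print(round(score, 2))
```

Rare but perfectly co-occurring pairs score very high here, which is why PMI is often combined with a frequency cut-off.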
- class nltk.metrics.association.BigramAssocMeasures
Bases: nltk.metrics.association.NgramAssocMeasures
A collection of bigram association measures. Each association measure is provided as a function with three arguments:
bigram_score_fn(n_ii, (n_ix, n_xi), n_xx)
The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:
n_ii counts (w1, w2), i.e. the bigram being scored
n_ix counts (w1, *)
n_xi counts (*, w2)
n_xx counts (*, *), i.e. any bigram
This may be shown with respect to a contingency table:
        w1    ~w1
     ------ ------
 w2 | n_ii | n_oi | = n_xi
     ------ ------
~w2 | n_io | n_oo |
     ------ ------
     = n_ix        TOTAL = n_xx
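The table's row and column sums imply how the inner cells follow from the marginals. The sketch below derives them with a hypothetical helper, bigram_contingency_sketch, and made-up counts; it is an illustration of the arithmetic, not the library's _contingency method.

```python
def bigram_contingency_sketch(n_ii, n_ix_xi, n_xx):
    """Return (n_ii, n_oi, n_io, n_oo) from bigram marginals."""
    n_ix, n_xi = n_ix_xi
    n_oi = n_xi - n_ii  # w2 preceded by a word other than w1
    n_io = n_ix - n_ii  # w1 followed by a word other than w2
    n_oo = n_xx - n_ii - n_oi - n_io  # bigrams containing neither
    return n_ii, n_oi, n_io, n_oo

# Hypothetical counts: n_ii = 8, (n_ix, n_xi) = (120, 45), n_xx = 10000.
cells = bigram_contingency_sketch(8, (120, 45), 10000)
print(cells)
```

The four cells always sum back to n_xx, which is a quick sanity check on any set of marginals.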
- classmethod phi_sq(*marginals)
Scores bigrams using phi-square, the square of the Pearson correlation coefficient.
- classmethod chi_sq(n_ii, n_ix_xi_tuple, n_xx)
Scores bigrams using chi-square, i.e. phi-sq multiplied by the number of bigrams, as in Manning and Schutze 5.3.3.
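The relationship between the two scores can be sketched from the 2x2 contingency cells. The helper names phi_sq_sketch and chi_sq_sketch are hypothetical, the counts are illustrative, and the formula is the standard one for a 2x2 table rather than a copy of the library's code.

```python
def phi_sq_sketch(n_ii, n_ix_xi, n_xx):
    """Phi-square for a bigram, computed from its marginals."""
    n_ix, n_xi = n_ix_xi
    # Recover the inner cells of the contingency table.
    n_oi = n_xi - n_ii
    n_io = n_ix - n_ii
    n_oo = n_xx - n_ii - n_oi - n_io
    numerator = (n_ii * n_oo - n_io * n_oi) ** 2
    denominator = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    return numerator / denominator

def chi_sq_sketch(n_ii, n_ix_xi, n_xx):
    """Chi-square as phi-square scaled by the number of bigrams."""
    return n_xx * phi_sq_sketch(n_ii, n_ix_xi, n_xx)

phi = phi_sq_sketch(8, (15828, 4675), 14307668)
chi = chi_sq_sketch(8, (15828, 4675), 14307668)
print(phi, chi)
```

Since phi-square is bounded by 1, it is comparable across corpora of different sizes, whereas chi-square grows with the number of bigrams.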
- class nltk.metrics.association.TrigramAssocMeasures
Bases: nltk.metrics.association.NgramAssocMeasures
A collection of trigram association measures. Each association measure is provided as a function with four arguments:
trigram_score_fn(n_iii, (n_iix, n_ixi, n_xii), (n_ixx, n_xix, n_xxi), n_xxx)
The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:
n_iii counts (w1, w2, w3), i.e. the trigram being scored
n_ixx counts (w1, *, *)
n_xxx counts (*, *, *), i.e. any trigram
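To make the suffix notation concrete, the sketch below counts each pattern directly over the trigrams of a tiny made-up token list, with None standing in for the wildcard. The helper count_pattern and the data are hypothetical, and a collocation finder may derive its marginals differently (for example from word frequencies).

```python
tokens = "a b c a b d a b c e b c".split()
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
w1, w2, w3 = "a", "b", "c"

def count_pattern(pattern):
    """Count trigrams matching the pattern; None matches any word."""
    return sum(
        all(p is None or p == t for p, t in zip(pattern, trigram))
        for trigram in trigrams
    )

n_iii = count_pattern((w1, w2, w3))
pair_counts = (  # (n_iix, n_ixi, n_xii)
    count_pattern((w1, w2, None)),
    count_pattern((w1, None, w3)),
    count_pattern((None, w2, w3)),
)
single_counts = (  # (n_ixx, n_xix, n_xxi)
    count_pattern((w1, None, None)),
    count_pattern((None, w2, None)),
    count_pattern((None, None, w3)),
)
n_xxx = len(trigrams)

# Same argument order as trigram_score_fn above.
marginals = (n_iii, pair_counts, single_counts, n_xxx)
print(marginals)
```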
- class nltk.metrics.association.QuadgramAssocMeasures
Bases: nltk.metrics.association.NgramAssocMeasures
A collection of quadgram association measures. Each association measure is provided as a function with five arguments:
quadgram_score_fn(n_iiii, (n_iiix, n_iixi, n_ixii, n_xiii), (n_iixx, n_ixix, n_ixxi, n_xixi, n_xxii, n_xiix), (n_ixxx, n_xixx, n_xxix, n_xxxi), n_xxxx)
The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:
n_iiii counts (w1, w2, w3, w4), i.e. the quadgram being scored
n_ixxi counts (w1, *, *, w4)
n_xxxx counts (*, *, *, *), i.e. any quadgram