nltk.metrics package
Submodules
nltk.metrics.agreement module
Implementations of inter-annotator agreement coefficients surveyed by Artstein and Poesio (2007), Inter-Coder Agreement for Computational Linguistics.
An agreement coefficient calculates the amount that annotators agreed on label assignments beyond what is expected by chance.
In defining the AnnotationTask class, we use naming conventions similar to the paper’s terminology. There are three types of objects in an annotation task:
 the coders (variables “c” and “C”)
 the items to be annotated (variables “i” and “I”)
 the potential categories to be assigned (variables “k” and “K”)
Additionally, it is often the case that we don’t want to treat two different labels as complete disagreement, and so the AnnotationTask constructor can also take a distance metric as a final argument. Distance metrics are simply functions that take two arguments, and return a value between 0.0 and 1.0 indicating the distance between them. If not supplied, the default is binary comparison between the arguments.
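As a rough illustration of the shape such a distance function takes, the sketch below defines a binary comparison and a graded metric for ordinal labels. The names binary, ordinal_distance and SCALE are hypothetical and exist only for this example; they are not part of NLTK.

```python
# Hypothetical ordinal scale, used only for this illustration.
SCALE = ["low", "medium", "high"]


def binary(label1, label2):
    """Return 0.0 for identical labels and 1.0 otherwise."""
    return 0.0 if label1 == label2 else 1.0


def ordinal_distance(label1, label2):
    """Graded distance in [0.0, 1.0]: neighbouring labels disagree only partially."""
    span = len(SCALE) - 1
    return abs(SCALE.index(label1) - SCALE.index(label2)) / span


print(binary("low", "high"))              # 1.0
print(ordinal_distance("low", "medium"))  # 0.5
print(ordinal_distance("low", "high"))    # 1.0
```

A two-argument function of this kind is what the constructor's distance argument expects.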
The simplest way to initialize an AnnotationTask is with a list of triples, each containing a coder’s assignment for one object in the task:
task = AnnotationTask(data=[('c1', '1', 'v1'),('c2', '1', 'v1'),...])
Note that the data list needs to contain the same number of triples for each individual coder, containing category values for the same set of items.
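To make the triple format concrete, here is a minimal pure-Python sketch of average pairwise observed agreement under binary comparison. It is an illustration written for this page, not NLTK's implementation, and it assumes every coder labelled every item exactly once; the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def average_observed_agreement(triples):
    """Average pairwise observed agreement over (coder, item, label) triples."""
    by_coder = defaultdict(dict)
    for coder, item, label in triples:
        by_coder[coder][item] = label

    scores = []
    for c1, c2 in combinations(sorted(by_coder), 2):
        items = by_coder[c1].keys()
        agreed = sum(by_coder[c1][i] == by_coder[c2][i] for i in items)
        scores.append(agreed / len(items))
    return sum(scores) / len(scores)


data = [
    ("c1", "1", "v1"), ("c2", "1", "v1"),  # both coders agree on item 1
    ("c1", "2", "v1"), ("c2", "2", "v2"),  # they disagree on item 2
]
print(average_observed_agreement(data))  # 0.5
```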
 Alpha (Krippendorff 1980)
 Kappa (Cohen 1960)
 S (Bennet, Albert and Goldstein 1954)
 Pi (Scott 1955)
TODO: Describe handling of multiple coders and missing data
Expected results from the Artstein and Poesio survey paper:
>>> from nltk.metrics.agreement import AnnotationTask
>>> import os.path
>>> t = AnnotationTask(data=[x.split() for x in open(os.path.join(os.path.dirname(__file__), "artstein_poesio_example.txt"))])
>>> t.avg_Ao()
0.88
>>> t.pi()
0.7995322418977615...
>>> t.S()
0.8199999999999998...
This would have returned a wrong value (0.0) in @785fb79 as coders are in the wrong order. Subsequently, all values for pi(), S(), and kappa() would have been wrong as they are computed with avg_Ao().
>>> t2 = AnnotationTask(data=[('b','1','stat'),('a','1','stat')])
>>> t2.avg_Ao()
1.0
The following, of course, also works.
>>> t3 = AnnotationTask(data=[('a','1','othr'),('b','1','othr')])
>>> t3.avg_Ao()
1.0

class nltk.metrics.agreement.AnnotationTask(data=None, distance=<function binary_distance>)
Bases: object
Represents an annotation task, i.e. people assign labels to items.
Notation tries to match notation in Artstein and Poesio (2007).
In general, coders and items can be represented as any hashable object. Integers, for example, are fine, though strings are more readable. Labels must support the distance functions applied to them, so e.g. a string edit distance makes no sense if your labels are integers, whereas interval distance needs numeric values. A notable case of this is the MASI metric, which requires Python sets.

Do_Kw_pairwise(cA, cB, max_distance=1.0)
The observed disagreement for the weighted kappa coefficient.

Do_alpha()
The observed disagreement for the alpha coefficient.
The alpha coefficient, unlike the other metrics, uses this rather than observed agreement.

N(**kwargs)
Implements the “n-notation” used in Artstein and Poesio (2007).
Deprecated: use Nk, Nik or Nck instead.

load_array(array)
Load a sequence of annotation results, appending to any data already loaded.
 The argument is a sequence of 3-tuples, each representing a coder’s labeling of an item:
 (coder, item, label)

multi_kappa()
Davies and Fleiss 1982. Averages over observed and expected agreements for each coder pair.

unicode_repr
Return repr(self).

nltk.metrics.aline module
ALINE http://webdocs.cs.ualberta.ca/~kondrak/ Copyright 2002 by Grzegorz Kondrak.
ALINE is an algorithm for aligning phonetic sequences, described in [1]. This module is a port of Kondrak’s (2002) ALINE. It provides functions for phonetic sequence alignment and similarity analysis. These are useful in historical linguistics, sociolinguistics and synchronic phonology.
ALINE has parameters that can be tuned for desired output. These parameters are:
 C_skip, C_sub, C_exp, C_vwl
 Salience weights
 Segmental features
In this implementation, some parameters have been changed from their default values as described in [1], in order to replicate published results. All changes are noted in comments.
Example usage
# Get optimal alignment of two phonetic sequences
>>> align('θin', 'tenwis')
[[('θ', 't'), ('i', 'e'), ('n', 'n'), ('', 'w'), ('', 'i'), ('', 's')]]
[1] G. Kondrak. Algorithms for Language Reconstruction. PhD dissertation, University of Toronto.

nltk.metrics.aline.R(p, q)
Return relevant features for segment comparison.
(Kondrak 2002: 54)

nltk.metrics.aline.align(str1, str2, epsilon=0)
Compute the alignment of two phonetic strings.
Parameters:  str1, str2 (str) – Two strings to be aligned
 epsilon (float (0.0 to 1.0)) – Adjusts threshold similarity score for near-optimal alignments
Return type: list(list(tuple(str, str)))
Returns: Alignment(s) of str1 and str2
(Kondrak 2002: 51)

nltk.metrics.aline.delta(p, q)
Return weighted sum of difference between P and Q.
(Kondrak 2002: 54)

nltk.metrics.aline.demo()
A demonstration of the result of aligning phonetic sequences used in Kondrak’s (2002) dissertation.

nltk.metrics.aline.diff(p, q, f)
Returns difference between phonetic segments P and Q for feature F.
(Kondrak 2002: 52, 54)
nltk.metrics.association module
Provides scoring functions for a number of association measures through a generic, abstract implementation in NgramAssocMeasures, and n-specific BigramAssocMeasures and TrigramAssocMeasures.

class nltk.metrics.association.BigramAssocMeasures
Bases: nltk.metrics.association.NgramAssocMeasures
A collection of bigram association measures. Each association measure is provided as a function with three arguments:
bigram_score_fn(n_ii, (n_ix, n_xi), n_xx)
The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:
 n_ii counts (w1, w2), i.e. the bigram being scored
 n_ix counts (w1, *)
 n_xi counts (*, w2)
 n_xx counts (*, *), i.e. any bigram
This may be shown with respect to a contingency table:
        w1    ~w1
     ------ ------
 w2 | n_ii | n_oi | = n_xi
     ------ ------
~w2 | n_io | n_oo |
     ------ ------
     = n_ix        TOTAL = n_xx

classmethod chi_sq(n_ii, n_ix_xi_tuple, n_xx)
Scores bigrams using chi-square, i.e. phi-sq multiplied by the number of bigrams, as in Manning and Schutze 5.3.3.
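The relationship between marginals, contingency cells and the chi-square score can be sketched in plain Python. This is an illustrative reading of the description above, not NLTK's code; the function names and the example counts are made up for the illustration.

```python
def bigram_contingency(n_ii, n_ix_xi, n_xx):
    """Derive the four contingency cells from the bigram marginals."""
    n_ix, n_xi = n_ix_xi
    n_oi = n_xi - n_ii                # w2 preceded by something other than w1
    n_io = n_ix - n_ii                # w1 followed by something other than w2
    n_oo = n_xx - n_ii - n_oi - n_io  # bigrams containing neither word in place
    return n_ii, n_oi, n_io, n_oo


def chi_square(n_ii, n_ix_xi, n_xx):
    """Pearson's chi-square for a 2x2 table: phi-squared times the total count."""
    n_ii, n_oi, n_io, n_oo = bigram_contingency(n_ii, n_ix_xi, n_xx)
    numerator = (n_ii * n_oo - n_io * n_oi) ** 2
    denominator = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    return n_xx * numerator / denominator


# Made-up counts: the bigram occurs 8 times, w1 starts 15 bigrams,
# w2 ends 20 bigrams, and the corpus has 100 bigrams in total.
print(bigram_contingency(8, (15, 20), 100))  # (8, 12, 7, 73)
print(chi_square(8, (15, 20), 100))
```

The sketch does not guard against a zero denominator, which occurs when a row or column of the table is empty.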


class nltk.metrics.association.ContingencyMeasures(measures)
Bases: object
Wraps NgramAssocMeasures classes such that the arguments of association measures are contingency table values rather than marginals.

nltk.metrics.association.NGRAM = 0
Marginals index for the ngram count

class nltk.metrics.association.NgramAssocMeasures
Bases: object
An abstract class defining a collection of generic association measures. Each public method returns a score, taking the following arguments:
score_fn(count_of_ngram, (count_of_n-1gram_1, ..., count_of_n-1gram_j), (count_of_n-2gram_1, ..., count_of_n-2gram_k), ..., (count_of_1gram_1, ..., count_of_1gram_n), count_of_total_words)
See BigramAssocMeasures and TrigramAssocMeasures.
Inheriting classes should define a property _n, and a method _contingency which calculates contingency values from marginals in order for all association measures defined here to be usable.

classmethod chi_sq(*marginals)
Scores ngrams using Pearson’s chi-square as in Manning and Schutze 5.3.3.

classmethod likelihood_ratio(*marginals)
Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4.

static mi_like(*marginals, **kwargs)
Scores ngrams using a variant of mutual information. The keyword argument power sets an exponent (default 3) for the numerator. No logarithm of the result is calculated.
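For bigrams, one plausible reading of this description is the ngram count raised to power, divided by the product of the unigram counts. The sketch below follows that reading; the function name and the counts are hypothetical, and the formula is an assumption drawn from the text above rather than a copy of NLTK's code.

```python
def mi_like_bigram(n_ii, n_ix_xi, n_xx, power=3):
    """MI-like score: n_ii ** power / (n_ix * n_xi), with no logarithm taken.

    n_xx is accepted for signature compatibility but is not used here.
    """
    n_ix, n_xi = n_ix_xi
    return n_ii ** power / (n_ix * n_xi)


# Made-up counts: 8 ** 3 / (15 * 20)
print(mi_like_bigram(8, (15, 20), 100))
# A smaller exponent gives less weight to frequent bigrams.
print(mi_like_bigram(8, (15, 20), 100, power=2))
```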


class nltk.metrics.association.QuadgramAssocMeasures
Bases: nltk.metrics.association.NgramAssocMeasures
A collection of quadgram association measures. Each association measure is provided as a function with five arguments:
quadgram_score_fn(n_iiii, (n_iiix, n_iixi, n_ixii, n_xiii), (n_iixx, n_ixix, n_ixxi, n_xixi, n_xxii, n_xiix), (n_ixxx, n_xixx, n_xxix, n_xxxi), n_all)
The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:
 n_iiii counts (w1, w2, w3, w4), i.e. the quadgram being scored
 n_ixxi counts (w1, *, *, w4)
 n_xxxx counts (*, *, *, *), i.e. any quadgram

nltk.metrics.association.TOTAL = -1
Marginals index for the number of words in the data

class nltk.metrics.association.TrigramAssocMeasures
Bases: nltk.metrics.association.NgramAssocMeasures
A collection of trigram association measures. Each association measure is provided as a function with four arguments:
trigram_score_fn(n_iii, (n_iix, n_ixi, n_xii), (n_ixx, n_xix, n_xxi), n_xxx)
The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:
 n_iii counts (w1, w2, w3), i.e. the trigram being scored
 n_ixx counts (w1, *, *)
 n_xxx counts (*, *, *), i.e. any trigram

nltk.metrics.association.UNIGRAMS = -2
Marginals index for a tuple of each unigram count
nltk.metrics.confusionmatrix module

class nltk.metrics.confusionmatrix.ConfusionMatrix(reference, test, sort_by_count=False)
Bases: object
The confusion matrix between a list of reference values and a corresponding list of test values. Entry [r,t] of this matrix is a count of the number of times that the reference value r corresponds to the test value t. E.g.:
>>> from nltk.metrics import ConfusionMatrix
>>> ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
>>> test = 'DET VB VB DET NN NN NN IN DET NN'.split()
>>> cm = ConfusionMatrix(ref, test)
>>> print(cm['NN', 'NN'])
3
Note that the diagonal entries Ri=Tj of this matrix correspond to correct values, and the off-diagonal entries correspond to incorrect values.
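The counting behind such a matrix can be sketched with the standard library alone. The snippet below is an illustration using collections.Counter on the same tag sequences, not the ConfusionMatrix class itself.

```python
from collections import Counter

ref = "DET NN VB DET JJ NN NN IN DET NN".split()
test = "DET VB VB DET NN NN NN IN DET NN".split()

# Count each (reference, test) pair; key (r, t) plays the role of entry [r, t].
confusion = Counter(zip(ref, test))

print(confusion["NN", "NN"])  # 3, a diagonal entry (correct)
print(confusion["NN", "VB"])  # 1, an off-diagonal entry (NN tagged as VB)

# Summing the diagonal and dividing by the number of items gives the accuracy.
correct = sum(n for (r, t), n in confusion.items() if r == t)
print(correct / len(ref))     # 0.8
```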

pretty_format(show_percents=False, values_in_chart=True, truncate=None, sort_by_count=False)
Returns: A multi-line string representation of this confusion matrix.
Parameters:  truncate (int) – If specified, then only show the specified number of values. Any sorting (e.g., sort_by_count) will be performed before truncation.
 sort_by_count – If true, then sort by the count of each label in the reference data. I.e., labels that occur more frequently in the reference data will be towards the left edge of the matrix, and labels that occur less frequently will be towards the right edge.
TODO: add marginals?

unicode_repr()
Return repr(self).

nltk.metrics.distance module
Distance Metrics.
Compute the distance between two items (usually strings). As metrics, they must satisfy the following three requirements:
 d(a, a) = 0
 d(a, b) >= 0
 d(a, c) <= d(a, b) + d(b, c)
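These requirements can be spot-checked on a small sample. The helper below is a hypothetical illustration (its name and the sample values are not from NLTK); passing on a sample does not prove the properties in general.

```python
from itertools import product


def binary(a, b):
    """0.0 for equal arguments, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


def satisfies_requirements(dist, samples):
    """Check the three listed requirements on every triple drawn from samples."""
    for a, b, c in product(samples, repeat=3):
        if dist(a, a) != 0:
            return False
        if dist(a, b) < 0:
            return False
        if dist(a, c) > dist(a, b) + dist(b, c):
            return False
    return True


print(satisfies_requirements(binary, ["x", "y", "z"]))               # True
# Squared difference violates the triangle inequality: 4 > 1 + 1.
print(satisfies_requirements(lambda a, b: (a - b) ** 2, [0, 1, 2]))  # False
```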

nltk.metrics.distance.binary_distance(label1, label2)
Simple equality test.
0.0 if the labels are identical, 1.0 if they are different.
>>> from nltk.metrics import binary_distance
>>> binary_distance(1,1)
0.0
>>> binary_distance(1,3)
1.0

nltk.metrics.distance.edit_distance(s1, s2, substitution_cost=1, transpositions=False)
Calculate the Levenshtein edit-distance between two strings. The edit distance is the number of characters that need to be substituted, inserted, or deleted, to transform s1 into s2. For example, transforming “rain” to “shine” requires three steps, consisting of two substitutions and one insertion: “rain” -> “sain” -> “shin” -> “shine”. These operations could have been done in other orders, but at least three steps are needed.
Allows specifying the cost of substitution edits (e.g., “a” -> “b”), because sometimes it makes sense to assign greater penalties to substitutions.
This also optionally allows transposition edits (e.g., “ab” -> “ba”), though this is disabled by default.
Parameters:  s1, s2 (str) – The strings to be analysed
 transpositions (bool) – Whether to allow transposition edits
Return type: int
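The “rain” to “shine” example can be reproduced with a short dynamic-programming sketch. This is a textbook Levenshtein routine with a configurable substitution cost, written for illustration; it omits transpositions and is not NLTK's implementation.

```python
def levenshtein(s1, s2, substitution_cost=1):
    """Edit distance using insertions, deletions and substitutions only."""
    previous = list(range(len(s2) + 1))  # distance from "" to each prefix of s2
    for i, ch1 in enumerate(s1, start=1):
        current = [i]                    # distance from s1[:i] to ""
        for j, ch2 in enumerate(s2, start=1):
            substitute = previous[j - 1] + (0 if ch1 == ch2 else substitution_cost)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


print(levenshtein("rain", "shine"))                       # 3
print(levenshtein("rain", "shine", substitution_cost=2))  # 5
```

With a substitution cost of 2, a substitution is no cheaper than a deletion followed by an insertion, which is why the second result rises to 5.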

nltk.metrics.distance.interval_distance(label1, label2)
Krippendorff’s interval distance metric
>>> from nltk.metrics import interval_distance
>>> interval_distance(1,10)
81
Krippendorff 1980, Content Analysis: An Introduction to its Methodology

nltk.metrics.distance.jaccard_distance(label1, label2)
Distance metric comparing set-similarity.
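As a sketch of what comparing set similarity means here, the Jaccard distance is the share of the union that the two sets do not have in common. The function below is an illustration with a hypothetical name, and it assumes at least one of the sets is non-empty.

```python
def jaccard(label1, label2):
    """Jaccard distance: (|union| - |intersection|) / |union|."""
    union = label1 | label2
    intersection = label1 & label2
    return (len(union) - len(intersection)) / len(union)


print(jaccard({1, 2}, {1, 2, 3, 4}))  # 0.5
print(jaccard({1, 2}, {1, 2}))        # 0.0
print(jaccard({1, 2}, {3, 4}))        # 1.0
```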

nltk.metrics.distance.masi_distance(label1, label2)
Distance metric that takes into account partial agreement when multiple labels are assigned.
>>> from nltk.metrics import masi_distance
>>> masi_distance(set([1, 2]), set([1, 2, 3, 4]))
0.665
Passonneau 2006, Measuring Agreement on Set-Valued Items (MASI) for Semantic and Pragmatic Annotation.
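One way to arrive at the 0.665 in the example is to scale the Jaccard similarity by a monotonicity weight. The sketch below assumes weights of 1, 0.67, 0.33 and 0 for identical, subset, overlapping and disjoint sets; those constants are an assumption chosen to reproduce the example value, and the function name is hypothetical.

```python
def masi(label1, label2):
    """MASI-style distance: 1 - Jaccard similarity * monotonicity weight."""
    intersection = len(label1 & label2)
    union = len(label1 | label2)
    if label1 == label2:
        weight = 1.0    # identical sets
    elif intersection == min(len(label1), len(label2)):
        weight = 0.67   # one set contains the other
    elif intersection > 0:
        weight = 0.33   # partial overlap, neither contains the other
    else:
        weight = 0.0    # disjoint sets
    return 1 - intersection / union * weight


print(round(masi({1, 2}, {1, 2, 3, 4}), 3))  # 0.665
```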
nltk.metrics.paice module
Counts Paice’s performance statistics for evaluating stemming algorithms.
What is required:
 A dictionary of words grouped by their real lemmas
 A dictionary of words grouped by stems from a stemming algorithm
When these are given, Understemming Index (UI), Overstemming Index (OI), Stemming Weight (SW) and Error-rate relative to truncation (ERRT) are counted.
References: Chris D. Paice (1994). An evaluation method for stemming algorithms. In Proceedings of SIGIR, 42–50.

class nltk.metrics.paice.Paice(lemmas, stems)
Bases: object
Class for storing lemmas, stems and evaluation metrics.

nltk.metrics.paice.get_words_from_dictionary(lemmas)
Get original set of words used for analysis.
Parameters:  lemmas (dict(str): list(str)) – A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma.
Returns: Set of words that exist as values in the dictionary
Return type: set(str)
nltk.metrics.scores module

nltk.metrics.scores.accuracy(reference, test)
Given a list of reference values and a corresponding list of test values, return the fraction of corresponding values that are equal. In particular, return the fraction of indices 0<=i<len(test) such that test[i] == reference[i].
Parameters:  reference (list) – An ordered list of reference values.
 test (list) – A list of values to compare against the corresponding reference values.
Raises: ValueError – If reference and test do not have the same length.
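The definition translates almost directly into code. The following is a minimal sketch under that definition (the function name is hypothetical, and non-empty lists are assumed).

```python
def accuracy_sketch(reference, test):
    """Fraction of positions i where test[i] == reference[i]."""
    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    matches = sum(r == t for r, t in zip(reference, test))
    return matches / len(test)


reference = "DET NN VB DET JJ NN NN IN DET NN".split()
test = "DET VB VB DET NN NN NN IN DET NN".split()
print(accuracy_sketch(reference, test))  # 0.8
```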

nltk.metrics.scores.approxrand(a, b, **kwargs)
Returns an approximate significance level between two lists of independently generated test values.
Approximate randomization calculates significance by randomly drawing from a sample of the possible permutations. At the limit of the number of possible permutations, the significance level is exact. The approximate significance level is the sample mean number of times the statistic of the permuted lists varies from the actual statistic of the unpermuted argument lists.
Returns: a tuple containing an approximate significance level, the count of the number of times the pseudo-statistic varied from the actual statistic, and the number of shuffles
Return type: tuple
Parameters:  a (list) – a list of test values
 b (list) – another list of independently generated test values

nltk.metrics.scores.f_measure(reference, test, alpha=0.5)
Given a set of reference values and a set of test values, return the f-measure of the test values, when compared against the reference values. The f-measure is the harmonic mean of the precision and recall, weighted by alpha. In particular, given the precision p and recall r defined by:
 p = card(reference intersection test)/card(test)
 r = card(reference intersection test)/card(reference)
The f-measure is:
 1/(alpha/p + (1-alpha)/r)
If either reference or test is empty, then f_measure returns None.
Parameters:  reference (set) – A set of reference values.
 test (set) – A set of values to compare against the reference set.
Return type: float or None
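The formulas for p, r and the f-measure can be worked through on a small example. The sketch below uses hypothetical function names; returning 0.0 when the sets do not overlap is an assumption made here to avoid dividing by zero.

```python
def precision_sketch(reference, test):
    """card(reference intersection test) / card(test), or None if test is empty."""
    return len(reference & test) / len(test) if test else None


def recall_sketch(reference, test):
    """card(reference intersection test) / card(reference), or None if reference is empty."""
    return len(reference & test) / len(reference) if reference else None


def f_measure_sketch(reference, test, alpha=0.5):
    """Weighted harmonic mean: 1 / (alpha/p + (1-alpha)/r)."""
    p = precision_sketch(reference, test)
    r = recall_sketch(reference, test)
    if p is None or r is None:
        return None
    if p == 0 or r == 0:
        return 0.0  # assumption: no overlap is scored as zero
    return 1.0 / (alpha / p + (1 - alpha) / r)


reference = {"a", "b", "c", "d"}
test = {"c", "d", "e"}
print(precision_sketch(reference, test))  # 2 of 3 test values are in the reference
print(recall_sketch(reference, test))     # 0.5
print(f_measure_sketch(reference, test))  # 1 / (0.75 + 1.0), about 0.571
```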

nltk.metrics.scores.log_likelihood(reference, test)
Given a list of reference values and a corresponding list of test probability distributions, return the average log likelihood of the reference values, given the probability distributions.
Parameters:  reference (list) – A list of reference values
 test (list(ProbDistI)) – A list of probability distributions over values to compare against the corresponding reference values.

nltk.metrics.scores.precision(reference, test)
Given a set of reference values and a set of test values, return the fraction of test values that appear in the reference set. In particular, return card(reference intersection test)/card(test). If test is empty, then return None.
Parameters:  reference (set) – A set of reference values.
 test (set) – A set of values to compare against the reference set.
Return type: float or None

nltk.metrics.scores.recall(reference, test)
Given a set of reference values and a set of test values, return the fraction of reference values that appear in the test set. In particular, return card(reference intersection test)/card(reference). If reference is empty, then return None.
Parameters:  reference (set) – A set of reference values.
 test (set) – A set of values to compare against the reference set.
Return type: float or None
nltk.metrics.segmentation module
Text Segmentation Metrics
 Windowdiff
Pevzner, L., and Hearst, M., A Critique and Improvement of an Evaluation Metric for Text Segmentation, Computational Linguistics 28, 19-36
 Generalized Hamming Distance
Bookstein A., Kulyukin V.A., Raita T. Generalized Hamming Distance. Information Retrieval 5, 2002, pp 353-375
Baseline implementation in C++ http://digital.cs.usu.edu/~vkulyukin/vkweb/software/ghd/ghd.html
Study describing benefits of Generalized Hamming Distance Versus WindowDiff for evaluating text segmentation tasks: Bestgen, Y. Quel indice pour mesurer l’efficacite en segmentation de textes ? TALN 2009
 Pk text segmentation metric
Beeferman D., Berger A., Lafferty J. (1999) Statistical Models for Text Segmentation. Machine Learning, 34, 177-210

nltk.metrics.segmentation.ghd(ref, hyp, ins_cost=2.0, del_cost=2.0, shift_cost_coeff=1.0, boundary='1')
Compute the Generalized Hamming Distance for a reference and a hypothetical segmentation, corresponding to the cost related to the transformation of the hypothetical segmentation into the reference segmentation through boundary insertion, deletion and shift operations.
A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.
Recommended parameter values are a shift_cost_coeff of 2, combined with an ins_cost and del_cost equal to the mean segment length in the reference segmentation.
>>> # Same examples as Kulyukin C++ implementation
>>> ghd('1100100000', '1100010000', 1.0, 1.0, 0.5)
0.5
>>> ghd('1100100000', '1100000001', 1.0, 1.0, 0.5)
2.0
>>> ghd('011', '110', 1.0, 1.0, 0.5)
1.0
>>> ghd('1', '0', 1.0, 1.0, 0.5)
1.0
>>> ghd('111', '000', 1.0, 1.0, 0.5)
3.0
>>> ghd('000', '111', 1.0, 2.0, 0.5)
6.0
Parameters:  ref (str or list) – the reference segmentation
 hyp (str or list) – the hypothetical segmentation
 ins_cost (float) – insertion cost
 del_cost (float) – deletion cost
 shift_cost_coeff (float) – constant used to compute the cost of a shift. shift cost = shift_cost_coeff * |i - j| where i and j are the positions indicating the shift
 boundary (str or int or bool) – boundary value
Return type: float

nltk.metrics.segmentation.pk(ref, hyp, k=None, boundary='1')
Compute the Pk metric for a pair of segmentations. A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.
>>> '%.2f' % pk('0100'*100, '1'*400, 2)
'0.50'
>>> '%.2f' % pk('0100'*100, '0'*400, 2)
'0.50'
>>> '%.2f' % pk('0100'*100, '0100'*100, 2)
'0.00'
Parameters:  ref (str or list) – the reference segmentation
 hyp (str or list) – the segmentation to evaluate
 k – window size, if None, set to half of the average reference segment length
 boundary (str or int or bool) – boundary value
Return type: float
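A sliding-window reading of Pk can be sketched as follows: for every window of width k, check whether the reference and the hypothesis agree on containing a boundary, and report the fraction of windows where they disagree. This is an illustrative sketch with a hypothetical name, assuming both segmentations have the same length.

```python
def pk_sketch(ref, hyp, k=None, boundary="1"):
    """Fraction of width-k windows where ref and hyp disagree on having a boundary."""
    if k is None:
        # Half of the average reference segment length.
        k = int(round(len(ref) / (ref.count(boundary) * 2.0)))
    windows = len(ref) - k + 1
    errors = 0
    for i in range(windows):
        ref_has_boundary = boundary in ref[i:i + k]
        hyp_has_boundary = boundary in hyp[i:i + k]
        errors += ref_has_boundary != hyp_has_boundary
    return errors / windows


print("%.2f" % pk_sketch("0100" * 100, "1" * 400, 2))       # 0.50
print("%.2f" % pk_sketch("0100" * 100, "0100" * 100, 2))    # 0.00
```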

nltk.metrics.segmentation.windowdiff(seg1, seg2, k, boundary='1', weighted=False)
Compute the windowdiff score for a pair of segmentations. A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.
>>> s1 = "000100000010"
>>> s2 = "000010000100"
>>> s3 = "100000010000"
>>> '%.2f' % windowdiff(s1, s1, 3)
'0.00'
>>> '%.2f' % windowdiff(s1, s2, 3)
'0.30'
>>> '%.2f' % windowdiff(s2, s3, 3)
'0.80'
Parameters:  seg1 (str or list) – a segmentation
 seg2 (str or list) – a segmentation
 k (int) – window width
 boundary (str or int or bool) – boundary value
 weighted (boolean) – use the weighted variant of windowdiff
Return type: float
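The windowdiff idea, comparing boundary counts inside each window of width k, can be sketched in a few lines. The function below is an illustration with a hypothetical name rather than NLTK's code; it reproduces the first two example scores.

```python
def windowdiff_sketch(seg1, seg2, k, boundary="1", weighted=False):
    """Average per-window penalty for differing boundary counts."""
    if len(seg1) != len(seg2):
        raise ValueError("Segmentations have unequal length")
    windows = len(seg1) - k + 1
    total = 0
    for i in range(windows):
        difference = abs(seg1[i:i + k].count(boundary) - seg2[i:i + k].count(boundary))
        # Unweighted: any difference costs 1. Weighted: cost is the count difference.
        total += difference if weighted else min(1, difference)
    return total / windows


s1 = "000100000010"
s2 = "000010000100"
print("%.2f" % windowdiff_sketch(s1, s1, 3))  # 0.00
print("%.2f" % windowdiff_sketch(s1, s2, 3))  # 0.30
```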
nltk.metrics.spearman module

nltk.metrics.spearman.ranks_from_scores(scores, rank_gap=1e-15)
Given a sequence of (key, score) tuples, yields each key with an increasing rank, tying with previous key’s rank if the difference between their scores is less than rank_gap. Suitable for use as an argument to spearman_correlation.

nltk.metrics.spearman.ranks_from_sequence(seq)
Given a sequence, yields each element with an increasing rank, suitable for use as an argument to spearman_correlation.

nltk.metrics.spearman.spearman_correlation(ranks1, ranks2)
Returns the Spearman correlation coefficient for two rankings, which should be dicts or sequences of (key, rank). The coefficient ranges from -1.0 (ranks are opposite) to 1.0 (ranks are identical), and is only calculated for keys in both rankings (for meaningful results, remove keys present in only one list before ranking).
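The range of the coefficient can be seen with the classic rank-difference formula, 1 - 6 * sum(d^2) / (n * (n^2 - 1)). The sketch below assumes dict inputs and untied ranks; the function name and the handling of fewer than two shared keys are assumptions made for this illustration.

```python
def spearman_sketch(ranks1, ranks2):
    """Spearman's rho for two {key: rank} dicts, using only the shared keys."""
    shared = ranks1.keys() & ranks2.keys()
    n = len(shared)
    if n < 2:
        return 0.0  # assumption: too few shared keys to define a correlation
    d_squared = sum((ranks1[key] - ranks2[key]) ** 2 for key in shared)
    return 1 - 6 * d_squared / (n * (n * n - 1))


forward = {"x": 0, "y": 1, "z": 2}
backward = {"x": 2, "y": 1, "z": 0}
print(spearman_sketch(forward, forward))   # 1.0
print(spearman_sketch(forward, backward))  # -1.0
```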