nltk.metrics.segmentation module¶
Text Segmentation Metrics
Windowdiff
- Pevzner, L., and Hearst, M., A Critique and Improvement of
an Evaluation Metric for Text Segmentation, Computational Linguistics 28, 19-36
Generalized Hamming Distance
Bookstein A., Kulyukin V.A., Raita T. Generalized Hamming Distance Information Retrieval 5, 2002, pp 353-375
Baseline implementation in C++ http://digital.cs.usu.edu/~vkulyukin/vkweb/software/ghd/ghd.html
Study describing benefits of Generalized Hamming Distance Versus WindowDiff for evaluating text segmentation tasks Begsten, Y. Quel indice pour mesurer l’efficacite en segmentation de textes ? TALN 2009
Pk text segmentation metric
Beeferman D., Berger A., Lafferty J. (1999) Statistical Models for Text Segmentation Machine Learning, 34, 177-210
- nltk.metrics.segmentation.ghd(ref, hyp, ins_cost=2.0, del_cost=2.0, shift_cost_coeff=1.0, boundary='1')[source]¶
Compute the Generalized Hamming Distance for a reference and a hypothetical segmentation, corresponding to the cost related to the transformation of the hypothetical segmentation into the reference segmentation through boundary insertion, deletion and shift operations.
A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.
Recommended parameter values are a shift_cost_coeff of 2. Associated with a ins_cost, and del_cost equal to the mean segment length in the reference segmentation.
>>> # Same examples as Kulyukin C++ implementation >>> ghd('1100100000', '1100010000', 1.0, 1.0, 0.5) 0.5 >>> ghd('1100100000', '1100000001', 1.0, 1.0, 0.5) 2.0 >>> ghd('011', '110', 1.0, 1.0, 0.5) 1.0 >>> ghd('1', '0', 1.0, 1.0, 0.5) 1.0 >>> ghd('111', '000', 1.0, 1.0, 0.5) 3.0 >>> ghd('000', '111', 1.0, 2.0, 0.5) 6.0
- Parameters:
ref (str or list) – the reference segmentation
hyp (str or list) – the hypothetical segmentation
ins_cost (float) – insertion cost
del_cost (float) – deletion cost
shift_cost_coeff (float) – constant used to compute the cost of a shift.
shift cost = shift_cost_coeff * |i - j|
wherei
andj
are the positions indicating the shiftboundary (str or int or bool) – boundary value
- Return type:
float
- nltk.metrics.segmentation.pk(ref, hyp, k=None, boundary='1')[source]¶
Compute the Pk metric for a pair of segmentations A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.
>>> '%.2f' % pk('0100'*100, '1'*400, 2) '0.50' >>> '%.2f' % pk('0100'*100, '0'*400, 2) '0.50' >>> '%.2f' % pk('0100'*100, '0100'*100, 2) '0.00'
- Parameters:
ref (str or list) – the reference segmentation
hyp (str or list) – the segmentation to evaluate
k – window size, if None, set to half of the average reference segment length
boundary (str or int or bool) – boundary value
- Return type:
float
- nltk.metrics.segmentation.windowdiff(seg1, seg2, k, boundary='1', weighted=False)[source]¶
Compute the windowdiff score for a pair of segmentations. A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.
>>> s1 = "000100000010" >>> s2 = "000010000100" >>> s3 = "100000010000" >>> '%.2f' % windowdiff(s1, s1, 3) '0.00' >>> '%.2f' % windowdiff(s1, s2, 3) '0.30' >>> '%.2f' % windowdiff(s2, s3, 3) '0.80'
- Parameters:
seg1 (str or list) – a segmentation
seg2 (str or list) – a segmentation
k (int) – window width
boundary (str or int or bool) – boundary value
weighted (boolean) – use the weighted variant of windowdiff
- Return type:
float