nltk.metrics.segmentation module

Text Segmentation Metrics

  1. Windowdiff

Pevzner, L., and Hearst, M., A Critique and Improvement of an Evaluation Metric for Text Segmentation, Computational Linguistics, 28 (1), 2002, pp. 19-36

  2. Generalized Hamming Distance

Bookstein, A., Kulyukin, V. A., and Raita, T., Generalized Hamming Distance, Information Retrieval, 5, 2002, pp. 353-375

Baseline implementation in C++: http://digital.cs.usu.edu/~vkulyukin/vkweb/software/ghd/ghd.html

Study describing the benefits of Generalized Hamming Distance versus WindowDiff for evaluating text segmentation tasks: Bestgen, Y., Quel indice pour mesurer l’efficacité en segmentation de textes ?, TALN 2009

  3. Pk text segmentation metric

Beeferman, D., Berger, A., and Lafferty, J. (1999), Statistical Models for Text Segmentation, Machine Learning, 34, 177-210

nltk.metrics.segmentation.ghd(ref, hyp, ins_cost=2.0, del_cost=2.0, shift_cost_coeff=1.0, boundary='1')

Compute the Generalized Hamming Distance for a reference and a hypothetical segmentation, i.e. the cost of transforming the hypothetical segmentation into the reference segmentation through boundary insertion, deletion and shift operations.

A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.

Recommended parameter values are a shift_cost_coeff of 2, with ins_cost and del_cost equal to the mean segment length in the reference segmentation.

>>> # Same examples as Kulyukin C++ implementation
>>> ghd('1100100000', '1100010000', 1.0, 1.0, 0.5)
0.5
>>> ghd('1100100000', '1100000001', 1.0, 1.0, 0.5)
2.0
>>> ghd('011', '110', 1.0, 1.0, 0.5)
1.0
>>> ghd('1', '0', 1.0, 1.0, 0.5)
1.0
>>> ghd('111', '000', 1.0, 1.0, 0.5)
3.0
>>> ghd('000', '111', 1.0, 2.0, 0.5)
6.0
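
For instance, in the first example above the hypothesis differs from the reference only by a boundary shifted by one position, so the cost is shift_cost_coeff * |5 - 4| = 0.5; in the last example, the hypothesis contains three boundaries that are absent from the reference, each removed at del_cost = 2.0, for a total of 6.0.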
Parameters:
  • ref (str or list) – the reference segmentation

  • hyp (str or list) – the hypothetical segmentation

  • ins_cost (float) – insertion cost

  • del_cost (float) – deletion cost

  • shift_cost_coeff (float) – constant used to compute the cost of a shift: shift cost = shift_cost_coeff * |i - j|, where i and j are the positions indicating the shift

  • boundary (str or int or bool) – boundary value

Return type:

float
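
As an illustration of the recommended settings above, here is a minimal sketch; the segmentation strings are hypothetical, and the convention that each boundary marks the start of a segment (so the mean segment length is the sequence length divided by the number of boundaries) is an assumption of the example, not part of the API.

>>> from nltk.metrics.segmentation import ghd
>>> ref = '100010010000'  # hypothetical reference: boundaries at 0, 4, 7
>>> hyp = '100100010000'  # hypothetical hypothesis: boundaries at 0, 3, 7
>>> mean_seg_len = len(ref) / ref.count('1')  # 12 / 3 = 4.0
>>> # recommended setting: ins_cost = del_cost = mean segment length, shift_cost_coeff = 2
>>> print(ghd(ref, hyp, ins_cost=mean_seg_len, del_cost=mean_seg_len, shift_cost_coeff=2.0))
2.0

Here the only difference is one boundary shifted by one position, so the expected cost is shift_cost_coeff * |4 - 3| = 2.0, well below the cost of a deletion plus an insertion (8.0).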

nltk.metrics.segmentation.pk(ref, hyp, k=None, boundary='1')

Compute the Pk metric for a pair of segmentations. A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.

>>> '%.2f' % pk('0100'*100, '1'*400, 2)
'0.50'
>>> '%.2f' % pk('0100'*100, '0'*400, 2)
'0.50'
>>> '%.2f' % pk('0100'*100, '0100'*100, 2)
'0.00'
Parameters:
  • ref (str or list) – the reference segmentation

  • hyp (str or list) – the segmentation to evaluate

  • k – window size; if None, set to half of the average reference segment length

  • boundary (str or int or bool) – boundary value

Return type:

float
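
For intuition, here is a minimal sketch of one common formulation of the Pk computation (the helper name pk_sketch is hypothetical, and this is an illustration rather than NLTK's actual implementation): slide a window of width k over both segmentations and count the windows in which the reference and the hypothesis disagree on whether a boundary occurs. The sketch reproduces the doctest value above.

>>> def pk_sketch(ref, hyp, k, boundary='1'):
...     # fraction of length-k windows where ref and hyp disagree on
...     # whether the window contains a boundary
...     n_windows = len(ref) - k + 1
...     errors = sum(
...         (boundary in ref[i:i + k]) != (boundary in hyp[i:i + k])
...         for i in range(n_windows)
...     )
...     return errors / n_windows
>>> '%.2f' % pk_sketch('0100' * 100, '1' * 400, 2)
'0.50'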

nltk.metrics.segmentation.windowdiff(seg1, seg2, k, boundary='1', weighted=False)

Compute the windowdiff score for a pair of segmentations. A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.

>>> s1 = "000100000010"
>>> s2 = "000010000100"
>>> s3 = "100000010000"
>>> '%.2f' % windowdiff(s1, s1, 3)
'0.00'
>>> '%.2f' % windowdiff(s1, s2, 3)
'0.30'
>>> '%.2f' % windowdiff(s2, s3, 3)
'0.80'
Parameters:
  • seg1 (str or list) – a segmentation

  • seg2 (str or list) – a segmentation

  • k (int) – window width

  • boundary (str or int or bool) – boundary value

  • weighted (boolean) – use the weighted variant of windowdiff

Return type:

float
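
Similarly, a minimal sketch of the WindowDiff computation for intuition (the name windowdiff_sketch is hypothetical; this is an illustration rather than NLTK's actual implementation): count the length-k windows in which the two segmentations contain different numbers of boundaries, where the weighted variant adds the absolute difference in counts instead of a flat penalty of 1. The sketch reproduces the doctest value above.

>>> def windowdiff_sketch(seg1, seg2, k, boundary='1', weighted=False):
...     # fraction of length-k windows whose boundary counts differ
...     n_windows = len(seg1) - k + 1
...     total = 0
...     for i in range(n_windows):
...         ndiff = abs(seg1[i:i + k].count(boundary) - seg2[i:i + k].count(boundary))
...         total += ndiff if weighted else min(1, ndiff)
...     return total / n_windows
>>> '%.2f' % windowdiff_sketch('000100000010', '000010000100', 3)
'0.30'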