nltk.metrics.paice module

Computes Paice’s performance statistics for evaluating stemming algorithms.

What is required:
  • A dictionary of words grouped by their real lemmas

  • A dictionary of words grouped by stems from a stemming algorithm

Given these two dictionaries, the module computes the Understemming Index (UI), Overstemming Index (OI), Stemming Weight (SW) and Error-Rate Relative to Truncation (ERRT).
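As a rough illustration of the first three indices, the sketch below works directly from the two dictionaries, following the pair-counting definitions in Paice (1994). The function name `paice_indexes` and the sample data are hypothetical, and this is not the module's own code. ERRT is left out because it additionally needs a truncation baseline, and reporting an undefined SW as NaN is a choice made only for this sketch.

```python
import math


def paice_indexes(lemmas, stems):
    """Return (UI, OI, SW) for words grouped by lemma and by stem.

    Hypothetical helper for illustration; not part of nltk.metrics.paice.
    """
    total_words = sum(len(set(words)) for words in lemmas.values())
    gdmt = gdnt = gumt = gwmt = 0.0
    for lemma_words in lemmas.values():
        lemma_words = set(lemma_words)
        size = len(lemma_words)
        # Desired merges: pairs of words inside the same lemma group.
        gdmt += size * (size - 1)
        # Desired non-merges: one word inside the group, one outside it.
        gdnt += size * (total_words - size)
        for stem_words in stems.values():
            stem_words = set(stem_words)
            shared = len(lemma_words & stem_words)
            if shared:
                # Unachieved merges: same lemma, different stems.
                gumt += shared * (size - shared)
                # Wrong merges: same stem, different lemmas.
                gwmt += shared * (len(stem_words) - shared)
    # Every pair was counted twice, so the totals are halved.
    gdmt, gdnt, gumt, gwmt = (t / 2 for t in (gdmt, gdnt, gumt, gwmt))
    ui = gumt / gdmt if gdmt else 0.0
    oi = gwmt / gdnt if gdnt else 0.0
    # Assumption of this sketch: an undefined ratio is reported as NaN.
    sw = oi / ui if ui else math.nan
    return ui, oi, sw


# Small made-up example: seven words, three lemmas, five stems.
lemmas = {
    "kneel": ["kneel", "knelt"],
    "range": ["range", "ranged"],
    "ring": ["ring", "rang", "rung"],
}
stems = {
    "kneel": ["kneel"],
    "knelt": ["knelt"],
    "rang": ["rang", "range", "ranged"],
    "ring": ["ring"],
    "rung": ["rung"],
}
ui, oi, sw = paice_indexes(lemmas, stems)
print(ui, oi, sw)
```

In this example four of the five desired merges are missed (UI = 4/5) and two of the sixteen desired non-merges are wrongly merged (OI = 2/16), so the stemmer errs on the side of understemming.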

References: Chris D. Paice (1994). An evaluation method for stemming algorithms. In Proceedings of SIGIR, 42–50.

class nltk.metrics.paice.Paice

Bases: object

Class for storing lemmas, stems and evaluation metrics.

__init__(lemmas, stems)
Parameters:
  • lemmas (dict(str): list(str)) – A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma.

  • stems (dict(str): set(str)) – A dictionary where keys are stems and values are sets or lists of words corresponding to that stem.
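One way to build the `stems` argument is to run a stemmer over the word list and group the words by the stem each one receives. The following is a minimal sketch, assuming a toy stemmer that keeps only the first four characters. Both `truncate_stem` and `group_by_stem` are hypothetical names, and any callable that maps a word to its stem could be substituted.

```python
from collections import defaultdict


def truncate_stem(word, length=4):
    """Toy stemmer (assumption for illustration): keep the first `length` characters."""
    return word[:length]


def group_by_stem(words, stemmer):
    """Group words into a {stem: set of words} dictionary."""
    groups = defaultdict(set)
    for word in words:
        groups[stemmer(word)].add(word)
    return dict(groups)


words = ["kneel", "knelt", "range", "ranged", "ring", "rang", "rung"]
stems = group_by_stem(words, truncate_stem)
for stem, group in sorted(stems.items()):
    print(stem, sorted(group))
```

The resulting dictionary has the shape described for `stems`: each key is a stem and each value is the set of words that were reduced to it.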

update()

Update statistics after lemmas and stems have been set.

nltk.metrics.paice.demo()

Demonstration of the module.

nltk.metrics.paice.get_words_from_dictionary(lemmas)

Get the original set of words used for analysis.

Parameters:

lemmas (dict(str): list(str)) – A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma.

Returns:

Set of words that exist as values in the dictionary

Return type:

set(str)
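The behaviour described here amounts to taking the union of all the value collections in the dictionary. A minimal equivalent sketch in plain Python, using the hypothetical name `words_from_groups` rather than the library function itself:

```python
def words_from_groups(groups):
    """Return the union of all word collections stored as dictionary values."""
    words = set()
    for group in groups.values():
        # Values may be lists or sets; set.update accepts either.
        words.update(group)
    return words


lemmas = {"kneel": ["kneel", "knelt"], "ring": {"ring", "rang", "rung"}}
all_words = words_from_groups(lemmas)
print(sorted(all_words))
```

The keys themselves are not added to the result, so a lemma appears only if it is also listed among the words of some group.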