nltk.metrics.paice module

Computes Paice’s performance statistics for evaluating stemming algorithms.

What is required:
  • A dictionary of words grouped by their real lemmas

  • A dictionary of words grouped by stems from a stemming algorithm

Given these two dictionaries, the module computes the Understemming Index (UI), Overstemming Index (OI), Stemming Weight (SW) and Error-Rate Relative to Truncation (ERRT).
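As a rough illustration of how the first three indices relate to the two dictionaries, the sketch below counts word pairs in plain Python, following the pair-counting definitions in Paice (1994). The function name `paice_indices` and the sample data are hypothetical and are not part of NLTK. ERRT is left out because it also needs truncation baselines.

```python
def paice_indices(lemmas, stems):
    """Return (UI, OI, SW) for words grouped by lemma and by stem.

    Hypothetical helper for illustration; not part of nltk.metrics.paice.
    """
    total_words = sum(len(set(words)) for words in lemmas.values())
    # Every total below counts each word pair twice; the factor of two
    # cancels when the ratios are taken.
    gumt = gdmt = gwmt = gdnt = 0.0
    for lemma_words in lemmas.values():
        group = set(lemma_words)
        n = len(group)
        gdmt += n * (n - 1)            # pairs that should be merged
        gdnt += n * (total_words - n)  # pairs that should stay apart
        for stem_words in stems.values():
            stem_group = set(stem_words)
            shared = len(group & stem_group)
            if shared:
                # Same lemma but different stem: unachieved merges.
                gumt += shared * (n - shared)
                # Same stem but different lemma: wrong merges.
                gwmt += shared * (len(stem_group) - shared)
    ui = gumt / gdmt if gdmt else 0.0
    oi = gwmt / gdnt if gdnt else 0.0
    sw = oi / ui if ui else float("nan")
    return ui, oi, sw


# Sample data (assumed for this example).
lemmas = {
    "kneel": ["kneel", "knelt"],
    "range": ["range", "ranged"],
    "ring": ["ring", "rang", "rung"],
}
stems = {
    "kneel": ["kneel"],
    "knelt": ["knelt"],
    "rang": ["rang", "range", "ranged"],
    "ring": ["ring"],
    "rung": ["rung"],
}

ui, oi, sw = paice_indices(lemmas, stems)
print(ui, oi)  # 0.8 0.125
```

Here the stemmer fails to merge four of the five same-lemma pairs (UI = 0.8) and wrongly merges two of the sixteen cross-lemma pairs (OI = 0.125), so the stemming weight OI / UI is roughly 0.156.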

References: Chris D. Paice (1994). An evaluation method for stemming algorithms. In Proceedings of SIGIR, 42–50.

class nltk.metrics.paice.Paice

Bases: object

Class for storing lemmas, stems and evaluation metrics.

__init__(lemmas, stems)
Parameters
  • lemmas (dict(str): list(str)) – A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma.

  • stems (dict(str): set(str)) – A dictionary where keys are stems and values are sets or lists of words corresponding to that stem.
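Both arguments can be assembled from any word list once each word's gold lemma and stemmer output are known. The following is a minimal sketch of that grouping step. The names `gold_lemma`, `toy_stem` and `group_by` are hypothetical, introduced only for illustration, and the truncating stemmer merely stands in for a real one.

```python
from collections import defaultdict


def group_by(words, key):
    """Group words into a dict mapping key(word) to a set of words."""
    groups = defaultdict(set)
    for word in words:
        groups[key(word)].add(word)
    return dict(groups)


# Hypothetical gold-standard lemma for each word.
gold_lemma = {
    "kneel": "kneel", "knelt": "kneel",
    "range": "range", "ranged": "range",
    "ring": "ring", "rang": "ring", "rung": "ring",
}


def toy_stem(word):
    """Stand-in stemmer (assumption): keep at most the first four characters."""
    return word[:4]


words = sorted(gold_lemma)
lemmas = group_by(words, gold_lemma.get)  # keys are lemmas
stems = group_by(words, toy_stem)         # keys are stems
```

Dictionaries of this shape can then be passed as `Paice(lemmas, stems)`.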

update()

Update statistics after lemmas and stems have been set.

nltk.metrics.paice.demo()

Demonstration of the module.

nltk.metrics.paice.get_words_from_dictionary(lemmas)

Get the original set of words used for analysis.

Parameters
  • lemmas (dict(str): list(str)) – A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma.

Returns
  Set of words that exist as values in the dictionary

Return type
  set(str)
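Going by the description above, the returned set is the union of all value collections in `lemmas`. A plain-Python sketch of equivalent behaviour follows. `words_from_lemmas` is a hypothetical stand-in written for illustration, not the library's own implementation.

```python
def words_from_lemmas(lemmas):
    """Return every word that appears in any lemma group (illustrative only)."""
    words = set()
    for group in lemmas.values():
        words.update(group)  # accepts lists and sets alike
    return words


lemmas = {
    "kneel": ["kneel", "knelt"],
    "range": ["range", "ranged"],
    "ring": {"ring", "rang", "rung"},
}
words = words_from_lemmas(lemmas)
print(sorted(words))
# ['kneel', 'knelt', 'rang', 'range', 'ranged', 'ring', 'rung']
```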