nltk.classify.rte_classify module

Simple classifier for RTE corpus.

It calculates the overlap in words and named entities between the text and the hypothesis. It also checks whether the hypothesis contains words or named entities that do not occur in the text, since this indicates that the hypothesis is more informative than (i.e., not entailed by) the text.

TO DO: better Named Entity classification

TO DO: add lemmatization

class nltk.classify.rte_classify.RTEFeatureExtractor[source]

Bases: object

This builds a bag of words for both the text and the hypothesis after throwing away some stopwords, then calculates overlap and difference.
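The bag-of-words comparison described above can be sketched with plain Python sets. This is a minimal illustration, not the module's code: the stopword list, the whitespace tokenization, and the helper name `bag_of_words` are all assumptions made for the example.

```python
# Illustrative stopword list (an assumption; the module's own list may differ).
STOPWORDS = {"a", "the", "in", "of", "to", "is"}


def bag_of_words(sentence, stop=True):
    """Return the set of word types in `sentence`, optionally minus stopwords."""
    words = set(sentence.split())  # naive whitespace tokenization
    return words - STOPWORDS if stop else words


text = "Mary bought a red car in Paris"
hyp = "Mary bought a car in London"

text_words = bag_of_words(text)
hyp_words = bag_of_words(hyp)

overlap = hyp_words & text_words    # tokens shared by hypothesis and text
hyp_extra = hyp_words - text_words  # tokens that occur only in the hypothesis

print(sorted(overlap))    # ['Mary', 'bought', 'car']
print(sorted(hyp_extra))  # ['London']
```

A large overlap with little extra material in the hypothesis points towards entailment; here "London" is new information that the text does not support.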

__init__(rtepair, stop=True, use_lemmatize=False)[source]
Parameters:
  • rtepair – an RTEPair from which features should be extracted

  • stop (bool) – if True, stopwords are thrown away.

hyp_extra(toktype, debug=True)[source]

Compute the extraneous material in the hypothesis, i.e. the tokens that occur in the hypothesis but not in the text.

Parameters:
  • toktype ('ne' or 'word') – selects which tokens to return: Named Entities ('ne') or ordinary words ('word')
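How a toktype argument can partition a set of tokens is sketched below. The capitalization test used for named entities is an assumption chosen for illustration (the module notes that its Named Entity classification is still to be improved), and `is_named_entity` and `split_by_toktype` are hypothetical helper names.

```python
def is_named_entity(token):
    # Assumption: a crude capitalization test stands in for real NE recognition.
    return token.istitle() or token.isupper()


def split_by_toktype(tokens, toktype):
    """Return the named entities ('ne') or the ordinary words ('word') in `tokens`."""
    entities = {t for t in tokens if is_named_entity(t)}
    if toktype == "ne":
        return entities
    if toktype == "word":
        return set(tokens) - entities
    raise ValueError(f"Unrecognized toktype: {toktype!r}")


# Tokens assumed to occur only in some hypothesis, not in its text.
hyp_only = {"London", "NATO", "quickly"}

print(sorted(split_by_toktype(hyp_only, "ne")))    # ['London', 'NATO']
print(sorted(split_by_toktype(hyp_only, "word")))  # ['quickly']
```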

overlap(toktype, debug=False)[source]

Compute the overlap between text and hypothesis.

Parameters:
  • toktype ('ne' or 'word') – selects which tokens to return: Named Entities ('ne') or ordinary words ('word')
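A possible way to call the extractor is shown below. Real pairs would normally come from the RTE corpus; the `SimpleNamespace` object is a hypothetical stand-in, and the sketch assumes that the extractor only reads the `text` and `hyp` attributes of the pair.

```python
from types import SimpleNamespace

from nltk.classify.rte_classify import RTEFeatureExtractor

# Hypothetical stand-in for an RTEPair (assumption: only .text and .hyp are read).
pair = SimpleNamespace(
    text="Mary bought a red car in Paris",
    hyp="Mary bought a car in London",
)

extractor = RTEFeatureExtractor(pair, stop=True)

# Tokens shared by text and hypothesis, split by token type.
shared_entities = extractor.overlap("ne")
shared_words = extractor.overlap("word")

# Tokens that occur only in the hypothesis, split by token type.
extra_entities = extractor.hyp_extra("ne")
extra_words = extractor.hyp_extra("word")

print("shared entities:", sorted(shared_entities))
print("shared words:   ", sorted(shared_words))
print("extra entities: ", sorted(extra_entities))
print("extra words:    ", sorted(extra_words))
```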

nltk.classify.rte_classify.rte_classifier(algorithm, sample_N=None)[source]

nltk.classify.rte_classify.rte_features(rtepair)[source]

nltk.classify.rte_classify.rte_featurize(rte_pairs)[source]
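The feature helpers might be used as in the sketch below. It rests on several assumptions: that `rte_features` returns a dict of features for one pair, that `rte_featurize` returns one `(features, label)` tuple per pair with the label taken from the pair's `value` attribute, and that the `SimpleNamespace` objects are acceptable stand-ins for RTEPair instances.

```python
from types import SimpleNamespace

from nltk.classify.rte_classify import rte_features, rte_featurize

# Hypothetical stand-ins for RTEPair objects (assumption: .text, .hyp, .value suffice).
pairs = [
    SimpleNamespace(
        text="Mary bought a red car in Paris",
        hyp="Mary bought a car",
        value=1,
    ),
    SimpleNamespace(
        text="Mary bought a red car in Paris",
        hyp="Mary sold a boat in London",
        value=0,
    ),
]

# Features for a single pair.
features = rte_features(pairs[0])
for name, value in sorted(features.items()):
    print(name, value)

# Labeled feature sets for a list of pairs, in the shape NLTK classifiers train on.
featurized = rte_featurize(pairs)
labels = [label for _, label in featurized]
print(labels)
```

`rte_classifier` is not exercised here because it needs the RTE corpus data to be available locally.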