Corpus reader for the Recognizing Textual Entailment (RTE) Challenge Corpora.
The files were taken from the RTE1, RTE2 and RTE3 datasets and regularized.
Filenames are of the form rte*_dev.xml and rte*_test.xml. The latter are the gold standard annotated files.
Each entailment corpus is a list of ‘text’/‘hypothesis’ pairs. The following example is taken from RTE3:
<pair id="1" entailment="YES" task="IE" length="short" >
    <t>The sale was made to pay Yukos' US$ 27.5 billion tax bill, Yuganskneftegaz was originally sold for US$ 9.4 billion to a little known company Baikalfinansgroup which was later bought by the Russian state-owned oil company Rosneft .</t>
    <h>Baikalfinansgroup was sold to Rosneft.</h>
</pair>
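As a rough illustration of this layout, a pair element can be read with Python's standard xml.etree.ElementTree. This is only a sketch and not the reader's own implementation; the variable names are illustrative and the text of the pair is abbreviated.

```python
import xml.etree.ElementTree as ET

# A pair in the RTE3 layout shown above; the <t> text is abbreviated for illustration.
PAIR_XML = """
<pair id="1" entailment="YES" task="IE" length="short">
  <t>Yuganskneftegaz was originally sold to Baikalfinansgroup, which was later bought by Rosneft.</t>
  <h>Baikalfinansgroup was sold to Rosneft.</h>
</pair>
""".strip()

pair = ET.fromstring(PAIR_XML)

# Attributes live on the <pair> element; text and hypothesis are child elements.
pair_id = pair.get("id")
label = pair.get("entailment")
task = pair.get("task")
text = pair.find("t").text
hypothesis = pair.find("h").text

print(pair_id, label, task)
print(hypothesis)
```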
In order to provide globally unique IDs for each pair, a new attribute challenge has been added to the root element entailment-corpus of each file, taking values 1, 2 or 3. The GID is formatted ‘m-n’, where ‘m’ is the challenge number and ‘n’ is the pair ID.
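The ‘m-n’ format can be sketched as follows, assuming a root element that carries the challenge attribute as described. The function name global_ids and the two-pair sample document are made up for illustration; the placeholder "..." stands in for real text and hypothesis content.

```python
import xml.etree.ElementTree as ET

# Toy document: only the attributes matter here, so the text content is a placeholder.
CORPUS_XML = """
<entailment-corpus challenge="3">
  <pair id="1" entailment="YES" task="IE" length="short"><t>...</t><h>...</h></pair>
  <pair id="2" entailment="NO" task="IE" length="short"><t>...</t><h>...</h></pair>
</entailment-corpus>
""".strip()


def global_ids(xml_string):
    """Build 'm-n' identifiers from the root's challenge number and each pair's id."""
    root = ET.fromstring(xml_string)
    challenge = root.get("challenge")
    return [f"{challenge}-{pair.get('id')}" for pair in root.iter("pair")]


gids = global_ids(CORPUS_XML)
print(gids)
```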
Normalize the string value in an RTE pair’s entailment attribute as an integer (1, 0).
value_string (str) – the label used to classify a text/hypothesis pair
- Return type: int
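A minimal sketch of such a normalization is given below. The name normalize_label and the lookup table are hypothetical, and the accepted spellings (YES/NO and TRUE/FALSE, compared case-insensitively) are an assumption about the labels found in the files rather than a statement of the library's behaviour.

```python
# Assumed label spellings; adjust if the files use other values.
LABEL_TO_INT = {"YES": 1, "TRUE": 1, "NO": 0, "FALSE": 0}


def normalize_label(value_string):
    """Map an entailment label string to 1 (entailed) or 0 (not entailed).

    Raises KeyError for a label that is not in the assumed table.
    """
    return LABEL_TO_INT[value_string.strip().upper()]


print(normalize_label("YES"), normalize_label("False"))
```

Raising on an unknown label, rather than returning a default, keeps a mislabelled pair from being silently counted as a non-entailment.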
- class nltk.corpus.reader.rte.RTEPair
Container for RTE text-hypothesis pairs.
The entailment relation is signalled by the value attribute in RTE1, and by entailment in RTE2 and RTE3. These both get mapped onto the entailment attribute of this class.
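The mapping of the two attribute names onto a single label might look like the sketch below. The helper entailment_label is hypothetical and is not the class's actual code; the sample pairs use placeholder text, and the task values on them are illustrative.

```python
import xml.etree.ElementTree as ET


def entailment_label(pair):
    """Return the raw label from whichever attribute is present, else None."""
    # RTE1 files use 'value'; RTE2 and RTE3 files use 'entailment'.
    for name in ("value", "entailment"):
        if name in pair.attrib:
            return pair.attrib[name]
    return None


rte1_style = ET.fromstring('<pair id="1" value="TRUE" task="CD"><t>...</t><h>...</h></pair>')
rte3_style = ET.fromstring('<pair id="1" entailment="YES" task="IE"><t>...</t><h>...</h></pair>')

print(entailment_label(rte1_style), entailment_label(rte3_style))
```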
- __init__(pair, challenge=None, id=None, text=None, hyp=None, value=None, task=None, length=None)
challenge – version of the RTE challenge (i.e., RTE1, RTE2 or RTE3)
id – identifier for the pair
text – the text component of the pair
hyp – the hypothesis component of the pair
value – classification label for the pair
task – attribute for the particular NLP task that the data was drawn from
length – attribute for the length of the text of the pair
- class nltk.corpus.reader.rte.RTECorpusReader
Corpus reader for corpora in RTE challenges.
This is just a wrapper around the XMLCorpusReader. See the module docstring above for the expected structure of input documents.