nltk.corpus.reader.mte module¶

A reader for corpora whose documents are in MTE format.

class nltk.corpus.reader.mte.MTECorpusReader[source]¶

Bases: TaggedCorpusReader

Reader for corpora following the TEI P5 XML scheme, such as MULTEXT-East. MULTEXT-East contains part-of-speech-tagged words annotated with fine-grained morphosyntactic description (MSD) tags. These tags can be converted to the Universal tagset.

__init__(root=None, fileids=None, encoding='utf8')[source]¶

Construct a new MTECorpusReader for a set of documents located at the given root directory. Example usage:

>>> from nltk.corpus.reader.mte import MTECorpusReader
>>> root = '/...path to corpus.../'
>>> reader = MTECorpusReader(root, 'oana-*.xml', 'utf8')

Parameters:

root – The root directory for this corpus. (default points to location in multext config file)
fileids – A list or regexp specifying the fileids in this corpus. (default is oana-en.xml)
encoding – The encoding of the given files (default is utf8)

lemma_paras(fileids=None)[source]¶

Parameters: fileids – A list specifying the fileids that should be used.
Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word, lemma) tuples
Return type: list(list(list(tuple(str, str))))

lemma_sents(fileids=None)[source]¶

Parameters: fileids – A list specifying the fileids that should be used.
Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of (word, lemma) tuples
Return type: list(list(tuple(str, str)))

lemma_words(fileids=None)[source]¶

Parameters: fileids – A list specifying the fileids that should be used.
Returns: the given file(s) as a list of words and punctuation symbols, each paired with its lemma and encoded as a (word, lemma) tuple
Return type: list(tuple(str, str))
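As a hedged illustration of how the (word, lemma) tuples might be consumed, the sketch below groups surface forms by lemma. The `sample` list is hand-written stand-in data rather than real corpus output, and `lemma_index` is a hypothetical helper that is not part of this module.

```python
from collections import defaultdict

# Hand-written stand-in for the (word, lemma) tuples that lemma_words()
# is documented to return; not actual corpus output.
sample = [
    ("It", "it"), ("was", "be"), ("a", "a"),
    ("bright", "bright"), ("day", "day"), ("was", "be"),
]

def lemma_index(pairs):
    """Map each lemma to the set of surface forms seen with it (hypothetical helper)."""
    index = defaultdict(set)
    for word, lemma in pairs:
        index[lemma].add(word)
    return dict(index)

index = lemma_index(sample)
```

With a real reader, the same helper could presumably be fed the result of `lemma_words()` directly, since it only relies on the documented tuple shape.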

paras(fileids=None)[source]¶

Parameters: fileids – A list specifying the fileids that should be used.
Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings
Return type: list(list(list(str)))

sents(fileids=None)[source]¶

Parameters: fileids – A list specifying the fileids that should be used.
Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings
Return type: list(list(str))

tagged_paras(fileids=None, tagset='msd', tags='')[source]¶

Parameters:

fileids – A list specifying the fileids that should be used.
tagset – The tagset to use in the returned object, either “universal” or “msd”; “msd” is the default.
tags – An MSD tag used as a filter: tokens whose tag is neither equal to nor more precise than the given tag are left out.

Returns:

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of (word,tag) tuples

Return type:

list(list(list(tuple(str, str))))

tagged_sents(fileids=None, tagset='msd', tags='')[source]¶

Parameters:

fileids – A list specifying the fileids that should be used.
tagset – The tagset to use in the returned object, either “universal” or “msd”; “msd” is the default.
tags – An MSD tag used as a filter: tokens whose tag is neither equal to nor more precise than the given tag are left out.

Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of (word, tag) tuples

Return type:

list(list(tuple(str, str)))

tagged_words(fileids=None, tagset='msd', tags='')[source]¶

Parameters:

fileids – A list specifying the fileids that should be used.
tagset – The tagset to use in the returned object, either “universal” or “msd”; “msd” is the default.
tags – An MSD tag used as a filter: tokens whose tag is neither equal to nor more precise than the given tag are left out.

Returns:

the given file(s) as a list of tagged words and punctuation symbols encoded as tuples (word, tag)

Return type:

list(tuple(str, str))
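The filtering described for the `tags` parameter can be sketched as follows. This is a minimal sketch under stated assumptions, not the library's implementation: it assumes MSD tags are positional strings, that a tag counts as "equal or more precise" when it begins with the given tag, and that a `-` in the filter leaves that position unconstrained. The function `msd_matches` and the sample tuples are hypothetical.

```python
import re

def msd_matches(filter_tag, candidate):
    """Return True if candidate is equal to or more specific than filter_tag.

    Assumption: '-' in filter_tag matches any single character, and any
    trailing characters in candidate only add precision.
    """
    # Escape each position literally, except '-' which becomes a wildcard.
    pattern = "".join("." if ch == "-" else re.escape(ch) for ch in filter_tag)
    # re.match anchors at the start, so this is a prefix test.
    return re.match(pattern, candidate) is not None

# Hand-written (word, tag) tuples standing in for tagged_words() output.
tagged = [("clocks", "Ncnp"), ("were", "Vmis3p"), ("striking", "Vmpp"), ("cold", "Afp")]

# Keep only tokens whose tag starts with the verb category.
verbs = [(w, t) for (w, t) in tagged if msd_matches("V", t)]
```

Under these assumptions, a short filter such as `V` keeps every verb, while a longer filter narrows the selection to tags that agree with it position by position.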

words(fileids=None)[source]¶

Parameters: fileids – A list specifying the fileids that should be used.
Returns: the given file(s) as a list of words and punctuation symbols.
Return type: list(str)
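The return types of `paras()`, `sents()` and `words()` differ only in nesting depth. The sketch below uses hand-written data (not corpus output) to show how the three shapes relate; it illustrates the documented types and makes no claim about how the reader computes them internally.

```python
from itertools import chain

# Hand-written stand-in shaped like the documented return type of paras():
# list(list(list(str))), i.e. paragraphs -> sentences -> word strings.
paras = [
    [["It", "was", "cold", "."], ["The", "clocks", "struck", "."]],
    [["Winston", "hurried", "."]],
]

# Dropping the paragraph level gives the list(list(str)) shape of sents().
sents = list(chain.from_iterable(paras))

# Dropping the sentence level as well gives the list(str) shape of words().
words = list(chain.from_iterable(sents))
```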

class nltk.corpus.reader.mte.MTECorpusView[source]¶

Bases: XMLCorpusView

Class for lazily viewing the MTE corpus.

__init__(fileid, tagspec, elt_handler=None)[source]¶

Create a new corpus view based on a specified XML file.

Note that the XMLCorpusView constructor does not take an encoding argument, because the unicode encoding is specified by the XML files themselves.

Parameters:

tagspec (str) – A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.
elt_handler – A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is:
```
elt_handler(elt, tagspec) -> value
```
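A handler with the signature above might look like the following sketch. The function `text_and_pos`, the attribute name `ana`, the tagspec string and the sample element are all assumptions made for illustration; the handler is called by hand here rather than by a corpus view.

```python
import xml.etree.ElementTree as ET

def text_and_pos(elt, tagspec):
    """Hypothetical handler: reduce an element to a (text, ana) tuple.

    The tagspec argument is accepted to match the documented signature
    but is not used here.
    """
    return (elt.text, elt.get("ana", ""))

# Hand-written element standing in for one matched by the view.
elt = ET.fromstring('<w ana="Ncnp" lemma="clock">clocks</w>')

# Call the handler directly, as a view would do for each matching element.
value = text_and_pos(elt, "hypothetical/tagspec")
```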

read_block(stream, tagspec=None, elt_handler=None)[source]¶: Read from stream until we find at least one element that matches tagspec, and return the result of applying elt_handler to each element found.

class nltk.corpus.reader.mte.MTEFileReader[source]¶

Bases: object

Class for loading the content of the MULTEXT-East corpus. It parses the XML files and filters tags according to the given method parameters.

__init__(file_path)[source]¶

lemma_paras()[source]¶

lemma_sents()[source]¶

lemma_words()[source]¶

ns = {'tei': 'https://www.tei-c.org/ns/1.0', 'xml': 'https://www.w3.org/XML/1998/namespace'}¶

para_path = 'TEI/text/body/div/div/p'¶

paras()[source]¶

sent_path = 'TEI/text/body/div/div/p/s'¶

sents()[source]¶

tag_ns = '{https://www.tei-c.org/ns/1.0}'¶

tagged_paras(tagset, tags)[source]¶

tagged_sents(tagset, tags)[source]¶

tagged_words(tagset, tags)[source]¶

word_path = 'TEI/text/body/div/div/p/s/(w|c)'¶

words()[source]¶

xml_ns = '{https://www.w3.org/XML/1998/namespace}'¶
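Prefixes such as `tag_ns` follow the `{uri}local` form that ElementTree uses for qualified names. The sketch below shows how such a prefix can be used to tell element kinds apart. It is only an illustration: the XML fragment is hand-written and deliberately declares the same namespace URI as the `tag_ns` value listed above, and a file declaring a different URI would not match.

```python
import xml.etree.ElementTree as ET

# Value copied from the tag_ns attribute listed above.
tag_ns = "{https://www.tei-c.org/ns/1.0}"

# Hand-written fragment; it declares the same namespace URI as tag_ns.
doc = ET.fromstring(
    '<s xmlns="https://www.tei-c.org/ns/1.0">'
    '<w lemma="clock">clocks</w><c>.</c>'
    "</s>"
)

# ElementTree stores qualified names as "{uri}local", so the prefix can be
# prepended to a local name to test what kind of element a child is.
kinds = ["word" if child.tag == tag_ns + "w" else "punct" for child in doc]
```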

class nltk.corpus.reader.mte.MTETagConverter[source]¶

Bases: object

Class for converting MSD tags to universal tags. Other conversion options are not currently implemented.

mapping_msd_universal = {'-': 'X', '.': '.', 'A': 'ADJ', 'C': 'CONJ', 'D': 'DET', 'M': 'NUM', 'N': 'NOUN', 'P': 'PRON', 'Q': 'PRT', 'R': 'ADV', 'S': 'ADP', 'V': 'VERB'}¶

static msd_to_universal(tag)[source]¶

This function converts the annotation from MULTEXT-East to the universal tagset as described in Chapter 5 of the NLTK Book.

Unknown tags are mapped to X. Punctuation marks are not supported in MSD tags, so
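A minimal sketch of such a conversion, assuming the first character of an MSD tag names its category, is shown below. The dictionary is copied from `mapping_msd_universal` as listed above; the function `msd_to_universal_sketch` is a hypothetical stand-in and may differ from the library's own handling of edge cases.

```python
# Mapping copied from MTETagConverter.mapping_msd_universal as listed above.
MAPPING = {
    "-": "X", ".": ".", "A": "ADJ", "C": "CONJ", "D": "DET", "M": "NUM",
    "N": "NOUN", "P": "PRON", "Q": "PRT", "R": "ADV", "S": "ADP", "V": "VERB",
}

def msd_to_universal_sketch(tag):
    """Hypothetical re-implementation; assumes the first character is the category."""
    category = tag[:1]
    # Unknown (or empty) categories fall back to the '-' entry, i.e. 'X'.
    return MAPPING.get(category, MAPPING["-"])

noun = msd_to_universal_sketch("Ncnp")
unknown = msd_to_universal_sketch("Zzz")
```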

nltk.corpus.reader.mte.xpath(root, path, ns)[source]¶
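The signature suggests a namespace-aware element lookup. The sketch below is an assumption about what such a helper could do, not a description of the actual function: `xpath_sketch` simply forwards to ElementTree's `findall` with a prefix-to-URI mapping, and the document is a hand-written fragment declaring the same TEI namespace URI as the `ns` attribute of MTEFileReader.

```python
import xml.etree.ElementTree as ET

# Prefix map in the same shape as MTEFileReader.ns (TEI entry only).
ns = {"tei": "https://www.tei-c.org/ns/1.0"}

def xpath_sketch(root, path, ns):
    """Hypothetical stand-in: find all elements under root matching path."""
    return root.findall(path, ns)

# Hand-written fragment mirroring the TEI/text/body/div/div/p/s nesting.
root = ET.fromstring(
    '<TEI xmlns="https://www.tei-c.org/ns/1.0"><text><body><div><div>'
    "<p><s><w>Hello</w><c>!</c></s></p>"
    "</div></div></body></text></TEI>"
)

# ".//tei:w" selects every word element at any depth below the root.
words = [w.text for w in xpath_sketch(root, ".//tei:w", ns)]
```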