nltk.corpus.reader.mte module

A reader for corpora whose documents are in MTE format.

class nltk.corpus.reader.mte.MTECorpusReader[source]

Bases: TaggedCorpusReader

Reader for corpora that follow the TEI P5 XML scheme, such as MULTEXT-East. MULTEXT-East contains part-of-speech-tagged words with a fine-grained tagging scheme (MSD). These tags can be converted to the Universal tagset.

__init__(root=None, fileids=None, encoding='utf8')[source]

Construct a new MTECorpusReader for a set of documents located at the given root directory. Example usage:

>>> from nltk.corpus.reader.mte import MTECorpusReader
>>> root = '/...path to corpus.../'
>>> reader = MTECorpusReader(root, 'oana-*.xml', 'utf8')
Parameters
  • root – The root directory for this corpus. (default points to location in multext config file)

  • fileids – A list or regexp specifying the fileids in this corpus. (default is oana-en.xml)

  • encoding – The encoding of the given files (default is utf8)

lemma_paras(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word, lemma) tuples

Return type

list(list(list(tuple(str, str))))

lemma_sents(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of (word, lemma) tuples

Return type

list(list(tuple(str, str)))

lemma_words(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a list of words and punctuation symbols with their lemmas, encoded as (word, lemma) tuples

Return type

list(tuple(str, str))
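The three lemma_* methods return the same (word, lemma) pairs at different nesting depths. As a minimal sketch, assuming hand-written toy data in the shape documented for lemma_sents() (the sentences are invented, not corpus output), the pairs can be flattened to the lemma_words() shape and grouped by lemma:

```python
from itertools import chain

# Toy data in the documented lemma_sents() shape: list(list(tuple(str, str))).
# These sentences are invented for illustration, not read from a corpus.
toy_lemma_sents = [
    [("It", "it"), ("was", "be"), ("cold", "cold"), (".", ".")],
    [("The", "the"), ("clocks", "clock"), ("were", "be"), ("striking", "strike")],
]

# Flattening one level gives the documented lemma_words() shape.
toy_lemma_words = list(chain.from_iterable(toy_lemma_sents))

# Group the surface forms seen for each lemma.
forms_by_lemma = {}
for word, lemma in toy_lemma_words:
    forms_by_lemma.setdefault(lemma, set()).add(word)
```

With a real reader, the toy list would be replaced by the result of lemma_sents(); the grouping step stays the same.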

paras(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings

Return type

list(list(list(str)))

sents(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings

Return type

list(list(str))

tagged_paras(fileids=None, tagset='msd', tags='')[source]
Parameters
  • fileids – A list specifying the fileids that should be used.

  • tagset – The tagset that should be used in the returned object, either “universal” or “msd” (the default).

  • tags – An MSD tag used as a filter: only tokens whose tag is equal to, or more precise than, the given tag are kept.

Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word, tag) tuples

Return type

list(list(list(tuple(str, str))))

tagged_sents(fileids=None, tagset='msd', tags='')[source]
Parameters
  • fileids – A list specifying the fileids that should be used.

  • tagset – The tagset that should be used in the returned object, either “universal” or “msd” (the default).

  • tags – An MSD tag used as a filter: only tokens whose tag is equal to, or more precise than, the given tag are kept.

Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of (word, tag) tuples

Return type

list(list(tuple(str, str)))

tagged_words(fileids=None, tagset='msd', tags='')[source]
Parameters
  • fileids – A list specifying the fileids that should be used.

  • tagset – The tagset that should be used in the returned object, either “universal” or “msd” (the default).

  • tags – An MSD tag used as a filter: only tokens whose tag is equal to, or more precise than, the given tag are kept.

Returns

the given file(s) as a list of tagged words and punctuation symbols encoded as tuples (word, tag)

Return type

list(tuple(str, str))
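The tags parameter of the tagged_* methods compares each token's MSD tag against a filter tag. A minimal sketch of one way to read that rule, assuming MSD tags are positional strings in which "-" marks an unspecified position. The helper name matches_msd_filter and the sample tags are hypothetical and are not part of NLTK:

```python
import re


def matches_msd_filter(tag, tag_filter):
    """Return True if `tag` is equal to or more specific than `tag_filter`.

    Assumptions: '-' in the filter leaves that position unconstrained, and
    positions of `tag` beyond the end of the filter are also unconstrained.
    """
    if tag_filter == "":
        return True  # assumed to mirror the tags='' default: keep everything
    pattern = "".join("." if ch == "-" else re.escape(ch) for ch in tag_filter)
    return re.match(pattern, tag) is not None


# Toy (word, tag) pairs in the documented tagged_words() shape; invented data.
toy_tagged = [("clocks", "Ncnp"), ("striking", "Vmpp"), ("Winston", "Npms")]

# "Nc" keeps tags that start with N (noun) followed by c.
common_nouns = [(w, t) for w, t in toy_tagged if matches_msd_filter(t, "Nc")]

# "N--p" constrains positions 1 and 4 and leaves positions 2 and 3 open.
plural_nouns = [(w, t) for w, t in toy_tagged if matches_msd_filter(t, "N--p")]
```

Prefix matching is used here because a longer tag carries more feature positions, which is one plausible meaning of "more precise"; the reader's actual comparison may differ in detail.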

words(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)
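The return types of paras(), sents(), and words() differ only in nesting depth. A minimal sketch with invented toy data, assuming the three views cover the same tokens in the same order (the return types suggest this, but this page does not state it):

```python
from itertools import chain

# Toy data in the documented paras() shape: list(list(list(str))).
# Invented for illustration, not read from a corpus.
toy_paras = [
    [["It", "was", "cold", "."], ["The", "clocks", "struck", "."]],
    [["Winston", "smiled", "."]],
]

# Dropping the paragraph level gives the sents() shape: list(list(str)).
toy_sents = list(chain.from_iterable(toy_paras))

# Dropping the sentence level as well gives the words() shape: list(str).
toy_words = list(chain.from_iterable(toy_sents))
```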

class nltk.corpus.reader.mte.MTECorpusView[source]

Bases: XMLCorpusView

Class for lazily viewing the MTE corpus.

__init__(fileid, tagspec, elt_handler=None)[source]

Create a new corpus view based on a specified XML file.

Note that the XMLCorpusView constructor does not take an encoding argument, because the unicode encoding is specified by the XML files themselves.

Parameters
  • tagspec (str) – A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.

  • elt_handler

    A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is:

    elt_handler(elt, tagspec) -> value
    

read_block(stream, tagspec=None, elt_handler=None)[source]

Read from stream until we find at least one element that matches tagspec, and return the result of applying elt_handler to each element found.
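The elt_handler(elt, tagspec) -> value signature can be illustrated without a corpus. The following is a minimal sketch using only the standard library's ElementTree. The handler name word_and_tag, the sample XML, and the assumption that the MSD tag is stored in an "ana" attribute are illustrative and not confirmed by this page. It does not show how MTECorpusView invokes a handler, only what a handler of that shape could look like:

```python
import xml.etree.ElementTree as ET

# Same namespace string as the tag_ns attribute listed for MTEFileReader.
TEI_NS = "{https://www.tei-c.org/ns/1.0}"


def word_and_tag(elt, tagspec):
    """Hypothetical elt_handler: map a <w> or <c> element to (text, tag).

    Assumption: the MSD tag is stored in an 'ana' attribute; elements
    without that attribute get an empty tag.
    """
    return (elt.text, elt.attrib.get("ana", ""))


# A tiny hand-written sentence, not taken from a corpus.
sentence = ET.fromstring(
    '<s xmlns="https://www.tei-c.org/ns/1.0">'
    '<w ana="Ncnp">clocks</w><w ana="Vmis">struck</w><c>.</c>'
    "</s>"
)

# Apply the handler to each word or punctuation element, in document order.
values = [
    word_and_tag(elt, "s/(w|c)")
    for elt in sentence
    if elt.tag in (TEI_NS + "w", TEI_NS + "c")
]
```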

class nltk.corpus.reader.mte.MTEFileReader[source]

Bases: object

Class for loading the content of the MULTEXT-East corpus. It parses the XML files and filters tags according to the given method parameters.

__init__(file_path)[source]
lemma_paras()[source]
lemma_sents()[source]
lemma_words()[source]
ns = {'tei': 'https://www.tei-c.org/ns/1.0', 'xml': 'https://www.w3.org/XML/1998/namespace'}
para_path = 'TEI/text/body/div/div/p'
paras()[source]
sent_path = 'TEI/text/body/div/div/p/s'
sents()[source]
tag_ns = '{https://www.tei-c.org/ns/1.0}'
tagged_paras(tagset, tags)[source]
tagged_sents(tagset, tags)[source]
tagged_words(tagset, tags)[source]
word_path = 'TEI/text/body/div/div/p/s/(w|c)'
words()[source]
xml_ns = '{https://www.w3.org/XML/1998/namespace}'
class nltk.corpus.reader.mte.MTETagConverter[source]

Bases: object

Class for converting MSD tags to Universal tags. Other conversion options are not currently implemented.

mapping_msd_universal = {'-': 'X', '.': '.', 'A': 'ADJ', 'C': 'CONJ', 'D': 'DET', 'M': 'NUM', 'N': 'NOUN', 'P': 'PRON', 'Q': 'PRT', 'R': 'ADV', 'S': 'ADP', 'V': 'VERB'}
static msd_to_universal(tag)[source]

This function converts an annotation from MULTEXT-East to the Universal tagset, as described in Chapter 5 of the NLTK Book.

Unknown tags are mapped to X. Punctuation marks are not supported in MSD tags.
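The conversion can be sketched directly from the mapping_msd_universal table above. This is a hypothetical re-implementation for illustration, not NLTK's code; it assumes the first character of an MSD tag names its category, and the sample tags are invented:

```python
# Mapping copied from MTETagConverter.mapping_msd_universal above.
MAPPING_MSD_UNIVERSAL = {
    "-": "X", ".": ".", "A": "ADJ", "C": "CONJ", "D": "DET", "M": "NUM",
    "N": "NOUN", "P": "PRON", "Q": "PRT", "R": "ADV", "S": "ADP", "V": "VERB",
}


def msd_to_universal_sketch(tag):
    """Hypothetical stand-in for MTETagConverter.msd_to_universal.

    Assumption: the first character of an MSD tag names its category.
    Categories missing from the table fall back to the '-' entry, i.e. 'X'.
    """
    category = tag[:1]
    return MAPPING_MSD_UNIVERSAL.get(category, MAPPING_MSD_UNIVERSAL["-"])


universal_tags = [msd_to_universal_sketch(t) for t in ("Ncnp", "Vmis", "Zzz")]
```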

nltk.corpus.reader.mte.xpath(root, path, ns)[source]
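The xpath helper is listed without a description. A minimal sketch of a namespace-aware lookup with the same argument order, assuming it behaves like ElementTree's findall. The function find_all is a hypothetical stand-in, the XML is hand-written, and the prefixed paths are an adaptation of para_path and sent_path for ElementTree rather than the values MTEFileReader uses:

```python
import xml.etree.ElementTree as ET

# Namespace mapping as listed in MTEFileReader.ns.
NS = {
    "tei": "https://www.tei-c.org/ns/1.0",
    "xml": "https://www.w3.org/XML/1998/namespace",
}


def find_all(root, path, ns):
    """Hypothetical stand-in for xpath(root, path, ns)."""
    return root.findall(path, ns)


# Hand-written document with the TEI/text/body/div/div/p/s nesting.
doc = ET.fromstring(
    '<TEI xmlns="https://www.tei-c.org/ns/1.0"><text><body><div><div>'
    "<p><s><w>It</w><w>was</w><w>cold</w><c>.</c></s></p>"
    "<p><s><w>Winston</w><w>smiled</w><c>.</c></s></p>"
    "</div></div></body></text></TEI>"
)

# Paths are relative to the root <TEI> element, so they start at tei:text.
paras = find_all(doc, "tei:text/tei:body/tei:div/tei:div/tei:p", NS)
sents = find_all(doc, "tei:text/tei:body/tei:div/tei:div/tei:p/tei:s", NS)

# Each <s> holds <w> (word) and <c> (punctuation) children in order.
words = [tok.text for s in sents for tok in s]
```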