nltk.corpus.reader.nkjp module

class nltk.corpus.reader.nkjp.NKJPCorpusReader[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

WORDS_MODE = 0
SENTS_MODE = 1
HEADER_MODE = 2
RAW_MODE = 3
__init__(root, fileids='.*')[source]

Corpus reader designed to work with the National Corpus of Polish (NKJP). See http://nkjp.pl/ for more details about NKJP.

Usage example:

    import nltk
    from nltk.corpus.reader.nkjp import NKJPCorpusReader

    # Obtain the whole corpus.
    x = NKJPCorpusReader(root='/home/USER/nltk_data/corpora/nkjp/', fileids='')
    x.header()
    x.raw()
    x.words()
    # Link to find more tags: nkjp.pl/poliqarp/help/ense2.html
    x.tagged_words(tags=['subst', 'comp'])
    x.sents()

    # Obtain particular file(s).
    x = NKJPCorpusReader(root='/home/USER/nltk_data/corpora/nkjp/', fileids='Wilk*')
    x.header(fileids=['WilkDom', '/home/USER/nltk_data/corpora/nkjp/WilkWilczy'])
    x.tagged_words(fileids=['WilkDom', '/home/USER/nltk_data/corpora/nkjp/WilkWilczy'], tags=['subst', 'comp'])

get_paths()[source]
fileids()[source]

Returns a list of file identifiers for the files that make up this corpus.

add_root(fileid)[source]

Add root if necessary to specified fileid.
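As a rough illustration of what "adding the root if necessary" could mean, the sketch below prefixes a file identifier with the corpus root unless the identifier already contains it. The function name add_root_sketch and its exact rule are assumptions made for illustration; the library's own implementation may differ.

```python
def add_root_sketch(root, fileid):
    """Hypothetical stand-in for add_root: prefix fileid with root if it is missing."""
    # An identifier that already contains the root is returned unchanged.
    if root in fileid:
        return fileid
    # Otherwise the root directory is prepended to the bare identifier.
    return root + fileid


root = "/home/USER/nltk_data/corpora/nkjp/"
print(add_root_sketch(root, "WilkDom"))
print(add_root_sketch(root, root + "WilkWilczy"))
```

Under this reading, short names and full paths can be mixed in one fileids list, as in the usage example for the class.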

header(fileids=None, **kwargs)[source]

Returns header(s) of specified fileids.

sents(fileids=None, **kwargs)[source]

Returns sentences in specified fileids.

words(fileids=None, **kwargs)[source]

Returns words in specified fileids.

tagged_words(fileids=None, **kwargs)[source]

Call with specified tags as a list, e.g. tags=['subst', 'comp']. Returns tagged words in specified fileids.

raw(fileids=None, **kwargs)[source]

Returns the raw text of the specified fileids.

class nltk.corpus.reader.nkjp.NKJPCorpus_Header_View[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

__init__(filename, **kwargs)[source]

A stream-backed corpus view specialized for use with header.xml files in the NKJP corpus (HEADER_MODE).

handle_query()[source]
handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns

The view value corresponding to elt.

Parameters
  • elt (ElementTree) – The element that should be converted.

  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
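To make the context parameter concrete, the following sketch walks a small XML tree with the standard-library ElementTree and pairs each element with a slash-separated context string. The helper iter_with_context is hypothetical and is not how XMLCorpusView computes contexts; it only illustrates the string format described above.

```python
import xml.etree.ElementTree as ET


def iter_with_context(elt, parent_context=""):
    """Yield (element, context) pairs; context joins ancestor tags with '/'."""
    context = f"{parent_context}/{elt.tag}" if parent_context else elt.tag
    yield elt, context
    # Recurse into children, passing the context built so far.
    for child in elt:
        yield from iter_with_context(child, context)


root = ET.fromstring("<foo><bar><baz>text</baz></bar></foo>")
contexts = [context for _, context in iter_with_context(root)]
print(contexts)  # ['foo', 'foo/bar', 'foo/bar/baz']
```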

class nltk.corpus.reader.nkjp.XML_Tool[source]

Bases: object

Helper class that converts an XML file into one without references to the nkjp: namespace. This is needed because XMLCorpusView assumes that one can find short substrings of XML that are valid XML, which is not true if a namespace is declared at the top level.
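The idea of producing a copy without nkjp: references can be sketched with regular expressions. The function strip_nkjp_prefix below is a hypothetical illustration that works on a string rather than a file; its patterns, the sample document, and the placeholder namespace URL are assumptions, not the rules XML_Tool actually applies.

```python
import re
import xml.etree.ElementTree as ET


def strip_nkjp_prefix(xml_text):
    """Remove the xmlns:nkjp declaration and 'nkjp:' prefixes (illustrative only)."""
    # Drop the top-level namespace declaration.
    without_decl = re.sub(r'\s+xmlns:nkjp="[^"]*"', "", xml_text)
    # Drop the prefix from names such as nkjp:paren.
    return re.sub(r"\bnkjp:", "", without_decl)


# Made-up sample; the namespace URL is a placeholder.
sample = (
    '<teiCorpus xmlns:nkjp="http://example.org/ns">'
    '<seg nkjp:paren="true">a</seg></teiCorpus>'
)
cleaned = strip_nkjp_prefix(sample)

# A short substring of the cleaned text now parses on its own.
fragment = cleaned[cleaned.index("<seg"):cleaned.index("</teiCorpus>")]
elt = ET.fromstring(fragment)
print(elt.attrib)  # {'paren': 'true'}
```

Parsing the same substring from the unprocessed sample would fail, because the nkjp: prefix is only bound by the declaration on the enclosing top-level element.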

__init__(root, filename)[source]
build_preprocessed_file()[source]
remove_preprocessed_file()[source]

class nltk.corpus.reader.nkjp.NKJPCorpus_Segmentation_View[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

A stream-backed corpus view specialized for use with ann_segmentation.xml files in the NKJP corpus.

__init__(filename, **kwargs)[source]

Create a new corpus view based on a specified XML file.

Note that the XMLCorpusView constructor does not take an encoding argument, because the unicode encoding is specified by the XML files themselves.

Parameters
  • tagspec (str) – A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.

  • elt_handler

    A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is:

    elt_handler(elt, tagspec) -> value
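A handler with the signature shown above might look like the following sketch. It uses only the standard-library ElementTree; the handler name text_handler, the sample XML, and the tag specification string are made up for illustration, and the list comprehension merely imitates a view calling the handler. It is not XMLCorpusView code.

```python
import xml.etree.ElementTree as ET


def text_handler(elt, tagspec):
    """Example elt_handler: return the stripped text of the element."""
    # tagspec is accepted to match the documented signature but is unused here.
    return (elt.text or "").strip()


doc = ET.fromstring(
    "<body><div><ab> Ala ma kota. </ab><ab>Kot ma Ale.</ab></div></body>"
)
# Hypothetical tag specification string, passed through unchanged.
values = [text_handler(ab, ".*/div/ab") for ab in doc.iter("ab")]
print(values)  # ['Ala ma kota.', 'Kot ma Ale.']
```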
    

get_segm_id(example_word)[source]
get_sent_beg(beg_word)[source]
get_sent_end(end_word)[source]
get_sentences(sent_segm)[source]
remove_choice(segm)[source]
handle_query()[source]
handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns

The view value corresponding to elt.

Parameters
  • elt (ElementTree) – The element that should be converted.

  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.

class nltk.corpus.reader.nkjp.NKJPCorpus_Text_View[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

A stream-backed corpus view specialized for use with text.xml files in the NKJP corpus.

SENTS_MODE = 0
RAW_MODE = 1
__init__(filename, **kwargs)[source]

Create a new corpus view based on a specified XML file.

Note that the XMLCorpusView constructor does not take an encoding argument, because the unicode encoding is specified by the XML files themselves.

Parameters
  • tagspec (str) – A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.

  • elt_handler

    A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is:

    elt_handler(elt, tagspec) -> value
    

handle_query()[source]
read_block(stream, tagspec=None, elt_handler=None)[source]

Returns text as a list of sentences.

get_segm_id(elt)[source]
handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns

The view value corresponding to elt.

Parameters
  • elt (ElementTree) – The element that should be converted.

  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.

class nltk.corpus.reader.nkjp.NKJPCorpus_Morph_View[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

A stream-backed corpus view specialized for use with ann_morphosyntax.xml files in the NKJP corpus.

__init__(filename, **kwargs)[source]

Create a new corpus view based on a specified XML file.

Note that the XMLCorpusView constructor does not take an encoding argument, because the unicode encoding is specified by the XML files themselves.

Parameters
  • tagspec (str) – A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.

  • elt_handler

    A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is:

    elt_handler(elt, tagspec) -> value
    

handle_query()[source]
handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns

The view value corresponding to elt.

Parameters
  • elt (ElementTree) – The element that should be converted.

  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.