nltk.corpus.reader.nkjp module¶

class nltk.corpus.reader.nkjp.NKJPCorpusReader[source]¶

Bases: XMLCorpusReader

HEADER_MODE = 2¶

RAW_MODE = 3¶

SENTS_MODE = 1¶

WORDS_MODE = 0¶

__init__(root, fileids='.*')[source]¶: Corpus reader designed to work with the National Corpus of Polish (NKJP). See http://nkjp.pl/ for more details about NKJP. Usage example:

    from nltk.corpus.reader.nkjp import NKJPCorpusReader

    # Obtain the whole corpus
    x = NKJPCorpusReader(root='/home/USER/nltk_data/corpora/nkjp/', fileids='')
    x.header()
    x.raw()
    x.words()
    # More tags are listed at nkjp.pl/poliqarp/help/ense2.html
    x.tagged_words(tags=['subst', 'comp'])
    x.sents()

    # Obtain particular file(s)
    x = NKJPCorpusReader(root='/home/USER/nltk_data/corpora/nkjp/', fileids='Wilk*')
    x.header(fileids=['WilkDom', '/home/USER/nltk_data/corpora/nkjp/WilkWilczy'])
    x.tagged_words(fileids=['WilkDom', '/home/USER/nltk_data/corpora/nkjp/WilkWilczy'], tags=['subst', 'comp'])

add_root(fileid)[source]¶: Add root if necessary to specified fileid.

fileids()[source]¶: Returns a list of file identifiers for the fileids that make up this corpus.

get_paths()[source]¶

header(fileids=None, **kwargs)[source]¶: Returns header(s) of specified fileids.

raw(fileids=None, **kwargs)[source]¶: Returns the raw text of the specified fileids.

sents(fileids=None, **kwargs)[source]¶: Returns sentences in specified fileids.

tagged_words(fileids=None, **kwargs)[source]¶: Returns tagged words in the specified fileids. Pass the tags of interest as a list, e.g. tags=['subst', 'comp'].

words(fileids=None, **kwargs)[source]¶: Returns words in specified fileids.

class nltk.corpus.reader.nkjp.NKJPCorpus_Header_View[source]¶

Bases: XMLCorpusView

__init__(filename, **kwargs)[source]¶: HEADER_MODE: a stream-backed corpus view specialized for use with header.xml files in the NKJP corpus.

handle_elt(elt, context)[source]¶

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns:

The view value corresponding to elt.

Parameters:

elt (ElementTree) – The element that should be converted.
context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
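To make the context string concrete, here is a minimal sketch that builds such strings for a tiny document using only the standard library. It does not use NLTK's own machinery; the helper name iter_with_context is hypothetical and exists only for illustration.

```python
import xml.etree.ElementTree as ET


def iter_with_context(elt, parents=()):
    """Yield (context, element) pairs for elt and all of its descendants.

    Hypothetical helper: the context is the chain of tag names from the
    top-level element down to the current one, joined by forward slashes.
    """
    path = parents + (elt.tag,)
    yield "/".join(path), elt
    for child in elt:
        yield from iter_with_context(child, path)


doc = ET.fromstring("<foo><bar><baz>text</baz></bar></foo>")
contexts = [context for context, _ in iter_with_context(doc)]
print(contexts)  # ['foo', 'foo/bar', 'foo/bar/baz']
```

The last entry, 'foo/bar/baz', is the context of the baz element described above.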

handle_query()[source]¶

class nltk.corpus.reader.nkjp.NKJPCorpus_Morph_View[source]¶

Bases: XMLCorpusView

A stream-backed corpus view specialized for use with ann_morphosyntax.xml files in the NKJP corpus.

__init__(filename, **kwargs)[source]¶

Create a new corpus view based on a specified XML file.

Note that the XMLCorpusView constructor does not take an encoding argument, because the Unicode encoding is specified by the XML files themselves.

Parameters:

tagspec (str) – A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.
elt_handler –
A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is:
```
elt_handler(elt, tagspec) -> value
```
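As an illustration of that signature, the sketch below defines a handler that turns an element into a plain value and applies it by hand to elements parsed with the standard library. The handler name text_handler and the sample XML are hypothetical, and the second argument passed here is a context string of the kind described for handle_elt(); treat that as an assumption rather than a statement about what the view passes.

```python
import xml.etree.ElementTree as ET


def text_handler(elt, context):
    """Hypothetical handler: return (tag, stripped text) instead of the element."""
    return elt.tag, (elt.text or "").strip()


# Invented sample document, not taken from NKJP.
doc = ET.fromstring("<p><s> Ala ma kota. </s><s>Kot ma Alę.</s></p>")

# Apply the handler manually to every <s> element.
values = [text_handler(s, "p/s") for s in doc.iter("s")]
print(values)  # [('s', 'Ala ma kota.'), ('s', 'Kot ma Alę.')]
```

A function of this shape could be supplied as elt_handler so that the view yields plain values rather than ElementTree objects.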

handle_elt(elt, context)[source]¶

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns:

The view value corresponding to elt.

Parameters:

elt (ElementTree) – The element that should be converted.
context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.

handle_query()[source]¶

class nltk.corpus.reader.nkjp.NKJPCorpus_Segmentation_View[source]¶

Bases: XMLCorpusView

A stream-backed corpus view specialized for use with ann_segmentation.xml files in the NKJP corpus.

__init__(filename, **kwargs)[source]¶

Create a new corpus view based on a specified XML file.

Note that the XMLCorpusView constructor does not take an encoding argument, because the Unicode encoding is specified by the XML files themselves.

Parameters:

tagspec (str) – A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.
elt_handler –
A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is:
```
elt_handler(elt, tagspec) -> value
```

get_segm_id(example_word)[source]¶

get_sent_beg(beg_word)[source]¶

get_sent_end(end_word)[source]¶

get_sentences(sent_segm)[source]¶

handle_elt(elt, context)[source]¶

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns:

The view value corresponding to elt.

Parameters:

elt (ElementTree) – The element that should be converted.
context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.

handle_query()[source]¶

remove_choice(segm)[source]¶

class nltk.corpus.reader.nkjp.NKJPCorpus_Text_View[source]¶

Bases: XMLCorpusView

A stream-backed corpus view specialized for use with text.xml files in the NKJP corpus.

RAW_MODE = 1¶

SENTS_MODE = 0¶

__init__(filename, **kwargs)[source]¶

Create a new corpus view based on a specified XML file.

Note that the XMLCorpusView constructor does not take an encoding argument, because the Unicode encoding is specified by the XML files themselves.

Parameters:

tagspec (str) – A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.
elt_handler –
A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is:
```
elt_handler(elt, tagspec) -> value
```

get_segm_id(elt)[source]¶

handle_elt(elt, context)[source]¶

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns:

The view value corresponding to elt.

Parameters:

elt (ElementTree) – The element that should be converted.
context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.

handle_query()[source]¶

read_block(stream, tagspec=None, elt_handler=None)[source]¶: Returns text as a list of sentences.

class nltk.corpus.reader.nkjp.XML_Tool[source]¶

Bases: object

Helper class that rewrites an XML file into one without references to the nkjp: namespace. This is needed because XMLCorpusView assumes that one can find short substrings of XML that are valid XML, which is not true if a namespace is declared at the top level.
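The problem can be reproduced with the standard library alone. The sketch below is not XML_Tool's actual implementation: the fragment is an invented example, and strip_nkjp_attributes is a hypothetical helper that simply drops nkjp:-prefixed attributes. It shows that a fragment cut out of a namespaced file fails to parse on its own, and parses once the prefixed attribute is gone.

```python
import re
import xml.etree.ElementTree as ET

# Invented fragment: the nkjp: prefix is declared on the file's root
# element, so it is unbound when the fragment is parsed in isolation.
fragment = '<seg nkjp:nps="true" corresp="text.xml#string-range(p-1,0,3)">Ala</seg>'


def parses(xml_text):
    """Return True if xml_text is well-formed XML on its own."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def strip_nkjp_attributes(xml_text):
    """Hypothetical helper: remove attributes that use the nkjp: prefix."""
    return re.sub(r'\s+nkjp:[\w.-]+="[^"]*"', "", xml_text)


cleaned = strip_nkjp_attributes(fragment)
print(parses(fragment))  # False: unbound prefix
print(parses(cleaned))   # True
```

Removing the namespace references before reading, as the class description says, avoids this failure mode for every fragment the view extracts.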

__init__(root, filename)[source]¶

build_preprocessed_file()[source]¶

remove_preprocessed_file()[source]¶
