nltk.corpus.reader.ipipan module

class nltk.corpus.reader.ipipan.IPIPANCorpusReader[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader designed to work with the corpus created by IPI PAN. See http://korpus.pl/en/ for more details about the IPI PAN corpus.

The corpus includes information about text domain, channel and categories. You can access the possible values using domains(), channels() and categories(). You can also use this metadata to filter files, e.g.: fileids(channels='prasa'), fileids(categories='publicystyczny').
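A minimal usage sketch of the metadata accessors and the filter described above. The helper name summarize_metadata is hypothetical, and reader is assumed to be an already constructed IPIPANCorpusReader; 'prasa' is the example channel value from the text.

```python
def summarize_metadata(reader):
    """Collect metadata values and one filtered file list from a reader.

    `reader` is assumed to be an IPIPANCorpusReader instance (or any object
    exposing the same channels/domains/categories/fileids methods).
    """
    return {
        "channels": reader.channels(),
        "domains": reader.domains(),
        "categories": reader.categories(),
        # The keyword name follows the documented fileids() signature.
        "press_files": reader.fileids(channels="prasa"),
    }
```

Passing the reader in as an argument keeps the sketch independent of where the corpus files are installed.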

The reader supports the methods words, sents, paras and their tagged versions. You can get the part of speech instead of the full tag by passing the parameter “simplify_tags=True”, e.g.: tagged_sents(simplify_tags=True).
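To illustrate what reducing a full tag to its part of speech means, here is a small sketch. It assumes that full tags are colon-separated strings whose first field is the part of speech; the function names and the sample sentence are made up for illustration and are not part of the reader.

```python
def simplify_tag(full_tag):
    """Return the leading field of a colon-separated tag (assumed format)."""
    return full_tag.split(":")[0]


def simplify_tagged_sentence(tagged_sentence):
    """Apply simplify_tag to every (word, tag) pair in a sentence."""
    return [(word, simplify_tag(tag)) for word, tag in tagged_sentence]


# Illustrative data only, not taken from the corpus.
sentence = [("dom", "subst:sg:nom:m3"), ("stoi", "fin:sg:ter:imperf")]
print(simplify_tagged_sentence(sentence))
# prints [('dom', 'subst'), ('stoi', 'fin')]
```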

You can also get all disambiguated tags by specifying the parameter “one_tag=False”, e.g.: tagged_paras(one_tag=False).

You can get all tags that were assigned by a morphological analyzer by specifying the parameter “disamb_only=False”, e.g.: tagged_words(disamb_only=False).
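The tag-related keyword parameters can be compared side by side with a helper like the one below. This is only a sketch: tagging_variants is a hypothetical name, and reader is assumed to be an already constructed IPIPANCorpusReader.

```python
def tagging_variants(reader, fileids=None):
    """Request tagged words under the tag-related options described above."""
    return {
        # Default behaviour, no extra keyword arguments.
        "default": reader.tagged_words(fileids),
        # All disambiguated tags for each token.
        "all_disambiguated": reader.tagged_words(fileids, one_tag=False),
        # Every tag assigned by the morphological analyzer.
        "all_analyses": reader.tagged_words(fileids, disamb_only=False),
    }
```

Comparing the same token across the three results shows how each option changes the tag information returned for it.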

The IPIPAN Corpus contains tags indicating whether there is a space between two tokens. To add special “no space” markers, specify the parameter “append_no_space=True”, e.g.: tagged_words(append_no_space=True). As a result, wherever there should be no space between two tokens, a new pair (‘’, ‘no-space’) is inserted (for tagged data), or just ‘’ for methods without tags.

The corpus reader can also try to append spaces between words. To enable this option, specify the parameter “append_space=True”, e.g.: words(append_space=True). As a result, either ‘ ’ or (‘ ’, ‘space’) will be inserted between tokens.
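The effect of the two spacing options on tagged output can be modelled with a short sketch. It is not the reader's implementation: the function name, the input layout (orth, tag, no_space_before) and the sample tokens are assumptions made for illustration.

```python
def with_markers(tokens, append_space=False, append_no_space=False):
    """Insert space / no-space markers between tagged tokens.

    `tokens` is a list of (orth, tag, no_space_before) triples, where the
    boolean says that no space separates the token from the previous one.
    """
    result = []
    for index, (orth, tag, no_space_before) in enumerate(tokens):
        if index > 0:  # markers only go between tokens
            if no_space_before and append_no_space:
                result.append(("", "no-space"))
            elif not no_space_before and append_space:
                result.append((" ", "space"))
        result.append((orth, tag))
    return result


# Illustrative tokens: the comma is glued to the preceding word.
tokens = [("Tak", "qub", False), (",", "interp", True), ("to", "pred", False)]
print(with_markers(tokens, append_no_space=True))
# prints [('Tak', 'qub'), ('', 'no-space'), (',', 'interp'), ('to', 'pred')]
```

For the untagged methods the same idea applies, with the plain strings ‘’ and ‘ ’ taking the place of the marker pairs.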

By default, XML entities like &quot; and &amp; are replaced by the corresponding characters. You can turn off this feature by specifying the parameter “replace_xmlentities=False”, e.g.: words(replace_xmlentities=False).
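As an illustration of what entity replacement does to a token, the sketch below handles the two entities &quot; and &amp;. The function name is hypothetical and the code is a stand-alone model, not the reader's own routine.

```python
def replace_xml_entities(text):
    """Replace the &quot; and &amp; entities with their characters."""
    # Handle &amp; last so that "&amp;quot;" becomes the literal text
    # "&quot;" rather than a double quote.
    return text.replace("&quot;", '"').replace("&amp;", "&")


print(replace_xml_entities("&quot;Tom &amp; Jerry&quot;"))
# prints "Tom & Jerry"   (including the double quotes)
```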

__init__(root, fileids)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
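The rules for the encoding parameter listed above can be summarised in a small lookup function. This is a sketch under stated assumptions: resolve_encoding is a hypothetical helper, not an NLTK function, and it assumes that patterns in the list form are tried in order and matched from the start of the file identifier.

```python
import re


def resolve_encoding(encoding, file_id):
    """Return the encoding name for file_id, or None for raw byte strings."""
    if encoding is None or isinstance(encoding, str):
        # None: no decoding at all; string: one encoding for every file.
        return encoding
    if isinstance(encoding, dict):
        # Dictionary: a missing key means the file is read as bytes.
        return encoding.get(file_id)
    # List of (regexp, encoding) tuples: the first matching pattern wins.
    for pattern, name in encoding:
        if re.match(pattern, file_id):
            return name
    return None


rules = [(r".*\.xml", "utf8"), (r".*", "latin2")]
print(resolve_encoding(rules, "a/morph.xml"))  # prints utf8
print(resolve_encoding(rules, "notes.txt"))    # prints latin2
```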

channels(fileids=None)[source]
domains(fileids=None)[source]
categories(fileids=None)[source]
fileids(channels=None, domains=None, categories=None)[source]

Return a list of file identifiers for the files that make up this corpus, optionally restricted to those matching the given channels, domains or categories.

sents(fileids=None, **kwargs)[source]
paras(fileids=None, **kwargs)[source]
words(fileids=None, **kwargs)[source]
tagged_sents(fileids=None, **kwargs)[source]
tagged_paras(fileids=None, **kwargs)[source]
tagged_words(fileids=None, **kwargs)[source]
class nltk.corpus.reader.ipipan.IPIPANCorpusView[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

WORDS_MODE = 0
SENTS_MODE = 1
PARAS_MODE = 2
__init__(filename, startpos=0, **kwargs)[source]

Create a new corpus view, based on the file fileid, and read with block_reader. See the class documentation for more information.

Parameters
  • fileid – The path to the file that is read by this corpus view. fileid can either be a string or a PathPointer.

  • startpos – The file position at which the view will start reading. This can be used to skip over preface sections.

  • encoding – The unicode encoding that should be used to read the file’s contents. If no encoding is specified, then the file’s contents will be read as a non-unicode string (i.e., a str).

read_block(stream)[source]

Read a block from the input stream.

Returns

a block of tokens from the input stream

Return type

list(any)

Parameters

stream (stream) – an input stream