nltk.corpus.reader.ipipan module
- class nltk.corpus.reader.ipipan.IPIPANCorpusReader
Bases: CorpusReader
Corpus reader designed to work with the corpus created by IPI PAN. See http://korpus.pl/en/ for more details about the IPI PAN corpus.

The corpus includes information about text domain, channel and categories. You can list the possible values using domains(), channels() and categories(). You can also use this metadata to filter files, e.g. fileids(channel='prasa'), fileids(categories='publicystyczny').

The reader supports the methods words, sents, paras and their tagged versions. To get only the part of speech instead of the full tag, pass the parameter “simplify_tags=True”, e.g. tagged_sents(simplify_tags=True).

To get all disambiguated tags, pass the parameter “one_tag=False”, e.g. tagged_paras(one_tag=False).

To get all tags that were assigned by the morphological analyzer, pass the parameter “disamb_only=False”, e.g. tagged_words(disamb_only=False).

The IPIPAN Corpus contains tags indicating whether there is a space between two tokens. To add special “no space” markers, pass the parameter “append_no_space=True”, e.g. tagged_words(append_no_space=True). As a result, wherever there should be no space between two tokens, a new pair ('', 'no-space') is inserted (for tagged data), or just '' for methods without tags.

The corpus reader can also try to append spaces between words. To enable this option, pass the parameter “append_space=True”, e.g. words(append_space=True). As a result, either ' ' or (' ', 'space') is inserted between tokens.

By default, XML entities like &quot; and &amp; are replaced by the corresponding characters. You can turn off this feature by passing the parameter “replace_xmlentities=False”, e.g. words(replace_xmlentities=False).
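The marker behaviour described for “append_no_space” and “append_space” can be pictured with a small standalone sketch. It does not call NLTK and is not the reader's actual implementation: the function name with_markers, the (word, tag, glued) input triples and the sample sentence are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical input format: (word, tag, glued), where glued is True when
# the corpus marks "no space" between this token and the previous one.
Token = Tuple[str, str, bool]


def with_markers(
    tokens: List[Token],
    append_no_space: bool = False,
    append_space: bool = False,
) -> List[Tuple[str, str]]:
    """Insert ('', 'no-space') or (' ', 'space') pairs between tagged tokens."""
    out: List[Tuple[str, str]] = []
    for index, (word, tag, glued) in enumerate(tokens):
        if index > 0:
            if glued and append_no_space:
                out.append(("", "no-space"))
            elif not glued and append_space:
                out.append((" ", "space"))
        out.append((word, tag))
    return out


# Illustrative sentence "Ala ma kota." with simplified tags; the final
# full stop is glued to the preceding word.
sample: List[Token] = [
    ("Ala", "subst", False),
    ("ma", "fin", False),
    ("kota", "subst", False),
    (".", "interp", True),
]

no_space_view = with_markers(sample, append_no_space=True)
space_view = with_markers(sample, append_space=True)
```

For untagged methods the same idea would apply with bare '' or ' ' strings in place of the pairs.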
- __init__(root, fileids)
- Parameters:
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
  - A string: encoding is the encoding name for all files.
  - A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
  - A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
  - None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
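The four accepted forms of encoding can be read as a small lookup procedure. The sketch below restates those rules in standalone Python; resolve_encoding is a hypothetical helper written for illustration, not an NLTK function, and the use of re.match (which anchors at the start of the identifier) is an assumption about what “matches” means here. A return value of None stands for “process as byte strings”.

```python
import re
from typing import Dict, List, Optional, Tuple, Union

EncodingSpec = Union[None, str, Dict[str, str], List[Tuple[str, str]]]


def resolve_encoding(encoding: EncodingSpec, file_id: str) -> Optional[str]:
    """Return the encoding name for file_id, or None for raw byte strings."""
    # None and plain strings apply to every file unchanged.
    if encoding is None or isinstance(encoding, str):
        return encoding
    # Dictionary: look the identifier up directly; missing keys mean bytes.
    if isinstance(encoding, dict):
        return encoding.get(file_id)
    # List of (regexp, encoding) tuples: the first matching tuple wins.
    for regexp, name in encoding:
        if re.match(regexp, file_id):  # assumption: match from the start
            return name
    return None


# Illustrative file identifiers; the paths are made up for this example.
rules = [(r".*\.xml", "utf8"), (r".*", "latin2")]
xml_encoding = resolve_encoding(rules, "a/morph.xml")
other_encoding = resolve_encoding(rules, "a/notes.txt")
```

Because the first matching tuple wins, more specific patterns should come before catch-all ones in the list form.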
- class nltk.corpus.reader.ipipan.IPIPANCorpusView
Bases: StreamBackedCorpusView
- PARAS_MODE = 2
- SENTS_MODE = 1
- WORDS_MODE = 0
- __init__(filename, startpos=0, **kwargs)
Create a new corpus view, based on the file fileid, and read with block_reader. See the class documentation for more information.
- Parameters:
fileid – The path to the file that is read by this corpus view. fileid can either be a string or a PathPointer.
startpos – The file position at which the view will start reading. This can be used to skip over preface sections.
encoding – The unicode encoding that should be used to read the file’s contents. If no encoding is specified, then the file’s contents will be read as a non-unicode string (i.e., a str).
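The effect of startpos and encoding can be illustrated without NLTK by seeking into a file before reading it. This is only a sketch of the idea: read_from is a hypothetical helper, the temporary file and its contents are made up, and returning bytes when no encoding is given is an assumption about how the “non-unicode string” case looks under Python 3.

```python
import os
import tempfile
from typing import Optional, Union


def read_from(
    path: str, startpos: int = 0, encoding: Optional[str] = None
) -> Union[bytes, str]:
    """Read a file from byte offset startpos, decoding only if asked to."""
    with open(path, "rb") as stream:
        stream.seek(startpos)  # skip everything before startpos
        data = stream.read()
    return data if encoding is None else data.decode(encoding)


# Build a small demo file: a preface to skip, then the content of interest.
preface = "<!-- preface -->\n".encode("utf-8")
body = "<tok>zażółć</tok>\n".encode("utf-8")
with tempfile.NamedTemporaryFile(delete=False, suffix=".xml") as handle:
    handle.write(preface + body)
    demo_path = handle.name

try:
    decoded = read_from(demo_path, startpos=len(preface), encoding="utf-8")
    raw = read_from(demo_path, startpos=len(preface))
finally:
    os.remove(demo_path)
```

Note that startpos is a byte offset in this sketch, so it is computed from the encoded preface rather than from its character count.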