nltk.corpus.reader.knbc module

class nltk.corpus.reader.knbc.KNBCorpusReader

Bases: SyntaxCorpusReader

This class implements:
  • __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in the corpus files.
  • _read_block, which reads a block from the input stream.
  • _word, which takes a block and returns a list of lists of words.
  • _tag, which takes a block and returns a list of lists of tagged words.
  • _parse, which takes a block and returns a list of parsed sentences.

The structure of tagged words:

tagged_word = (word(str), tags(tuple))
tags = (surface, reading, lemma, pos1, posid1, pos2, posid2, pos3, posid3, others …)
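
The layout above can be unpacked by position. The following is a minimal sketch, assuming the tags arrive exactly in the documented order; the helper describe_tags, the FIELDS tuple, and the sample values are hypothetical and made up for illustration, so compare against the output of knbc.tagged_words() in your installed NLTK version before relying on it.

```python
# Field names copied from the documented tags layout (everything after
# posid3 is collected under "others"). FIELDS and describe_tags are
# hypothetical helpers, not part of NLTK.
FIELDS = (
    "surface", "reading", "lemma",
    "pos1", "posid1", "pos2", "posid2", "pos3", "posid3",
)


def describe_tags(tags):
    """Map a tags tuple onto the documented field names."""
    named = dict(zip(FIELDS, tags))
    named["others"] = tuple(tags[len(FIELDS):])
    return named


# Invented sample data that only mimics the documented shape.
tagged_word = (
    "太郎",
    ("太郎", "たろう", "太郎", "名詞", "6", "人名", "5", "*", "0", "*", "0"),
)

word, tags = tagged_word
info = describe_tags(tags)
print(word, info["reading"], info["pos1"])
```

Naming the fields once keeps later code from depending on bare tuple indices.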

Usage example

>>> from nltk.corpus.reader.knbc import KNBCorpusReader
>>> from nltk.corpus.util import LazyCorpusLoader
>>> knbc = LazyCorpusLoader(
...     'knbc/corpus1',
...     KNBCorpusReader,
...     r'.*/KN.*',
...     encoding='euc-jp',
... )
>>> len(knbc.sents()[0])
9

__init__(root, fileids, encoding='utf8', morphs2str=<function <lambda>>)

Initialize KNBCorpusReader. morphs2str is a function that converts a list of morphs to a str, used for the tree representation built by _parse().
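
As an illustration of what a morphs2str callable might look like, here is a minimal sketch. It assumes each morph is a pair whose first element is the surface form and that an 'EOS' marker may appear in the list; both are assumptions, and join_surfaces and the sample morphs are hypothetical names and data, not part of NLTK.

```python
def join_surfaces(morphs, sep="+"):
    """Join the surface form of each morph into one label string.

    Assumes each morph is a (surface, tags) pair and skips an 'EOS'
    marker if one is present.
    """
    return sep.join(m[0] for m in morphs if m[0] != "EOS")


# Invented morph list; the second element of each pair is a placeholder.
morphs = [("太郎", "..."), ("は", "..."), ("EOS", "")]
label = join_surfaces(morphs)
print(label)

# Following the signature above, the callable would be passed as a
# keyword argument (root and fileids are placeholders here):
#   KNBCorpusReader(root, fileids, encoding='euc-jp', morphs2str=join_surfaces)
```

Because the reader calls the function with the morph list only, any extra parameters such as sep need default values.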

nltk.corpus.reader.knbc.demo()
nltk.corpus.reader.knbc.test()