nltk.corpus.reader.indian module

Indian Language POS-Tagged Corpus. Collected by A Kumaran, Microsoft Research, India. Distributed with permission.

Contents:
  • Bangla: IIT Kharagpur

  • Hindi: Microsoft Research India

  • Marathi: IIT Bombay

  • Telugu: IIIT Hyderabad

class nltk.corpus.reader.indian.IndianCorpusReader

Bases: nltk.corpus.reader.api.CorpusReader

Reader for the Indian-language POS-tagged corpus. The methods below return its contents as words, tagged words, sentences, or tagged sentences.

words(fileids=None)

tagged_words(fileids=None, tagset=None)

sents(fileids=None)

tagged_sents(fileids=None, tagset=None)

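
The four access methods above can be tried on a small hand-made file without downloading the real corpus. The sketch below is illustrative only: the sample text, the file name sample.pos, and the helper read_sample_corpus are hypothetical, and it assumes that each line holds one sentence of word_TAG tokens. It requires NLTK to be installed.

```python
import os
import tempfile

# Hypothetical sample data: one sentence per line, tokens written as word_TAG.
SAMPLE = "the_DT cat_NN sleeps_VB\na_DT dog_NN barks_VB\n"


def read_sample_corpus():
    """Write SAMPLE to a temporary directory and read it back with IndianCorpusReader."""
    # Imported here so this file still loads when NLTK is not installed.
    from nltk.corpus.reader.indian import IndianCorpusReader

    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "sample.pos")
        with open(path, "w", encoding="utf8") as handle:
            handle.write(SAMPLE)

        # The second argument is a regular expression selecting the corpus files.
        reader = IndianCorpusReader(root, r".*\.pos", encoding="utf8")

        # Corpus views are lazy, so copy them into plain lists
        # before the temporary directory is deleted.
        return {
            "words": list(reader.words()),
            "tagged_words": list(reader.tagged_words()),
            "sents": [list(sent) for sent in reader.sents()],
            "tagged_sents": [list(sent) for sent in reader.tagged_sents()],
        }
```

Calling read_sample_corpus() returns a dictionary with one entry per method: "words" and "tagged_words" are flat lists over the whole file, while "sents" and "tagged_sents" hold one inner list per line of the sample.
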
class nltk.corpus.reader.indian.IndianCorpusView

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

__init__(corpus_file, encoding, tagged, group_by_sent, tag_mapping_function=None)

Create a new corpus view based on the file corpus_file, read with this class's read_block() method. See the class documentation for more information. Unlike StreamBackedCorpusView, this constructor takes no startpos argument.

Parameters
  • corpus_file – The path to the file that is read by this corpus view. corpus_file can be either a string or a PathPointer.

  • encoding – The unicode encoding that should be used to read the file’s contents.

  • tagged – If true, tokens are returned as (word, tag) tuples; otherwise only the words are returned.

  • group_by_sent – If true, tokens are grouped into one list per sentence; otherwise they are returned as a single flat list.

  • tag_mapping_function – An optional function applied to each tag, for example to map it to another tagset.

read_block(stream)

Read a block from the input stream.

Returns

a block of tokens from the input stream

Return type

list(any)

Parameters

stream (stream) – an input stream
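
The shape of the block returned by read_block depends on the tagged and group_by_sent arguments given to the constructor. The following standard-library sketch illustrates that relationship; it is not NLTK's implementation. The function name parse_tagged_line is hypothetical, and the sketch assumes that tokens are written as word_TAG and that lines beginning with "<" are markup carrying no tokens.

```python
def parse_tagged_line(line, tagged=True, group_by_sent=False,
                      tag_mapping_function=None):
    """Turn one line of word_TAG tokens into a block of tokens (illustrative only)."""
    # Assumption: markup lines begin with "<" and contribute no tokens.
    if line.startswith("<"):
        return []

    sent = []
    for token in line.split():
        # Split on the last underscore so words containing "_" stay intact.
        word, sep, tag = token.rpartition("_")
        if not sep:
            # No separator found: keep the whole token as the word, with no tag.
            word, tag = token, None
        sent.append((word, tag))

    if tag_mapping_function is not None:
        # Apply the mapping to every tag that is present.
        sent = [(w, tag_mapping_function(t) if t is not None else None)
                for w, t in sent]
    if not tagged:
        # Drop the tags and keep only the words.
        sent = [w for w, _ in sent]

    # One list per sentence, or a flat list of tokens.
    return [sent] if group_by_sent else sent


line = "the_DT cat_NN sleeps_VB"
print(parse_tagged_line(line, tagged=False))
# ['the', 'cat', 'sleeps']
print(parse_tagged_line(line, group_by_sent=True))
# [[('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VB')]]
```

With group_by_sent set, each line yields a block containing exactly one sentence, which is why a sentence-level view and a word-level view can share the same reading logic.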