nltk.corpus.reader.plaintext module

A reader for corpora that consist of plaintext documents.

class nltk.corpus.reader.plaintext.PlaintextCorpusReader[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.

This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the CorpusView class variable.

CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

__init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object>, para_block_reader=<function read_blankline_block>, encoding='utf8')[source]

Construct a new plaintext corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/usr/local/share/nltk_data/corpora/webtext/'
>>> reader = PlaintextCorpusReader(root, '.*\.txt') 
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

  • word_tokenizer – Tokenizer for breaking sentences or paragraphs into words.

  • sent_tokenizer – Tokenizer for breaking paragraphs into words.

  • para_block_reader – The block reader used to divide the corpus into paragraph blocks.

words(fileids=None)[source]
Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

sents(fileids=None)[source]
Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type

list(list(str))

paras(fileids=None)[source]
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

Return type

list(list(list(str)))

class nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.plaintext.PlaintextCorpusReader

A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.

__init__(*args, **kwargs)[source]

Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the PlaintextCorpusReader constructor.

class nltk.corpus.reader.plaintext.PortugueseCategorizedPlaintextCorpusReader[source]

Bases: nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader

__init__(*args, **kwargs)[source]

Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the PlaintextCorpusReader constructor.

class nltk.corpus.reader.plaintext.EuroparlCorpusReader[source]

Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader

Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. Everything is inherited from PlaintextCorpusReader except that:

  • Since the corpus is pre-processed and pre-tokenized, the word tokenizer should just split the line at whitespaces.

  • For the same reason, the sentence tokenizer should just split the paragraph at line breaks.

  • There is a new ‘chapters()’ method that returns chapters instead instead of paragraphs.

  • The ‘paras()’ method inherited from PlaintextCorpusReader is made non-functional to remove any confusion between chapters and paragraphs for Europarl.

chapters(fileids=None)[source]
Returns

the given file(s) as a list of chapters, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

Return type

list(list(list(str)))

paras(fileids=None)[source]
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

Return type

list(list(list(str)))