nltk.corpus.reader.plaintext module
A reader for corpora that consist of plaintext documents.
- class nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader
Bases: CategorizedCorpusReader, PlaintextCorpusReader
A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.
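To illustrate categories derived from file identifiers, the following sketch builds a throwaway corpus in a temporary directory. The directory layout, the file names, and the regular expressions are assumptions made for this example only. It relies on the cat_pattern keyword argument of CategorizedCorpusReader, whose first regex group is taken as the category of each file identifier.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader

# Hypothetical layout: one subdirectory per category.
root = tempfile.mkdtemp()
samples = {
    "pos/review1.txt": "A fine film.\n",
    "neg/review2.txt": "A dull film.\n",
}
for fileid, text in samples.items():
    path = os.path.join(root, *fileid.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf8") as f:
        f.write(text)

# The first group of cat_pattern is matched against each fileid
# (e.g. "pos/review1.txt") to obtain its category.
reader = CategorizedPlaintextCorpusReader(
    root, r".*\.txt", cat_pattern=r"(\w+)/.*"
)

categories = reader.categories()
pos_fileids = reader.fileids(categories="pos")
neg_words = list(reader.words(categories="neg"))
print(categories, pos_fileids, neg_words)
```

Selecting by category is a shorthand for selecting the file identifiers that map to it, so the same reader can still be queried by fileids directly.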
- class nltk.corpus.reader.plaintext.EuroparlCorpusReader
Bases: PlaintextCorpusReader
Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. Everything is inherited from PlaintextCorpusReader except that:
Since the corpus is pre-processed and pre-tokenized, the word tokenizer should just split the line at whitespaces.
For the same reason, the sentence tokenizer should just split the paragraph at line breaks.
There is a new ‘chapters()’ method that returns chapters instead of paragraphs.
The ‘paras()’ method inherited from PlaintextCorpusReader is made non-functional to remove any confusion between chapters and paragraphs for Europarl.
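The differences listed above can be sketched on a tiny, made-up sample. The file name, its contents, and the fileid pattern are assumptions for illustration; real Europarl files are far larger. The sample follows the format described above: pre-tokenized text, one sentence per line, with a blank line between chapters.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import EuroparlCorpusReader

# Hypothetical pre-tokenized sample with two chapters.
text = (
    "Resumption of the session\n"
    "I declare the session resumed .\n"
    "\n"
    "Agenda\n"
    "The next item is the vote .\n"
)
root = tempfile.mkdtemp()
with open(os.path.join(root, "ep-sample.en"), "w", encoding="utf8") as f:
    f.write(text)

reader = EuroparlCorpusReader(root, r".*\.en")

# Each chapter is a list of sentences; each sentence is a list of words
# obtained by splitting the line on whitespace.
chapters = list(reader.chapters())
words = list(reader.words())

# paras() is intentionally unusable for this reader.
try:
    reader.paras()
    paras_disabled = False
except NotImplementedError:
    paras_disabled = True

print(len(chapters), words[:4], paras_disabled)
```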
- class nltk.corpus.reader.plaintext.PlaintextCorpusReader
Bases: CorpusReader
Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.
This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the CorpusView class variable.
- CorpusView
The corpus view class used by this reader. Subclasses of PlaintextCorpusReader may specify alternative corpus view classes (e.g., to skip the preface sections of documents).
alias of StreamBackedCorpusView
- __init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=None, para_block_reader=<function read_blankline_block>, encoding='utf8')
Construct a new plaintext corpus reader for a set of documents located at the given root directory. Example usage:
>>> from nltk.corpus.reader.plaintext import PlaintextCorpusReader
>>> root = '/usr/local/share/nltk_data/corpora/webtext/'
>>> reader = PlaintextCorpusReader(root, r'.*\.txt')
- Parameters:
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
word_tokenizer – Tokenizer for breaking sentences or paragraphs into words.
sent_tokenizer – Tokenizer for breaking paragraphs into sentences.
para_block_reader – The block reader used to divide the corpus into paragraph blocks.
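As a minimal sketch of how the fileids and word_tokenizer parameters interact, the example below reads a temporary directory twice: once with the default word tokenizer and once with WhitespaceTokenizer from nltk.tokenize. The file names and contents are invented for the example.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import WhitespaceTokenizer

# Hypothetical corpus directory with one matching and one non-matching file.
root = tempfile.mkdtemp()
with open(os.path.join(root, "note.txt"), "w", encoding="utf8") as f:
    f.write("Hello, world. It's me.\n")
with open(os.path.join(root, "ignored.md"), "w", encoding="utf8") as f:
    f.write("Not part of the corpus.\n")

# fileids given as a regexp: only files whose names match are included.
default_reader = PlaintextCorpusReader(root, r".*\.txt")
whitespace_reader = PlaintextCorpusReader(
    root, r".*\.txt", word_tokenizer=WhitespaceTokenizer()
)

fileids = default_reader.fileids()
# The default tokenizer separates punctuation from alphanumeric runs.
default_words = list(default_reader.words())
# The whitespace tokenizer keeps punctuation attached to words.
whitespace_words = list(whitespace_reader.words())

print(fileids)
print(default_words)
print(whitespace_words)
```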
- paras(fileids=None)
- Returns:
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
- Return type:
list(list(list(str)))
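The nested return type is easiest to see on a small input. In the sketch below, the file contents are invented, and LineTokenizer is passed as sent_tokenizer purely as an assumption for the example: it treats each line as one sentence, so the result does not depend on any sentence-tokenizer model data being installed.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Hypothetical document: two paragraphs separated by a blank line,
# one sentence per line.
root = tempfile.mkdtemp()
with open(os.path.join(root, "pets.txt"), "w", encoding="utf8") as f:
    f.write("The cat sat.\nIt purred.\n\nA dog barked.\n")

reader = PlaintextCorpusReader(
    root, r".*\.txt", sent_tokenizer=LineTokenizer()
)

# paragraphs -> sentences -> word strings
paras = list(reader.paras())
# sents() flattens away the paragraph level.
sents = list(reader.sents())

for i, para in enumerate(paras):
    print(i, para)
```

With a different sentence tokenizer the sentence boundaries inside each paragraph may differ, but the paragraph boundaries still come from the blank lines.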
- class nltk.corpus.reader.plaintext.PortugueseCategorizedPlaintextCorpusReader
Bases: CategorizedPlaintextCorpusReader
This class is identical to CategorizedPlaintextCorpusReader, except that it initializes a Portuguese PunktTokenizer:
>>> from nltk.corpus import machado
>>> print(machado._sent_tokenizer._lang)
portuguese