nltk.corpus.reader.plaintext module
A reader for corpora that consist of plaintext documents.
- class nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader
Bases: CategorizedCorpusReader, PlaintextCorpusReader
A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.
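As a rough illustration of category assignment from file identifiers, the sketch below builds a tiny throwaway corpus in a temporary directory. The file names and texts are made up for the example, and the `cat_pattern` keyword comes from `CategorizedCorpusReader` rather than from this section, so treat the exact pattern as an assumption about how a corpus might be laid out.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader

# Hypothetical corpus layout: one subdirectory per category.
root = tempfile.mkdtemp()
sample_files = {
    "pos/a.txt": "A good film.\n",
    "neg/b.txt": "A dull film.\n",
}
for fileid, text in sample_files.items():
    path = os.path.join(root, *fileid.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf8") as f:
        f.write(text)

# Group 1 of cat_pattern, matched against each file identifier,
# is taken as that file's category.
reader = CategorizedPlaintextCorpusReader(
    root, r".*\.txt", cat_pattern=r"(pos|neg)/.*"
)

categories = reader.categories()
pos_fileids = reader.fileids(categories="pos")
neg_words = list(reader.words(categories="neg"))
```

Here the category is read from the directory part of each file identifier; a corpus that encodes categories differently would need a different pattern.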
- class nltk.corpus.reader.plaintext.EuroparlCorpusReader
Bases: PlaintextCorpusReader
Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters rather than into paragraphs, as regular plaintext documents are. Chapters are separated by blank lines. Everything is inherited from PlaintextCorpusReader except that:
- Since the corpus is pre-processed and pre-tokenized, the word tokenizer simply splits each line at whitespace.
- For the same reason, the sentence tokenizer simply splits each chapter at line breaks.
- There is a new chapters() method that returns chapters instead of paragraphs.
- The paras() method inherited from PlaintextCorpusReader is made non-functional to remove any confusion between chapters and paragraphs for Europarl.
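The differences listed above can be seen on a small hand-written sample. This is a minimal sketch: the file name and text are invented, and the pre-tokenized layout (one sentence per line, tokens separated by spaces) is assumed to match what the reader expects.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import EuroparlCorpusReader

# Hypothetical pre-tokenized sample: two chapters separated by a blank line.
root = tempfile.mkdtemp()
with open(os.path.join(root, "ep-sample.txt"), "w", encoding="utf8") as f:
    f.write(
        "Resumption of the session\n"
        "I declare the session resumed .\n"
        "\n"
        "Agenda\n"
        "The next item is the vote .\n"
    )

reader = EuroparlCorpusReader(root, r".*\.txt")

# Each chapter is a list of sentences; each sentence is a list of tokens.
chapters = list(reader.chapters())
words = list(reader.words())

# paras() is deliberately disabled for this reader.
try:
    reader.paras()
    paras_supported = True
except NotImplementedError:
    paras_supported = False
```

Note that the final period stays a separate token only because the sample text already has a space before it; the reader itself does no punctuation splitting.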
- class nltk.corpus.reader.plaintext.PlaintextCorpusReader
Bases: CorpusReader
Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.
This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the CorpusView class variable.
- CorpusView
The corpus view class used by this reader. Subclasses of PlaintextCorpusReader may specify alternative corpus view classes (e.g., to skip the preface sections of documents).
alias of StreamBackedCorpusView
- __init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=None, para_block_reader=<function read_blankline_block>, encoding='utf8')
Construct a new plaintext corpus reader for a set of documents located at the given root directory. Example usage:
>>> from nltk.corpus.reader.plaintext import PlaintextCorpusReader
>>> root = '/usr/local/share/nltk_data/corpora/webtext/'
>>> reader = PlaintextCorpusReader(root, r'.*\.txt')
- Parameters:
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
word_tokenizer – Tokenizer for breaking sentences or paragraphs into words.
sent_tokenizer – Tokenizer for breaking paragraphs into sentences.
para_block_reader – The block reader used to divide the corpus into paragraph blocks.
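To make the constructor parameters concrete, here is a small sketch that passes both tokenizers explicitly instead of relying on the defaults. The corpus is a temporary directory with made-up files, and the choice of `WhitespaceTokenizer` and `LineTokenizer` is only an assumption suited to text that has one sentence per line.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import LineTokenizer, WhitespaceTokenizer

# Hypothetical corpus: one matching .txt file and one file the regexp skips.
root = tempfile.mkdtemp()
with open(os.path.join(root, "notes.txt"), "w", encoding="utf8") as f:
    f.write("Don't panic.\nBring a towel.\n")
with open(os.path.join(root, "README"), "w", encoding="utf8") as f:
    f.write("not part of the corpus\n")

reader = PlaintextCorpusReader(
    root,
    r".*\.txt",                           # fileids given as a regexp
    word_tokenizer=WhitespaceTokenizer(),  # split words at whitespace only
    sent_tokenizer=LineTokenizer(),        # treat every line as one sentence
)

fileids = reader.fileids()
words = list(reader.words("notes.txt"))
sents = list(reader.sents())
```

With the default `WordPunctTokenizer`, punctuation would be split off into separate tokens; the whitespace tokenizer used here keeps `panic.` and `towel.` intact.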
- paras(fileids=None)
- Returns:
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
- Return type:
list(list(list(str)))
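The nested return structure may be easier to see on a two-paragraph sample. In this sketch the file name and text are invented, fileids is passed as an explicit list, and a `LineTokenizer` is supplied as the sentence tokenizer on the assumption that each line holds one sentence.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Hypothetical document: two paragraphs separated by a blank line.
root = tempfile.mkdtemp()
with open(os.path.join(root, "doc.txt"), "w", encoding="utf8") as f:
    f.write("First sentence here.\nSecond one.\n\nNew paragraph.\n")

reader = PlaintextCorpusReader(
    root,
    ["doc.txt"],                     # fileids given as an explicit list
    sent_tokenizer=LineTokenizer(),  # one sentence per line (assumption)
)

# paragraphs -> sentences -> word strings
paras = list(reader.paras())
first_word = paras[0][0][0]
second_para_sentence_count = len(paras[1])
```

Indexing goes paragraph first, then sentence, then word, so `paras[0][1]` is the second sentence of the first paragraph.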