nltk.corpus.reader.categorized_sents module
A CorpusReader structured for corpora that contain one instance per row. This CorpusReader is specifically used for the Subjectivity Dataset and the Sentence Polarity Dataset.
Subjectivity Dataset information -
Authors: Bo Pang and Lillian Lee. URL: https://www.cs.cornell.edu/people/pabo/movie-review-data
Distributed with permission.
Related papers:
- Bo Pang and Lillian Lee. “A Sentimental Education: Sentiment Analysis Using
Subjectivity Summarization Based on Minimum Cuts”. Proceedings of the ACL, 2004.
Sentence Polarity Dataset information -
Authors: Bo Pang and Lillian Lee. URL: https://www.cs.cornell.edu/people/pabo/movie-review-data
Related papers:
- Bo Pang and Lillian Lee. “Seeing stars: Exploiting class relationships for
sentiment categorization with respect to rating scales”. Proceedings of the ACL, 2005.
- class nltk.corpus.reader.categorized_sents.CategorizedSentencesCorpusReader[source]
Bases: CategorizedCorpusReader, CorpusReader
A reader for corpora in which each row represents a single instance, typically a sentence. Instances are divided into categories based on their file identifiers (see CategorizedCorpusReader). Since many corpora allow rows that contain more than one sentence, it is possible to specify a sentence tokenizer to retrieve all sentences rather than all rows.
Examples using the Subjectivity Dataset:
>>> from nltk.corpus import subjectivity
>>> subjectivity.sents()[23]
['television', 'made', 'him', 'famous', ',', 'but', 'his', 'biggest', 'hits',
'happened', 'off', 'screen', '.']
>>> subjectivity.categories()
['obj', 'subj']
>>> subjectivity.words(categories='subj')
['smart', 'and', 'alert', ',', 'thirteen', ...]
Examples using the Sentence Polarity Dataset:
>>> from nltk.corpus import sentence_polarity
>>> sentence_polarity.sents()
[['simplistic', ',', 'silly', 'and', 'tedious', '.'],
["it's", 'so', 'laddish', 'and', 'juvenile', ',', 'only', 'teenage', 'boys',
'could', 'possibly', 'find', 'it', 'funny', '.'], ...]
>>> sentence_polarity.categories()
['neg', 'pos']
- CorpusView
alias of StreamBackedCorpusView
- __init__(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), sent_tokenizer=None, encoding='utf8', **kwargs)[source]
- Parameters:
root – The root directory for the corpus.
fileids – a list or regexp specifying the fileids in the corpus.
word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WhitespaceTokenizer
sent_tokenizer – a tokenizer for breaking paragraphs into sentences.
encoding – the encoding that should be used to read the corpus.
kwargs – additional parameters passed to CategorizedCorpusReader.
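The constructor parameters above can be exercised end to end. The sketch below builds a tiny two-file corpus in a temporary directory and reads it back; the file names, their contents, and the cat_pattern regexp (a CategorizedCorpusReader keyword, passed through kwargs) are illustrative assumptions, not part of this module.

```python
import os
import tempfile

from nltk.corpus.reader.categorized_sents import CategorizedSentencesCorpusReader

# Build a toy corpus: one instance per row, category encoded in the filename.
root = tempfile.mkdtemp()
with open(os.path.join(root, "pos.txt"), "w", encoding="utf8") as f:
    f.write("a charming , witty film .\n")
with open(os.path.join(root, "neg.txt"), "w", encoding="utf8") as f:
    f.write("simplistic , silly and tedious .\n")

reader = CategorizedSentencesCorpusReader(
    root,
    r".*\.txt",                 # fileids given as a regexp
    cat_pattern=r"(\w+)\.txt",  # category derived from the filename
)

print(sorted(reader.categories()))        # ['neg', 'pos']
print(reader.sents(categories="pos")[0])
```

With the default WhitespaceTokenizer, each row comes back as one pre-tokenized sentence, so sents(categories="pos")[0] is the word list of the first row in pos.txt.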
- sents(fileids=None, categories=None)[source]¶
Return all sentences in the corpus or in the specified file(s).
- Parameters:
fileids – a list or regexp specifying the ids of the files whose sentences should be returned.
categories – a list specifying the categories whose sentences should be returned.
- Returns:
the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.
- Return type:
list(list(str))
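When rows may contain more than one sentence, passing a sent_tokenizer makes sents() return one entry per sentence rather than one per row. A minimal sketch, assuming a simple regexp-based splitter (any object with a tokenize() method returning sentence strings would do); the corpus file and its contents are made up for illustration.

```python
import os
import tempfile

from nltk.corpus.reader.categorized_sents import CategorizedSentencesCorpusReader
from nltk.tokenize import RegexpTokenizer

root = tempfile.mkdtemp()
# One row holding two sentences.
with open(os.path.join(root, "obj.txt"), "w", encoding="utf8") as f:
    f.write("the film opens in paris . the plot thickens quickly .\n")

reader = CategorizedSentencesCorpusReader(
    root,
    r".*\.txt",
    cat_pattern=r"(\w+)\.txt",
    # Split each row on '.' before word tokenization; without this,
    # the whole row would come back as a single "sentence".
    sent_tokenizer=RegexpTokenizer(r"[^.]+\."),
)

sents = reader.sents()
print(len(sents))   # one entry per sentence, not per row
print(sents[0])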
- words(fileids=None, categories=None)[source]¶
Return all words and punctuation symbols in the corpus or in the specified file(s).
- Parameters:
fileids – a list or regexp specifying the ids of the files whose words have to be returned.
categories – a list specifying the categories whose words have to be returned.
- Returns:
the given file(s) as a list of words and punctuation symbols.
- Return type:
list(str)