nltk.corpus.reader.categorized_sents module

CorpusReader structured for corpora that contain one instance on each row. This CorpusReader is specifically used for the Subjectivity Dataset and the Sentence Polarity Dataset.

Subjectivity Dataset information -

Authors: Bo Pang and Lillian Lee. Url: https://www.cs.cornell.edu/people/pabo/movie-review-data

Distributed with permission.

Related papers:

Bo Pang and Lillian Lee. “Seeing stars: Exploiting class relationships for
sentiment categorization with respect to rating scales”. Proceedings of the ACL, 2005.

class nltk.corpus.reader.categorized_sents.CategorizedSentencesCorpusReader

Bases: CategorizedCorpusReader, CorpusReader

A reader for corpora in which each row represents a single instance, usually a sentence. Instances are divided into categories based on their file identifiers (see CategorizedCorpusReader). Since many corpora allow rows that contain more than one sentence, a sentence tokenizer can be specified to retrieve all sentences instead of all rows.

Examples using the Subjectivity Dataset:

>>> from nltk.corpus import subjectivity
>>> subjectivity.sents()[23] 
['television', 'made', 'him', 'famous', ',', 'but', 'his', 'biggest', 'hits',
'happened', 'off', 'screen', '.']
>>> subjectivity.categories()
['obj', 'subj']
>>> subjectivity.words(categories='subj')
['smart', 'and', 'alert', ',', 'thirteen', ...]

Examples using the Sentence Polarity Dataset:

>>> from nltk.corpus import sentence_polarity
>>> sentence_polarity.sents() 
[['simplistic', ',', 'silly', 'and', 'tedious', '.'], ["it's", 'so', 'laddish',
'and', 'juvenile', ',', 'only', 'teenage', 'boys', 'could', 'possibly', 'find',
'it', 'funny', '.'], ...]
>>> sentence_polarity.categories()
['neg', 'pos']

CorpusView: alias of StreamBackedCorpusView

__init__(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), sent_tokenizer=None, encoding='utf8', **kwargs)

Parameters

root – The root directory for the corpus.
fileids – a list or regexp specifying the fileids in the corpus.
word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WhitespaceTokenizer
sent_tokenizer – a tokenizer for breaking paragraphs into sentences.
encoding – the encoding that should be used to read the corpus.
kwargs – additional parameters passed to CategorizedCorpusReader.
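As a minimal sketch of how these parameters fit together, the block below builds a reader over a small throwaway corpus. The file names (`reviews.neg`, `reviews.pos`), their contents, and the helper `build_toy_reader` are hypothetical and exist only for illustration; `cat_pattern` is one of the keyword arguments that CategorizedCorpusReader accepts through `kwargs`.

```python
import os
import tempfile

from nltk.corpus.reader.categorized_sents import CategorizedSentencesCorpusReader

# Hypothetical toy corpus: one pre-tokenized instance per row, one file per category.
TOY_FILES = {
    "reviews.neg": "simplistic , silly and tedious .\na dull plot .\n",
    "reviews.pos": "a warm and funny film .\n",
}


def build_toy_reader(root):
    """Write the toy files under `root` and return a reader over them."""
    for name, text in TOY_FILES.items():
        with open(os.path.join(root, name), "w", encoding="utf8") as handle:
            handle.write(text)
    return CategorizedSentencesCorpusReader(
        root,
        r"reviews\.(neg|pos)",              # regexp selecting the fileids
        cat_pattern=r"reviews\.(neg|pos)",  # forwarded to CategorizedCorpusReader
        encoding="utf8",
    )


with tempfile.TemporaryDirectory() as root:
    reader = build_toy_reader(root)
    fileids = reader.fileids()
    categories = reader.categories()
    # Materialize the lazy corpus view while the files still exist.
    neg_sents = [list(sent) for sent in reader.sents(categories="neg")]
```

Here the category of each file is taken from the first group of `cat_pattern`, so `reviews.neg` falls under `neg` and `reviews.pos` under `pos`. The default `word_tokenizer` splits on whitespace, which suits rows that are already tokenized.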

sents(fileids=None, categories=None)

Return all sentences in the corpus or in the specified file(s).

Parameters

fileids – a list or regexp specifying the ids of the files whose sentences should be returned.
categories – a list specifying the categories whose sentences should be returned.

Returns

the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.

Return type

list(list(str))
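The difference between returning rows and returning sentences can be sketched as follows. The file `notes.txt`, its single row, and the regular-expression sentence splitter are assumptions made for this example; any object with a `tokenize(text)` method could serve as `sent_tokenizer`.

```python
import os
import tempfile

from nltk.corpus.reader.categorized_sents import CategorizedSentencesCorpusReader
from nltk.tokenize import RegexpTokenizer

# Hypothetical single-file corpus whose only row holds two sentences.
ROW = "the plot is thin . the acting is superb !\n"

# Assumed splitter for this sketch: break on whitespace that follows ., ! or ?
splitter = RegexpTokenizer(r"(?<=[.!?])\s+", gaps=True)

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "notes.txt"), "w", encoding="utf8") as handle:
        handle.write(ROW)

    # A category mapping is required by CategorizedCorpusReader.
    common = dict(cat_map={"notes.txt": ["obj"]}, encoding="utf8")
    by_row = CategorizedSentencesCorpusReader(root, r"notes\.txt", **common)
    by_sentence = CategorizedSentencesCorpusReader(
        root, r"notes\.txt", sent_tokenizer=splitter, **common
    )

    # Without a sentence tokenizer, each row is one item.
    rows = [list(sent) for sent in by_row.sents()]
    # With one, each row is first split into sentences.
    sentences = [list(sent) for sent in by_sentence.sents()]
```

With no `sent_tokenizer`, `sents()` yields one token list per row, so the row above comes back as a single ten-token item. With the splitter, the same row yields two token lists.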

words(fileids=None, categories=None)

Return all words and punctuation symbols in the corpus or in the specified file(s).

Parameters

fileids – a list or regexp specifying the ids of the files whose words should be returned.
categories – a list specifying the categories whose words should be returned.

Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)
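One possible use of `words()` is counting tokens per category. The sketch below assumes a hypothetical two-file corpus (`toy.obj`, `toy.subj`) written to a temporary directory; the counting itself uses the standard library rather than any NLTK helper.

```python
import os
import tempfile
from collections import Counter

from nltk.corpus.reader.categorized_sents import CategorizedSentencesCorpusReader

with tempfile.TemporaryDirectory() as root:
    # Hypothetical two-category corpus, one instance per row.
    samples = {
        "toy.subj": "a lovely , lovely film .\n",
        "toy.obj": "the film runs two hours .\n",
    }
    for name, text in samples.items():
        with open(os.path.join(root, name), "w", encoding="utf8") as handle:
            handle.write(text)

    reader = CategorizedSentencesCorpusReader(
        root, r"toy\.(obj|subj)", cat_pattern=r"toy\.(obj|subj)"
    )
    # Words restricted to one category.
    subj_words = list(reader.words(categories="subj"))
    # Words from every file in the corpus.
    all_words = list(reader.words())
    subj_counts = Counter(subj_words)
```

Punctuation symbols are returned as tokens alongside words, so they appear in the counts unless filtered out.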
