nltk.corpus.reader.pros_cons module

CorpusReader for the Pros and Cons dataset.

Pros and Cons dataset information:

Contact: Bing Liu, liub@cs.uic.edu (https://www.cs.uic.edu/~liub)

Distributed with permission.

Related papers:

Murthy Ganapathibhotla and Bing Liu. “Mining Opinions in Comparative Sentences”. Proceedings of the 22nd International Conference on Computational Linguistics (Coling-2008), Manchester, 18-22 August 2008.

Bing Liu, Minqing Hu and Junsheng Cheng. “Opinion Observer: Analyzing and Comparing Opinions on the Web”. Proceedings of the 14th International World Wide Web Conference (WWW-2005), Chiba, Japan, 10-14 May 2005.

class nltk.corpus.reader.pros_cons.ProsConsCorpusReader

Bases: CategorizedCorpusReader, CorpusReader

Reader for the Pros and Cons sentence dataset.

>>> from nltk.corpus import pros_cons
>>> pros_cons.sents(categories='Cons') 
[['East', 'batteries', '!', 'On', '-', 'off', 'switch', 'too', 'easy',
'to', 'maneuver', '.'], ['Eats', '...', 'no', ',', 'GULPS', 'batteries'],
...]
>>> pros_cons.words('IntegratedPros.txt')
['Easy', 'to', 'use', ',', 'economical', '!', ...]

CorpusView: alias of StreamBackedCorpusView

__init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), encoding='utf8', **kwargs)

Parameters:

root – The root directory for the corpus.
fileids – a list or regexp specifying the fileids in the corpus.
word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WordPunctTokenizer
encoding – the encoding that should be used to read the corpus.
kwargs – additional parameters passed to CategorizedCorpusReader.
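
The default word_tokenizer in the signature above uses the pattern `\w+|[^\w\s]+`, so runs of word characters and runs of punctuation come out as separate tokens. The following is a minimal sketch of that splitting behaviour using only the standard `re` module. `tokenize_like_default` is a hypothetical helper written for illustration and is not part of NLTK.

```python
import re

# Pattern and flags copied from the default word_tokenizer in the signature.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+", re.UNICODE | re.MULTILINE | re.DOTALL)


def tokenize_like_default(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    return WORD_PUNCT.findall(text)


print(tokenize_like_default("Easy to use, economical!"))
print(tokenize_like_default("Eats...no, GULPS batteries"))
```

Passing a different tokenizer object as word_tokenizer changes how each sentence is split. A whitespace-based tokenizer, for example, would leave punctuation attached to the neighbouring word.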

sents(fileids=None, categories=None)

Return all sentences in the corpus or in the specified files/categories.

Parameters:

fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
categories – a list specifying the categories whose sentences have to be returned.

Returns:

the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.

Return type:

list(list(str))
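
The sketch below calls sents() on a small hand-made corpus instead of the downloaded dataset, selecting sentences once by category and once by file id. It rests on several assumptions: that NLTK is installed, that each line of a corpus file wraps one sentence in `<Pros>...</Pros>` or `<Cons>...</Cons>` tags, and that categories are derived from file names through the `cat_pattern` keyword forwarded to CategorizedCorpusReader. The sample sentences and the `load_demo_sentences` helper are made up for illustration.

```python
import os
import tempfile

from nltk.corpus.reader.pros_cons import ProsConsCorpusReader

# Made-up sample data; the one-tagged-sentence-per-line layout is an
# assumption about the corpus file format.
SAMPLE_FILES = {
    "IntegratedPros.txt": "<Pros>Easy to use, economical!</Pros>\n<Pros>Great zoom</Pros>\n",
    "IntegratedCons.txt": "<Cons>Eats batteries.</Cons>\n",
}


def load_demo_sentences():
    """Hypothetical helper: write the sample files and read them back."""
    with tempfile.TemporaryDirectory() as root:
        for name, text in SAMPLE_FILES.items():
            with open(os.path.join(root, name), "w", encoding="utf8") as handle:
                handle.write(text)
        reader = ProsConsCorpusReader(
            root,
            r"Integrated(Cons|Pros)\.txt",              # fileids as a regexp
            cat_pattern=r"Integrated(Cons|Pros)\.txt",  # category from the file name
        )
        # Copy into plain lists before the temporary directory is removed.
        pros = [list(sent) for sent in reader.sents(categories="Pros")]
        cons = [list(sent) for sent in reader.sents("IntegratedCons.txt")]
    return pros, cons


pros, cons = load_demo_sentences()
print(pros)
print(cons)
```

The results are copied into plain lists because the reader returns lazy, stream-backed views that read from the files on demand.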

words(fileids=None, categories=None)

Return all words and punctuation symbols in the corpus or in the specified files/categories.

Parameters:

fileids – a list or regexp specifying the ids of the files whose words have to be returned.
categories – a list specifying the categories whose words have to be returned.

Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)
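
As the two return types suggest, the flat word list can be thought of as the tokenized sentences joined end to end. The standard-library sketch below illustrates that relationship on literal sample data. It is an assumption about how the two views relate rather than NLTK's implementation, and `flatten_sentences` is a hypothetical name.

```python
from itertools import chain


def flatten_sentences(sentences):
    """Hypothetical helper: join tokenized sentences into one flat token list."""
    return list(chain.from_iterable(sentences))


# Sample tokenized sentences in the list(list(str)) shape that sents() returns.
sample_sents = [
    ["Easy", "to", "use", ",", "economical", "!"],
    ["Eats", "...", "no", ",", "GULPS", "batteries"],
]
print(flatten_sentences(sample_sents))
```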
