nltk.corpus.reader.pros_cons module

CorpusReader for the Pros and Cons dataset.

Pros and Cons dataset information:

Contact: Bing Liu, liub@cs.uic.edu

https://www.cs.uic.edu/~liub

Distributed with permission.

Related papers:

  • Murthy Ganapathibhotla and Bing Liu. “Mining Opinions in Comparative Sentences”. Proceedings of the 22nd International Conference on Computational Linguistics (Coling-2008), Manchester, 18-22 August, 2008.

  • Bing Liu, Minqing Hu and Junsheng Cheng. “Opinion Observer: Analyzing and Comparing Opinions on the Web”. Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, in Chiba, Japan.

class nltk.corpus.reader.pros_cons.ProsConsCorpusReader[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.api.CorpusReader

Reader for the Pros and Cons sentence dataset.

>>> from nltk.corpus import pros_cons
>>> pros_cons.sents(categories='Cons')
[['East', 'batteries', '!', 'On', '-', 'off', 'switch', 'too', 'easy',
'to', 'maneuver', '.'], ['Eats', '...', 'no', ',', 'GULPS', 'batteries'],
...]
>>> pros_cons.words('IntegratedPros.txt')
['Easy', 'to', 'use', ',', 'economical', '!', ...]

CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

__init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), encoding='utf8', **kwargs)[source]
Parameters
  • root – The root directory for the corpus.

  • fileids – a list or regexp specifying the fileids in the corpus.

  • word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WordPunctTokenizer

  • encoding – the encoding that should be used to read the corpus.

  • kwargs – additional parameters passed to CategorizedCorpusReader.
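
The constructor parameters above can be sketched as follows. This is a minimal illustration, not the way NLTK itself loads the corpus: the temporary directory and the sample lines are made up, and the example assumes each line of a corpus file is wrapped in <Pros> or <Cons> tags. The cat_pattern keyword is one of the options that CategorizedCorpusReader accepts through kwargs.

```python
import os
import tempfile

from nltk.corpus.reader.pros_cons import ProsConsCorpusReader
from nltk.tokenize import WhitespaceTokenizer

# Hypothetical corpus directory with one made-up line per file.
root = tempfile.mkdtemp()
with open(os.path.join(root, "IntegratedPros.txt"), "w", encoding="utf8") as f:
    f.write("<Pros>Easy to use, economical!</Pros>\n")
with open(os.path.join(root, "IntegratedCons.txt"), "w", encoding="utf8") as f:
    f.write("<Cons>Eats batteries.</Cons>\n")

pattern = r"Integrated(Cons|Pros)\.txt"

# Default tokenizer (WordPunctTokenizer). cat_pattern is forwarded through
# **kwargs to CategorizedCorpusReader; its first group names the category.
default_reader = ProsConsCorpusReader(root, pattern, cat_pattern=pattern)

# Same files, but tokens are split on whitespace only.
whitespace_reader = ProsConsCorpusReader(
    root, pattern, word_tokenizer=WhitespaceTokenizer(), cat_pattern=pattern
)

default_words = list(default_reader.words("IntegratedPros.txt"))
whitespace_words = list(whitespace_reader.words("IntegratedPros.txt"))
print(default_words)
print(whitespace_words)
```

With the default tokenizer, punctuation becomes separate tokens; with WhitespaceTokenizer it stays attached to the preceding word.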

sents(fileids=None, categories=None)[source]

Return all sentences in the corpus or in the specified files/categories.

Parameters
  • fileids – a list or regexp specifying the ids of the files whose sentences are to be returned.

  • categories – a list specifying the categories whose sentences are to be returned.

Returns

the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.

Return type

list(list(str))
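
A short sketch of selecting sentences by category or by file id. The directory and sample lines are hypothetical stand-ins for the distributed corpus files, and the <Pros>/<Cons> line format is an assumption of this example.

```python
import os
import tempfile

from nltk.corpus.reader.pros_cons import ProsConsCorpusReader

# Made-up sample data written to a temporary directory.
root = tempfile.mkdtemp()
samples = {
    "IntegratedPros.txt": "<Pros>Easy to use, economical!</Pros>\n"
                          "<Pros>Great zoom.</Pros>\n",
    "IntegratedCons.txt": "<Cons>Eats batteries.</Cons>\n",
}
for name, text in samples.items():
    with open(os.path.join(root, name), "w", encoding="utf8") as f:
        f.write(text)

pattern = r"Integrated(Cons|Pros)\.txt"
reader = ProsConsCorpusReader(root, pattern, cat_pattern=pattern)

# Select by category, by file id, or take everything.
pros_sents = list(reader.sents(categories="Pros"))
cons_sents = list(reader.sents(fileids="IntegratedCons.txt"))
all_sents = list(reader.sents())

for sent in pros_sents:
    print(sent)
```

Each returned sentence is a list of string tokens, so the result can be passed directly to code that expects pre-tokenized input.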

words(fileids=None, categories=None)[source]

Return all words and punctuation symbols in the corpus or in the specified files/categories.

Parameters
  • fileids – a list or regexp specifying the ids of the files whose words are to be returned.

  • categories – a list specifying the categories whose words are to be returned.

Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)
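
Because words() returns a flat sequence of tokens, it combines naturally with simple counting. The sketch below uses made-up sample lines in a temporary directory (again assuming the <Pros>/<Cons> line format) and counts lowercased alphabetic tokens in one category.

```python
import os
import tempfile
from collections import Counter

from nltk.corpus.reader.pros_cons import ProsConsCorpusReader

# Made-up sample data written to a temporary directory.
root = tempfile.mkdtemp()
samples = {
    "IntegratedPros.txt": "<Pros>Easy to use, economical!</Pros>\n",
    "IntegratedCons.txt": "<Cons>Eats batteries.</Cons>\n"
                          "<Cons>Batteries die fast.</Cons>\n",
}
for name, text in samples.items():
    with open(os.path.join(root, name), "w", encoding="utf8") as f:
        f.write(text)

pattern = r"Integrated(Cons|Pros)\.txt"
reader = ProsConsCorpusReader(root, pattern, cat_pattern=pattern)

# Flat token list for one category, punctuation included.
cons_words = list(reader.words(categories="Cons"))

# Count alphabetic tokens only, ignoring case.
cons_counts = Counter(w.lower() for w in cons_words if w.isalpha())
print(cons_counts.most_common(1))
```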