nltk.corpus.reader.reviews module

CorpusReader for reviews corpora (syntax based on Customer Review Corpus).

Customer Review Corpus information

Annotated by: Minqing Hu and Bing Liu, 2004.

Department of Computer Science, University of Illinois at Chicago

Contact: Bing Liu, liub@cs.uic.edu

https://www.cs.uic.edu/~liub

Distributed with permission.

The “product_reviews_1” and “product_reviews_2” datasets respectively contain annotated customer reviews of 5 and 9 products from amazon.com.

Related papers:

  • Minqing Hu and Bing Liu. “Mining and summarizing customer reviews”.

    Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-04), 2004.

  • Minqing Hu and Bing Liu. “Mining Opinion Features in Customer Reviews”.

    Proceedings of Nineteenth National Conference on Artificial Intelligence (AAAI-2004), 2004.

  • Xiaowen Ding, Bing Liu and Philip S. Yu. “A Holistic Lexicon-Based Approach to Opinion Mining”.

    Proceedings of First ACM International Conference on Web Search and Data Mining (WSDM-2008), Feb 11-12, 2008, Stanford University, Stanford, California, USA.

Symbols used in the annotated reviews:

[t]

the title of the review: Each [t] tag starts a review.

xxxx[+|-n]

xxxx is a product feature.

[+n]

Positive opinion; n is the opinion strength, from 3 (strongest) to 1 (weakest). Note that the strength is quite subjective. You may want to ignore it and consider only + and -.

[-n]

Negative opinion

##

start of each sentence. Each line is a sentence.

[u]

the feature does not appear in the sentence.

[p]

the feature does not appear in the sentence; pronoun resolution is needed.

[s]

suggestion or recommendation.

[cc]

comparison with a competing product from a different brand.

[cs]

comparison with a competing product from the same brand.

Note: Some of the files (e.g. “ipod.txt”, “Canon PowerShot SD500.txt”) do not provide separation between different reviews. This is because the dataset was specifically designed for aspect/feature-based sentiment analysis, for which sentence-level annotation is sufficient. For document-level classification and analysis, this peculiarity should be taken into consideration.
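As an illustration of the symbols above, the following is a minimal sketch of how one annotated line could be split into its parts using only the standard library. The regular expressions and the parse_line helper are assumptions made for this example; they are not necessarily the patterns ReviewsCorpusReader uses internally, and the sample lines are invented.

```python
import re

# Hypothetical patterns for illustration only; not taken from NLTK.
TITLE_RE = re.compile(r"^\[t\](.*)$")                  # [t] review title
FEATURE_RE = re.compile(r"(\w+(?: \w+)*)\[([+-]\d)\]")  # xxxx[+n] or xxxx[-n]
NOTE_RE = re.compile(r"\[(u|p|s|cc|cs)\]")             # [u], [p], [s], [cc], [cs]


def parse_line(line):
    """Parse one annotated line into a dict of its components."""
    line = line.strip()
    title = TITLE_RE.match(line)
    if title:
        return {"title": title.group(1).strip()}
    # Annotations come before '##', the sentence text after it.
    annotations, _, sentence = line.partition("##")
    return {
        "features": FEATURE_RE.findall(annotations),
        "notes": NOTE_RE.findall(annotations),
        "sentence": sentence.strip(),
    }


title_line = parse_line("[t]excellent camera")
sent_line = parse_line("use[+2], picture quality[-1][u]##easy to use but pictures are soft .")
plain_line = parse_line("##i bought it last week .")
```

Splitting on ## before searching for tags keeps bracketed text inside the sentence itself from being mistaken for an annotation.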

class nltk.corpus.reader.reviews.Review[source]

Bases: object

A Review is the main block of a ReviewsCorpusReader.

__init__(title=None, review_lines=None)[source]
Parameters
  • title – the title of the review.

  • review_lines – the list of the ReviewLines that belong to the Review.

add_line(review_line)[source]

Add a line (ReviewLine) to the review.

Parameters

review_line – a ReviewLine instance that belongs to the Review.

features()[source]

Return a list of features in the review. Each feature is a tuple made of the specific item feature and the opinion strength about that feature.

Returns

all features of the review as a list of tuples (feat, score).

Return type

list(tuple)

sents()[source]

Return all tokenized sentences in the review.

Returns

all sentences of the review as lists of tokens.

Return type

list(list(str))

class nltk.corpus.reader.reviews.ReviewLine[source]

Bases: object

A ReviewLine represents a sentence of the review, together with (optional) annotations of its features and notes about the reviewed item.

__init__(sent, features=None, notes=None)[source]

class nltk.corpus.reader.reviews.ReviewsCorpusReader[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for the Customer Review Data dataset by Hu and Liu (2004). Note: no sentence tokenization is applied at the moment, only word tokenization.

>>> from nltk.corpus import product_reviews_1
>>> camera_reviews = product_reviews_1.reviews('Canon_G3.txt')
>>> review = camera_reviews[0]
>>> review.sents()[0]
['i', 'recently', 'purchased', 'the', 'canon', 'powershot', 'g3', 'and', 'am',
'extremely', 'satisfied', 'with', 'the', 'purchase', '.']
>>> review.features()
[('canon powershot g3', '+3'), ('use', '+2'), ('picture', '+2'),
('picture quality', '+1'), ('picture quality', '+1'), ('camera', '+2'),
('use', '+2'), ('feature', '+1'), ('picture quality', '+3'), ('use', '+1'),
('option', '+1')]

We can also reach the same information directly from the stream:

>>> product_reviews_1.features('Canon_G3.txt')
[('canon powershot g3', '+3'), ('use', '+2'), ...]

We can compute stats for specific product features:

>>> n_reviews = len([(feat,score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> tot = sum([int(score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> mean = tot / n_reviews
>>> print(n_reviews, tot, mean)
15 24 1.6

CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

__init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), encoding='utf8')[source]
Parameters
  • root – The root directory for the corpus.

  • fileids – a list or regexp specifying the fileids in the corpus.

  • word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WordPunctTokenizer

  • encoding – the encoding that should be used to read the corpus.
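The default pattern in the signature can be tried out with the standard re module. This sketch assumes that matching the pattern with re.findall approximates what the default tokenizer does for plain strings; the sample sentence is invented.

```python
import re

# Pattern copied from the default word_tokenizer shown in the signature.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

sentence = "it's great, isn't it?"
tokens = WORD_PUNCT.findall(sentence)
print(tokens)
```

Runs of word characters and runs of punctuation become separate tokens, so contractions are split around the apostrophe.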

features(fileids=None)[source]

Return a list of features. Each feature is a tuple made of the specific item feature and the opinion strength about that feature.

Parameters

fileids – a list or regexp specifying the ids of the files whose features have to be returned.

Returns

all features for the item(s) in the given file(s).

Return type

list(tuple)
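The (feature, score) tuples can be aggregated per feature with plain Python. The sketch below is one possible approach, not part of the reader; the mean_strength_by_feature helper is hypothetical and the sample tuples are made up in the shape shown in the doctest above.

```python
from collections import defaultdict


def mean_strength_by_feature(features):
    """Map each feature name to (mention count, mean signed strength)."""
    totals = defaultdict(lambda: [0, 0])  # feature -> [count, sum of scores]
    for feat, score in features:
        totals[feat][0] += 1
        totals[feat][1] += int(score)  # int() accepts strings like '+2' and '-1'
    return {feat: (n, total / n) for feat, (n, total) in totals.items()}


# Invented sample data; with the corpus installed, the list returned by
# features() could be passed in instead.
sample = [("picture", "+2"), ("use", "+1"), ("picture", "-1"), ("use", "+3")]
stats = mean_strength_by_feature(sample)
```

Because the scores are signed strings, positive and negative opinions about the same feature offset each other in the mean.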

reviews(fileids=None)[source]

Return all the reviews as a list of Review objects. If fileids is specified, return all the reviews from each of the specified files.

Parameters

fileids – a list or regexp specifying the ids of the files whose reviews have to be returned.

Returns

the given file(s) as a list of reviews.

sents(fileids=None)[source]

Return all sentences in the corpus or in the specified files.

Parameters

fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.

Returns

the given file(s) as a list of sentences, each encoded as a list of word strings.

Return type

list(list(str))

words(fileids=None)[source]

Return all words and punctuation symbols in the corpus or in the specified files.

Parameters

fileids – a list or regexp specifying the ids of the files whose words have to be returned.

Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)