nltk.corpus.reader.reviews module¶
CorpusReader for reviews corpora (syntax based on Customer Review Corpus).
Customer Review Corpus information¶
- Annotated by: Minqing Hu and Bing Liu, 2004.
Department of Computer Science, University of Illinois at Chicago
- Contact: Bing Liu, liub@cs.uic.edu
Distributed with permission.
The “product_reviews_1” and “product_reviews_2” datasets respectively contain annotated customer reviews of 5 and 9 products from amazon.com.
Related papers:
- Minqing Hu and Bing Liu. “Mining and summarizing customer reviews”.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-04), 2004.
- Minqing Hu and Bing Liu. “Mining Opinion Features in Customer Reviews”.
Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), 2004.
- Xiaowen Ding, Bing Liu and Philip S. Yu. “A Holistic Lexicon-Based Approach to
Opinion Mining.” Proceedings of the First ACM International Conference on Web Search and Data Mining (WSDM-2008), Feb 11-12, 2008, Stanford University, Stanford, California, USA.
Symbols used in the annotated reviews (an illustrative fragment follows this list):
- [t]:
the title of the review; each [t] tag starts a review.
- xxxx[+|-n]:
xxxx is a product feature.
- [+n]:
Positive opinion; n is the opinion strength: 3 is strongest and 1 is weakest. Note that the strength is quite subjective; you may want to ignore it and only consider + and -.
- [-n]:
Negative opinion
- ##:
start of each sentence. Each line is a sentence.
- [u]:
the feature does not appear in the sentence.
- [p]:
the feature does not appear in the sentence; pronoun resolution is needed.
- [s]:
suggestion or recommendation.
- [cc]:
comparison with a competing product from a different brand.
- [cs]:
comparison with a competing product from the same brand.
- Note: Some of the files (e.g. “ipod.txt”, “Canon PowerShot SD500.txt”) do not
provide separation between different reviews. This is because the dataset was specifically designed for aspect/feature-based sentiment analysis, for which sentence-level annotation is sufficient. For document-level classification and analysis, this peculiarity should be taken into account.
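Putting these symbols together, a fragment of an annotated file might look like the following. This is a made-up illustration that follows the conventions above, not text taken from the corpus:

[t]great pocket camera
picture quality[+2]##the picture quality is excellent .
battery life[-1][u]##it runs out of power far too quickly .
size[+1][cc]##it is smaller than my old sony .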
- class nltk.corpus.reader.reviews.Review[source]¶
Bases:
object
A Review is the main block of a ReviewsCorpusReader.
- __init__(title=None, review_lines=None)[source]¶
- Parameters:
title – the title of the review.
review_lines – the list of ReviewLine objects that belong to the Review.
- add_line(review_line)[source]¶
Add a line (ReviewLine) to the review.
- Parameters:
review_line – a ReviewLine instance that belongs to the Review.
- class nltk.corpus.reader.reviews.ReviewLine[source]¶
Bases:
object
A ReviewLine represents a sentence of the review, together with (optional) annotations of its features and notes about the reviewed item.
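As a minimal sketch, a Review can also be assembled by hand and queried the same way as the reviews returned by the reader. The ReviewLine constructor arguments used here (a tokenized sentence plus an optional list of feature annotations) are assumed, since its signature is not reproduced above:

>>> from nltk.corpus.reader.reviews import Review, ReviewLine
>>> # assumed signature: ReviewLine(sent, features=None, notes=None)
>>> line = ReviewLine(['the', 'screen', 'is', 'bright', '.'],
...                   features=[('screen', '+2')])
>>> review = Review(title='hypothetical review')
>>> review.add_line(line)
>>> review.sents()
[['the', 'screen', 'is', 'bright', '.']]
>>> review.features()
[('screen', '+2')]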
- class nltk.corpus.reader.reviews.ReviewsCorpusReader[source]¶
Bases:
CorpusReader
Reader for the Customer Review Data dataset by Hu and Liu (2004). Note: we are not applying any sentence tokenization at the moment, only word tokenization.
>>> from nltk.corpus import product_reviews_1
>>> camera_reviews = product_reviews_1.reviews('Canon_G3.txt')
>>> review = camera_reviews[0]
>>> review.sents()[0]
['i', 'recently', 'purchased', 'the', 'canon', 'powershot', 'g3', 'and', 'am',
'extremely', 'satisfied', 'with', 'the', 'purchase', '.']
>>> review.features()
[('canon powershot g3', '+3'), ('use', '+2'), ('picture', '+2'),
('picture quality', '+1'), ('picture quality', '+1'), ('camera', '+2'),
('use', '+2'), ('feature', '+1'), ('picture quality', '+3'), ('use', '+1'),
('option', '+1')]
We can also reach the same information directly from the stream:
>>> product_reviews_1.features('Canon_G3.txt')
[('canon powershot g3', '+3'), ('use', '+2'), ...]
We can compute stats for specific product features:
>>> n_reviews = len([(feat,score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> tot = sum([int(score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> mean = tot / n_reviews
>>> print(n_reviews, tot, mean)
15 24 1.6
- CorpusView¶
alias of
StreamBackedCorpusView
- __init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), encoding='utf8')[source]¶
- Parameters:
root – The root directory for the corpus.
fileids – a list or regexp specifying the fileids in the corpus.
word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WordPunctTokenizer
encoding – the encoding that should be used to read the corpus.
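The reader can also be instantiated directly on a local copy of the data; the root path below is a placeholder, not an actual location:

>>> from nltk.corpus.reader.reviews import ReviewsCorpusReader
>>> # '/path/to/product_reviews_1' is a placeholder for a local copy of the corpus
>>> reader = ReviewsCorpusReader(root='/path/to/product_reviews_1',
...                              fileids=r'.*\.txt')
>>> words = reader.words('Canon_G3.txt')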
- features(fileids=None)[source]¶
Return a list of features. Each feature is a tuple made of the specific item feature and the opinion strength about that feature.
- Parameters:
fileids – a list or regexp specifying the ids of the files whose features have to be returned.
- Returns:
all features for the item(s) in the given file(s).
- Return type:
list(tuple)
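Since each feature is a (name, strength) tuple with the strength encoded as a signed integer string (e.g. '+3'), per-feature average strengths can be computed along these lines, generalizing the 'picture' example above:

>>> from collections import defaultdict
>>> from nltk.corpus import product_reviews_1
>>> scores = defaultdict(list)
>>> for feat, score in product_reviews_1.features('Canon_G3.txt'):
...     scores[feat].append(int(score))
>>> averages = {feat: sum(vals) / len(vals) for feat, vals in scores.items()}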
- reviews(fileids=None)[source]¶
Return all the reviews as a list of Review objects. If fileids is specified, return all the reviews from each of the specified files.
- Parameters:
fileids – a list or regexp specifying the ids of the files whose reviews have to be returned.
- Returns:
the given file(s) as a list of reviews.
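For example, one might iterate over the Review objects to get per-review sentence counts; the title attribute is assumed here to mirror the constructor parameter of the same name:

>>> from nltk.corpus import product_reviews_1
>>> for review in product_reviews_1.reviews('Canon_G3.txt')[:3]:
...     # review.title is assumed to hold the [t] title of each review
...     print(review.title, len(review.sents()))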
- sents(fileids=None)[source]¶
Return all sentences in the corpus or in the specified files.
- Parameters:
fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
- Returns:
the given file(s) as a list of sentences, each encoded as a list of word strings.
- Return type:
list(list(str))
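For instance, the average sentence length (in tokens) for a single file can be computed along these lines:

>>> from nltk.corpus import product_reviews_1
>>> sents = product_reviews_1.sents('Canon_G3.txt')
>>> avg_len = sum(len(sent) for sent in sents) / len(sents)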
- words(fileids=None)[source]¶
Return all words and punctuation symbols in the corpus or in the specified files.
- Parameters:
fileids – a list or regexp specifying the ids of the files whose words have to be returned.
- Returns:
the given file(s) as a list of words and punctuation symbols.
- Return type:
list(str)
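A common follow-up is to feed the word stream into a frequency distribution, for example:

>>> from nltk import FreqDist
>>> from nltk.corpus import product_reviews_1
>>> fdist = FreqDist(product_reviews_1.words('Canon_G3.txt'))
>>> most_common = fdist.most_common(10)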