nltk.corpus.reader.reviews module¶
CorpusReader for reviews corpora (syntax based on Customer Review Corpus).
Customer Review Corpus information¶
- Annotated by: Minqing Hu and Bing Liu, 2004.
Department of Computer Science, University of Illinois at Chicago
- Contact: Bing Liu, liub@cs.uic.edu
Distributed with permission.
The “product_reviews_1” and “product_reviews_2” datasets respectively contain annotated customer reviews of 5 and 9 products from amazon.com.
Related papers:
- Minqing Hu and Bing Liu. “Mining and summarizing customer reviews”.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-04), 2004.
- Minqing Hu and Bing Liu. “Mining Opinion Features in Customer Reviews”.
Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), 2004.
- Xiaowen Ding, Bing Liu and Philip S. Yu. “A Holistic Lexicon-Based Approach to
Opinion Mining.” Proceedings of the First ACM International Conference on Web Search and Data Mining (WSDM-2008), Feb 11-12, 2008, Stanford University, Stanford, California, USA.
Symbols used in the annotated reviews (an illustrative fragment follows this list):
- [t]:
the title of the review; each [t] tag starts a review.
- xxxx[+|-n]:
xxxx is a product feature.
- [+n]:
Positive opinion; n is the opinion strength: 3 is strongest and 1 is weakest. Note that the strength is quite subjective; you may want to ignore it and only consider + and -.
- [-n]:
Negative opinion
- ##:
start of each sentence. Each line is a sentence.
- [u]:
the feature does not appear in the sentence.
- [p]:
the feature does not appear in the sentence; pronoun resolution is needed.
- [s]:
suggestion or recommendation.
- [cc]:
comparison with a competing product from a different brand.
- [cs]:
comparison with a competing product from the same brand.
- Note: Some of the files (e.g. “ipod.txt”, “Canon PowerShot SD500.txt”) do not
provide separation between different reviews. This is because the dataset was specifically designed for aspect/feature-based sentiment analysis, for which sentence-level annotation is sufficient. For document-level classification and analysis, this peculiarity should be taken into account.
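Putting these symbols together, a fragment of an annotated file might look like the following. This is a made-up illustration that follows the conventions above, not text taken from the corpus:

[t]great pocket camera
picture quality[+2]##the picture quality is excellent .
battery life[-1][u]##it runs out of power far too quickly .
size[+1][cc]##it is smaller than my old sony .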
- class nltk.corpus.reader.reviews.Review[source]¶
Bases:
object
A Review is the main block of a ReviewsCorpusReader.
- __init__(title=None, review_lines=None)[source]¶
- Parameters:
title – the title of the review.
review_lines – the list of ReviewLine objects that belong to the Review.
- add_line(review_line)[source]¶
Add a line (ReviewLine) to the review.
- Parameters:
review_line – a ReviewLine instance that belongs to the Review.
- class nltk.corpus.reader.reviews.ReviewLine[source]¶
Bases:
object
A ReviewLine represents a sentence of the review, together with (optional) annotations of its features and notes about the reviewed item.
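As a minimal sketch, a Review can also be assembled by hand and queried the same way as the reviews returned by the reader. The ReviewLine constructor arguments used here (a tokenized sentence plus an optional list of feature annotations) are assumed, since its signature is not reproduced above:

>>> from nltk.corpus.reader.reviews import Review, ReviewLine
>>> # assumed signature: ReviewLine(sent, features=None, notes=None)
>>> line = ReviewLine(['the', 'screen', 'is', 'bright', '.'],
...                   features=[('screen', '+2')])
>>> review = Review(title='hypothetical review')
>>> review.add_line(line)
>>> review.sents()
[['the', 'screen', 'is', 'bright', '.']]
>>> review.features()
[('screen', '+2')]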
- class nltk.corpus.reader.reviews.ReviewsCorpusReader[source]¶
Bases:
CorpusReader
Reader for the Customer Review Data dataset by Hu and Liu (2004). Note: we are not applying any sentence tokenization at the moment, only word tokenization.
>>> from nltk.corpus import product_reviews_1
>>> camera_reviews = product_reviews_1.reviews('Canon_G3.txt')
>>> review = camera_reviews[0]
>>> review.sents()[0]
['i', 'recently', 'purchased', 'the', 'canon', 'powershot', 'g3', 'and', 'am',
'extremely', 'satisfied', 'with', 'the', 'purchase', '.']
>>> review.features()
[('canon powershot g3', '+3'), ('use', '+2'), ('picture', '+2'),
('picture quality', '+1'), ('picture quality', '+1'), ('camera', '+2'),
('use', '+2'), ('feature', '+1'), ('picture quality', '+3'), ('use', '+1'),
('option', '+1')]
We can also reach the same information directly from the stream:
>>> product_reviews_1.features('Canon_G3.txt')
[('canon powershot g3', '+3'), ('use', '+2'), ...]
We can compute stats for specific product features:
>>> n_reviews = len([(feat,score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> tot = sum([int(score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> mean = tot / n_reviews
>>> print(n_reviews, tot, mean)
15 24 1.6
- CorpusView¶
alias of
StreamBackedCorpusView
- __init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), encoding='utf8')[source]¶
- Parameters:
root – The root directory for the corpus.
fileids – a list or regexp specifying the fileids in the corpus.
word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WordPunctTokenizer
encoding – the encoding that should be used to read the corpus.
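The reader can also be instantiated directly on a local copy of the data; the root path below is a placeholder, not an actual location:

>>> from nltk.corpus.reader.reviews import ReviewsCorpusReader
>>> # '/path/to/product_reviews_1' is a placeholder for a local copy of the corpus
>>> reader = ReviewsCorpusReader(root='/path/to/product_reviews_1',
...                              fileids=r'.*\.txt')
>>> words = reader.words('Canon_G3.txt')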
- features(fileids=None)[source]¶
Return a list of features. Each feature is a tuple made of the specific item feature and the opinion strength about that feature.
- Parameters:
fileids – a list or regexp specifying the ids of the files whose features have to be returned.
- Returns:
all features for the item(s) in the given file(s).
- Return type:
list(tuple)
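Since each feature is a (name, strength) tuple with the strength encoded as a signed integer string (e.g. '+3'), per-feature average strengths can be computed along these lines, generalizing the 'picture' example above:

>>> from collections import defaultdict
>>> from nltk.corpus import product_reviews_1
>>> scores = defaultdict(list)
>>> for feat, score in product_reviews_1.features('Canon_G3.txt'):
...     scores[feat].append(int(score))
>>> averages = {feat: sum(vals) / len(vals) for feat, vals in scores.items()}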
- reviews(fileids=None)[source]¶
Return all the reviews as a list of Review objects. If fileids is specified, return all the reviews from each of the specified files.
- Parameters:
fileids – a list or regexp specifying the ids of the files whose reviews have to be returned.
- Returns:
the given file(s) as a list of reviews.
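For example, one might iterate over the Review objects to get per-review sentence counts; the title attribute is assumed here to mirror the constructor parameter of the same name:

>>> from nltk.corpus import product_reviews_1
>>> for review in product_reviews_1.reviews('Canon_G3.txt')[:3]:
...     # review.title is assumed to hold the [t] title of each review
...     print(review.title, len(review.sents()))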
- sents(fileids=None)[source]¶
Return all sentences in the corpus or in the specified files.
- Parameters:
fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
- Returns:
the given file(s) as a list of sentences, each encoded as a list of word strings.
- Return type:
list(list(str))
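For instance, the average sentence length (in tokens) for a single file can be computed along these lines:

>>> from nltk.corpus import product_reviews_1
>>> sents = product_reviews_1.sents('Canon_G3.txt')
>>> avg_len = sum(len(sent) for sent in sents) / len(sents)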
- words(fileids=None)[source]¶
Return all words and punctuation symbols in the corpus or in the specified files.
- Parameters:
fileids – a list or regexp specifying the ids of the files whose words have to be returned.
- Returns:
the given file(s) as a list of words and punctuation symbols.
- Return type:
list(str)
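A common follow-up is to feed the word stream into a frequency distribution, for example:

>>> from nltk import FreqDist
>>> from nltk.corpus import product_reviews_1
>>> fdist = FreqDist(product_reviews_1.words('Canon_G3.txt'))
>>> most_common = fdist.most_common(10)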