nltk.corpus.reader.comparative_sents module

CorpusReader for the Comparative Sentence Dataset.

  • Comparative Sentence Dataset information -

Annotated by: Nitin Jindal and Bing Liu, 2006.

Department of Computer Science, University of Illinois at Chicago

Contact: Nitin Jindal, njindal@cs.uic.edu

Bing Liu, liub@cs.uic.edu (https://www.cs.uic.edu/~liub)

Distributed with permission.

Related papers:

  • Nitin Jindal and Bing Liu. “Identifying Comparative Sentences in Text Documents”.

    Proceedings of the ACM SIGIR International Conference on Information Retrieval (SIGIR-06), 2006.

  • Nitin Jindal and Bing Liu. “Mining Comparative Sentences and Relations”.

    Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-2006), 2006.

  • Murthy Ganapathibhotla and Bing Liu. “Mining Opinions in Comparative Sentences”.

    Proceedings of the 22nd International Conference on Computational Linguistics (Coling-2008), Manchester, 18-22 August, 2008.

class nltk.corpus.reader.comparative_sents.ComparativeSentencesCorpusReader[source]

Bases: CorpusReader

Reader for the Comparative Sentence Dataset by Jindal and Liu (2006).

>>> from nltk.corpus import comparative_sentences
>>> comparison = comparative_sentences.comparisons()[0]
>>> comparison.text 
['its', 'fast-forward', 'and', 'rewind', 'work', 'much', 'more', 'smoothly',
'and', 'consistently', 'than', 'those', 'of', 'other', 'models', 'i', "'ve",
'had', '.']
>>> comparison.entity_2
'models'
>>> (comparison.feature, comparison.keyword)
('rewind', 'more')
>>> len(comparative_sentences.comparisons())
853

CorpusView

alias of StreamBackedCorpusView

__init__(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), sent_tokenizer=None, encoding='utf8')[source]
Parameters:
  • root – The root directory for this corpus.

  • fileids – a list or regexp specifying the fileids in this corpus.

  • word_tokenizer – tokenizer for breaking sentences or paragraphs into words. Default: WhitespaceTokenizer

  • sent_tokenizer – tokenizer for breaking paragraphs into sentences.

  • encoding – the encoding that should be used to read the corpus.

comparisons(fileids=None)[source]

Return all comparisons in the corpus.

Parameters:

fileids – a list or regexp specifying the ids of the files whose comparisons have to be returned.

Returns:

the given file(s) as a list of Comparison objects.

Return type:

list(Comparison)
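
As a minimal sketch of working with the returned list, the helper below tallies comparisons by their comp_type attribute. The name count_by_type and the stand-in objects in the demo are assumptions made for illustration and are not part of NLTK. With the corpus data installed, the result of comparative_sentences.comparisons() could be passed in instead.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_type(comparisons):
    """Tally comparisons by comp_type (hypothetical helper, not part of NLTK)."""
    return Counter(c.comp_type for c in comparisons)


# Stand-in objects exposing the same attribute names as Comparison.
# The values are invented for the demo and are not taken from the corpus.
sample = [
    SimpleNamespace(comp_type=1, keyword="more"),
    SimpleNamespace(comp_type=1, keyword="better"),
    SimpleNamespace(comp_type=3, keyword="best"),
]

type_counts = count_by_type(sample)
print(type_counts[1], type_counts[3])
```

Because the helper only reads the comp_type attribute, it should work on any iterable of Comparison-like objects.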

keywords(fileids=None)[source]

Return a set of all keywords used in the corpus.

Parameters:

fileids – a list or regexp specifying the ids of the files whose keywords have to be returned.

Returns:

the set of keywords and comparative phrases used in the corpus.

Return type:

set(str)

keywords_readme()[source]

Return the list of words and constituents considered as clues of a comparison (from listOfkeywords.txt).

sents(fileids=None)[source]

Return all sentences in the corpus.

Parameters:

fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.

Returns:

all sentences of the corpus as lists of tokens (or as plain strings, if no word tokenizer is specified).

Return type:

list(list(str)) or list(str)
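
Since sents() may yield either token lists or plain strings, downstream code may want to handle both shapes. The sketch below shows one way to normalize a sentence to a single string. The function name sentence_text is an assumption for illustration, and the sample sentence is a shortened, made-up input rather than corpus output.

```python
def sentence_text(sent):
    """Return a sentence as one string, whether it is a token list or a str.

    Hypothetical helper, not part of NLTK.
    """
    if isinstance(sent, str):
        return sent
    # Token lists are joined with single spaces; original spacing is not restored.
    return " ".join(sent)


tokenized = ["its", "fast-forward", "and", "rewind", "work", "."]
plain = "its fast-forward and rewind work ."

print(sentence_text(tokenized))
print(sentence_text(plain))
```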

words(fileids=None)[source]

Return all words and punctuation symbols in the corpus.

Parameters:

fileids – a list or regexp specifying the ids of the files whose words have to be returned.

Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)

class nltk.corpus.reader.comparative_sents.Comparison[source]

Bases: object

A Comparison represents a comparative sentence and its constituents.

__init__(text=None, comp_type=None, entity_1=None, entity_2=None, feature=None, keyword=None)[source]
Parameters:
  • text – a string (optionally tokenized) containing a comparison.

  • comp_type – an integer defining the type of comparison expressed. Values can be: 1 (Non-equal gradable), 2 (Equative), 3 (Superlative), 4 (Non-gradable).

  • entity_1 – the first entity considered in the comparison relation.

  • entity_2 – the second entity considered in the comparison relation.

  • feature – the feature considered in the comparison relation.

  • keyword – the word or phrase which is used for that comparative relation.
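
The integer codes listed for comp_type can be turned into readable labels with a small lookup table. The following is a minimal sketch: the names COMP_TYPE_LABELS, comp_type_label and describe are assumptions for illustration, and the stand-in object uses invented values rather than data from the corpus.

```python
from types import SimpleNamespace

# Labels taken from the comp_type parameter description above.
COMP_TYPE_LABELS = {
    1: "Non-equal gradable",
    2: "Equative",
    3: "Superlative",
    4: "Non-gradable",
}


def comp_type_label(comp_type):
    """Map a comp_type code to its label; unknown or missing codes give 'unknown'."""
    return COMP_TYPE_LABELS.get(comp_type, "unknown")


def describe(comparison):
    """Summarize a Comparison-like object in one line (hypothetical helper)."""
    return (
        f"{comparison.entity_1} vs {comparison.entity_2} "
        f"on {comparison.feature} "
        f"[{comp_type_label(comparison.comp_type)}; keyword: {comparison.keyword}]"
    )


# Stand-in with the same attribute names as Comparison; values are invented.
example = SimpleNamespace(
    text="camera A is lighter than camera B",
    comp_type=1,
    entity_1="camera A",
    entity_2="camera B",
    feature="weight",
    keyword="lighter",
)

print(describe(example))
```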