nltk.corpus.reader.comparative_sents module

CorpusReader for the Comparative Sentence Dataset.

  • Comparative Sentence Dataset information

Annotated by: Nitin Jindal and Bing Liu, 2006.

Department of Computer Science, University of Illinois at Chicago

Contact: Nitin Jindal, njindal@cs.uic.edu

Bing Liu, liub@cs.uic.edu (https://www.cs.uic.edu/~liub)

Distributed with permission.

Related papers:

  • Nitin Jindal and Bing Liu. “Identifying Comparative Sentences in Text Documents”.

    Proceedings of the ACM SIGIR International Conference on Information Retrieval (SIGIR-06), 2006.

  • Nitin Jindal and Bing Liu. “Mining Comparative Sentences and Relations”.

    Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-2006), 2006.

  • Murthy Ganapathibhotla and Bing Liu. “Mining Opinions in Comparative Sentences”.

    Proceedings of the 22nd International Conference on Computational Linguistics (Coling-2008), Manchester, 18-22 August, 2008.

class nltk.corpus.reader.comparative_sents.ComparativeSentencesCorpusReader[source]

Bases: CorpusReader

Reader for the Comparative Sentence Dataset by Jindal and Liu (2006).

>>> from nltk.corpus import comparative_sentences
>>> comparison = comparative_sentences.comparisons()[0]
>>> comparison.text 
['its', 'fast-forward', 'and', 'rewind', 'work', 'much', 'more', 'smoothly',
'and', 'consistently', 'than', 'those', 'of', 'other', 'models', 'i', "'ve",
'had', '.']
>>> comparison.entity_2
'models'
>>> (comparison.feature, comparison.keyword)
('rewind', 'more')
>>> len(comparative_sentences.comparisons())
853

CorpusView

alias of StreamBackedCorpusView

__init__(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), sent_tokenizer=None, encoding='utf8')[source]

Parameters

  • root – The root directory for this corpus.

  • fileids – a list or regexp specifying the fileids in this corpus.

  • word_tokenizer – tokenizer for breaking sentences or paragraphs into words. Default: WhitespaceTokenizer

  • sent_tokenizer – tokenizer for breaking paragraphs into sentences.

  • encoding – the encoding that should be used to read the corpus.
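
A minimal sketch of constructing the reader directly instead of going through `nltk.corpus.comparative_sentences`. It assumes NLTK is installed and that the corpus files are already unpacked under the directory you pass in. The helper name `build_reader` is hypothetical, and the default fileid pattern and encoding are assumptions about how the corpus is usually laid out, so adjust them to your copy.

```python
def build_reader(root, fileids=r"labeledSentences\.txt", encoding="latin-1"):
    """Build a ComparativeSentencesCorpusReader for the corpus files under ``root``.

    The default ``fileids`` pattern and ``encoding`` are assumptions, not values
    taken from this page; pass your own if your copy of the corpus differs.
    """
    # Imported lazily so that defining this helper does not itself require NLTK.
    from nltk.corpus.reader.comparative_sents import (
        ComparativeSentencesCorpusReader,
    )
    from nltk.tokenize import WhitespaceTokenizer

    return ComparativeSentencesCorpusReader(
        root,
        fileids,
        word_tokenizer=WhitespaceTokenizer(),  # the documented default tokenizer
        encoding=encoding,
    )
```

Calling `build_reader("/path/to/comparative_sentences")` should then give an object with the `comparisons()`, `sents()` and `words()` methods described here, provided the path and pattern match your files.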

comparisons(fileids=None)[source]

Return all comparisons in the corpus.

Parameters

fileids – a list or regexp specifying the ids of the files whose comparisons should be returned.

Returns

the given file(s) as a list of Comparison objects.

Return type

list(Comparison)
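
Because `comparisons()` returns a plain list of `Comparison` objects, ordinary list processing applies. The sketch below selects comparisons by their `keyword` attribute. The function name `comparisons_with_keyword` is hypothetical, and the sample objects are stand-ins built with `SimpleNamespace` so the example runs without the corpus installed.

```python
from types import SimpleNamespace


def comparisons_with_keyword(comparisons, keyword):
    """Return the comparisons whose ``keyword`` attribute equals ``keyword``."""
    return [c for c in comparisons if c.keyword == keyword]


# Stand-in objects (not real corpus data) exposing the documented attributes.
sample = [
    SimpleNamespace(
        text=["a", "is", "better", "than", "b"],
        entity_1="a", entity_2="b", feature=None, keyword="better",
    ),
    SimpleNamespace(
        text=["x", "is", "as", "good", "as", "y"],
        entity_1="x", entity_2="y", feature=None, keyword="as",
    ),
]

matches = comparisons_with_keyword(sample, "better")
print([m.entity_2 for m in matches])
```

With the corpus available, the same helper can be applied to `comparative_sentences.comparisons()`.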

keywords(fileids=None)[source]

Return a set of all keywords used in the corpus.

Parameters

fileids – a list or regexp specifying the ids of the files whose keywords should be returned.

Returns

the set of keywords and comparative phrases used in the corpus.

Return type

set(str)
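
Since the result is a set of strings, membership tests are cheap. The following sketch checks which tokens of a sentence occur in a keyword set. The name `keyword_hits` and the sample keyword set are hypothetical, and the match is token-level only, so multi-word comparative phrases in the set are not detected by it.

```python
def keyword_hits(tokens, keywords):
    """Return the tokens, in order, that also appear in ``keywords``.

    Matching is case-insensitive and works on single tokens only.
    """
    lowered = {k.lower() for k in keywords}
    return [t for t in tokens if t.lower() in lowered]


# Hypothetical keyword set; the real one would come from keywords().
sample_keywords = {"more", "than", "as good as"}
tokens = ["its", "rewind", "works", "more", "smoothly", "than", "others"]

hits = keyword_hits(tokens, sample_keywords)
print(hits)
```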

keywords_readme()[source]

Return the list of words and constituents considered as clues of a comparison (from listOfkeywords.txt).

sents(fileids=None)[source]

Return all sentences in the corpus.

Parameters

fileids – a list or regexp specifying the ids of the files whose sentences should be returned.

Returns

all sentences of the corpus as lists of tokens (or as plain strings, if no word tokenizer is specified).

Return type

list(list(str)) or list(str)
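
Code that consumes `sents()` may need to handle both return shapes. A small sketch, assuming a space-joined string is an acceptable rendering of a token list (the helper name `sentence_to_str` is hypothetical):

```python
def sentence_to_str(sentence):
    """Return ``sentence`` as one string, whether it is a str or a token list."""
    if isinstance(sentence, str):
        return sentence
    # Token lists are joined with single spaces; original spacing is not restored.
    return " ".join(sentence)


print(sentence_to_str(["this", "one", "is", "faster", "."]))
print(sentence_to_str("this one is faster ."))
```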

words(fileids=None)[source]

Return all words and punctuation symbols in the corpus.

Parameters

fileids – a list or regexp specifying the ids of the files whose words should be returned.

Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)
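
The flat word list lends itself to simple frequency counts. This sketch lowercases tokens and skips those with no alphanumeric character, which is one possible way to drop the punctuation symbols the method also returns. The name `top_words` and the sample list are hypothetical.

```python
from collections import Counter


def top_words(words, n=10):
    """Return the ``n`` most common lowercased words, ignoring pure punctuation."""
    counts = Counter(
        w.lower() for w in words if any(ch.isalnum() for ch in w)
    )
    return counts.most_common(n)


# Stand-in for the output of words(); not real corpus data.
sample_words = ["The", "camera", "is", "better", "than", "the", "phone", "."]
top = top_words(sample_words, n=1)
print(top)
```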

class nltk.corpus.reader.comparative_sents.Comparison[source]

Bases: object

A Comparison represents a comparative sentence and its constituents.

__init__(text=None, comp_type=None, entity_1=None, entity_2=None, feature=None, keyword=None)[source]

Parameters

  • text – a string (optionally tokenized) containing a comparison.

  • comp_type – an integer defining the type of comparison expressed. Values can be: 1 (Non-equal gradable), 2 (Equative), 3 (Superlative), 4 (Non-gradable).

  • entity_1 – the first entity considered in the comparison relation.

  • entity_2 – the second entity considered in the comparison relation.

  • feature – the feature considered in the comparison relation.

  • keyword – the word or phrase which is used for that comparative relation.
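
The integer codes for `comp_type` can be mapped to the labels listed above. The sketch below tallies comparisons per label. The names `COMP_TYPE_LABELS` and `count_by_type` are hypothetical, and the sample objects are `SimpleNamespace` stand-ins rather than real `Comparison` instances.

```python
from collections import Counter
from types import SimpleNamespace

# Labels as given in the comp_type parameter description.
COMP_TYPE_LABELS = {
    1: "Non-equal gradable",
    2: "Equative",
    3: "Superlative",
    4: "Non-gradable",
}


def count_by_type(comparisons):
    """Count comparisons per comp_type label; unmapped values count as 'Unknown'."""
    return Counter(
        COMP_TYPE_LABELS.get(c.comp_type, "Unknown") for c in comparisons
    )


# Stand-in objects carrying only the attribute this helper reads.
sample = [
    SimpleNamespace(comp_type=1),
    SimpleNamespace(comp_type=1),
    SimpleNamespace(comp_type=3),
    SimpleNamespace(comp_type=None),
]
counts = count_by_type(sample)
print(counts)
```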