nltk.corpus.reader.twitter module

A reader for corpora that consist of Tweets. It is assumed that the Tweets have been serialised into line-delimited JSON.
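As an illustration of the expected input, the following minimal sketch writes a couple of Tweets as line-delimited JSON using only the standard library. The sample data, the helper name `write_line_delimited` and the file name `sample.json` are hypothetical; real Tweet objects carry many more fields.

```python
import json
import os
import tempfile

# Hypothetical sample data: real Tweet objects carry many more fields.
sample_tweets = [
    {"id": 1, "text": "Hello #nltk"},
    {"id": 2, "text": "Reading tweets with @nltk_org"},
]


def write_line_delimited(tweets, path):
    """Write one JSON object per line, with no enclosing array."""
    with open(path, "w", encoding="utf8") as handle:
        for tweet in tweets:
            # json.dumps escapes embedded newlines, so each Tweet stays on one line.
            handle.write(json.dumps(tweet) + "\n")


corpus_dir = tempfile.mkdtemp()
corpus_path = os.path.join(corpus_dir, "sample.json")
write_line_delimited(sample_tweets, corpus_path)

# Read the file back to confirm that each line is a complete JSON document.
with open(corpus_path, encoding="utf8") as handle:
    lines = handle.read().splitlines()
```

A directory of files written this way could then serve as the root of a Tweet corpus.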

class nltk.corpus.reader.twitter.TwitterCorpusReader

Bases: CorpusReader

Reader for corpora that consist of Tweets represented as a list of line-delimited JSON.

Individual Tweets can be tokenized using the default tokenizer, or by a custom tokenizer specified as a parameter to the constructor.

Construct a new Tweet corpus reader for a set of documents located at the given root directory.

If you made your own tweet collection in a directory called twitter-files, then you can initialise the reader as:

from nltk.corpus import TwitterCorpusReader
reader = TwitterCorpusReader(root='/path/to/twitter-files', fileids=r'.*\.json')

However, the recommended approach is to set the relevant directory as the value of the environment variable TWITTER, and then invoke the reader as follows:

import os
root = os.environ['TWITTER']
reader = TwitterCorpusReader(root, r'.*\.json')

If you want to work directly with the raw Tweets, the json library can be used:

import json
for tweet in reader.docs():
    print(json.dumps(tweet, indent=1, sort_keys=True))
CorpusView

The corpus view class used by this reader.

alias of StreamBackedCorpusView

__init__(root, fileids=None, word_tokenizer=<nltk.tokenize.casual.TweetTokenizer object>, encoding='utf8')
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

  • word_tokenizer – Tokenizer for breaking the text of Tweets into smaller units, including but not limited to words.

  • encoding – The default encoding for the files that make up the corpus.
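A custom tokenizer might be supplied along the following lines. This is a sketch under the assumption that the reader only needs an object exposing a `tokenize(text)` method that returns a list of strings, as NLTK tokenizers do. `WhitespaceTweetTokenizer` and `build_reader` are hypothetical names introduced for illustration.

```python
class WhitespaceTweetTokenizer:
    """Hypothetical tokenizer that splits Tweet text on whitespace only."""

    def tokenize(self, text):
        return text.split()


def build_reader(root):
    """Create a reader over the *.json files under root with the custom tokenizer."""
    # Imported lazily so the tokenizer can be tried without NLTK installed.
    from nltk.corpus.reader.twitter import TwitterCorpusReader

    return TwitterCorpusReader(
        root, r".*\.json", word_tokenizer=WhitespaceTweetTokenizer()
    )


# The tokenizer can be checked on its own before wiring it into a reader.
tokens = WhitespaceTweetTokenizer().tokenize("Hello #nltk :)")
```

Unlike the default TweetTokenizer, a whitespace split leaves punctuation attached to neighbouring words, so it is only suitable where that coarser granularity is acceptable.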

docs(fileids=None)

Returns the full Tweet objects, as specified by the Twitter documentation on Tweets.

Returns

the given file(s) as a list of dictionaries deserialised from JSON.

Return type

list(dict)
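Conceptually, deserialising line-delimited JSON into a list of dictionaries amounts to parsing each line separately. The sketch below shows the idea with the standard library only. It is not NLTK's implementation, and `read_tweets` and the in-memory sample are hypothetical.

```python
import io
import json


def read_tweets(stream):
    """Parse one JSON object per non-empty line into a dict."""
    return [json.loads(line) for line in stream if line.strip()]


# Hypothetical two-Tweet file held in memory for illustration.
sample = io.StringIO(
    '{"id": 1, "text": "Hello #nltk"}\n'
    '{"id": 2, "text": "Bye"}\n'
)
tweets = read_tweets(sample)
```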

strings(fileids=None)

Returns only the text content of Tweets in the file(s).

Returns

the given file(s) as a list of Tweets.

Return type

list(str)
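The relationship between full Tweet objects and their text content can be sketched as follows. The sketch assumes that the text of a Tweet is stored under the "text" key of the deserialised dictionary. Skipping objects that lack that key is a choice made here for illustration, and `tweet_texts` is a hypothetical helper.

```python
# Hypothetical deserialised Tweets; the second has no "text" field.
tweets = [
    {"id": 1, "text": "Hello #nltk"},
    {"id": 2},
    {"id": 3, "text": "Bye"},
]


def tweet_texts(tweets):
    """Collect the text of every Tweet that has a "text" field."""
    return [tweet["text"] for tweet in tweets if "text" in tweet]


texts = tweet_texts(tweets)
```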

tokenized(fileids=None)
Returns

the given file(s) as a list of Tweets, with the text content of each Tweet given as a list of words, screen names, hashtags, URLs and punctuation symbols.

Return type

list(list(str))