nltk.corpus.reader.twitter module

A reader for corpora that consist of Tweets. It is assumed that the Tweets have been serialised into line-delimited JSON.
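As a rough illustration of the input format assumed here, the sketch below writes a couple of Tweet-like dictionaries to a file with one JSON object per line and then reads them back. The file name `sample.json` and the two fields shown are assumptions made for the example; real Tweet objects carry many more fields.

```python
import json
import os
import tempfile

# Hypothetical, heavily trimmed Tweet objects (real ones have many more fields).
tweets = [
    {"id": 1, "text": "Hello #nltk"},
    {"id": 2, "text": "Line-delimited JSON is one object per line"},
]

# Throwaway directory standing in for a corpus root.
corpus_dir = tempfile.mkdtemp()
path = os.path.join(corpus_dir, "sample.json")

# Serialise: one JSON document per line, with no enclosing list.
with open(path, "w", encoding="utf8") as f:
    for tweet in tweets:
        f.write(json.dumps(tweet) + "\n")

# Deserialise: parse each non-empty line independently.
with open(path, encoding="utf8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == tweets)
```

Because each line is parsed on its own, a file in this layout can be processed as a stream without loading the whole collection into memory.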

class nltk.corpus.reader.twitter.TwitterCorpusReader

Bases: CorpusReader

Reader for corpora that consist of Tweets represented as a list of line-delimited JSON.

Individual Tweets can be tokenized using the default tokenizer, or by a custom tokenizer specified as a parameter to the constructor.

Construct a new Tweet corpus reader for a set of documents located at the given root directory.

If you made your own tweet collection in a directory called twitter-files, then you can initialise the reader as:

from nltk.corpus import TwitterCorpusReader
reader = TwitterCorpusReader(root='/path/to/twitter-files', fileids=r'.*\.json')

However, the recommended approach is to set the relevant directory as the value of the environment variable TWITTER, and then invoke the reader as follows:

import os
root = os.environ['TWITTER']
reader = TwitterCorpusReader(root, r'.*\.json')

If you want to work directly with the raw Tweets, the json library can be used:

import json
for tweet in reader.docs():
    print(json.dumps(tweet, indent=1, sort_keys=True))

CorpusView

The corpus view class used by this reader.

alias of StreamBackedCorpusView

__init__(root, fileids=None, word_tokenizer=<nltk.tokenize.casual.TweetTokenizer object>, encoding='utf8')

Parameters:

root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
word_tokenizer – Tokenizer for breaking the text of Tweets into smaller units, including but not limited to words.
encoding – The character encoding used to read the corpus files (default 'utf8').
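A minimal sketch of passing a custom tokenizer to the constructor. It builds a throwaway one-file corpus in a temporary directory (the file name and Tweet content are hypothetical) and swaps the default TweetTokenizer for NLTK's WhitespaceTokenizer. It assumes the reader only needs the tokenizer to provide a `tokenize(text)` method returning a list of strings.

```python
import json
import os
import tempfile

from nltk.corpus.reader.twitter import TwitterCorpusReader
from nltk.tokenize import WhitespaceTokenizer

# Build a throwaway one-file corpus (hypothetical content).
root = tempfile.mkdtemp()
with open(os.path.join(root, "tweets.json"), "w", encoding="utf8") as f:
    f.write(json.dumps({"text": "Hello #nltk :-)"}) + "\n")

# With no word_tokenizer argument, the reader uses TweetTokenizer.
default_reader = TwitterCorpusReader(root, r".*\.json")

# Custom tokenizer: assumed to need only a tokenize(str) -> list method.
custom_reader = TwitterCorpusReader(
    root, r".*\.json", word_tokenizer=WhitespaceTokenizer()
)

whitespace_tokens = custom_reader.tokenized()
print(whitespace_tokens)
```

Splitting on whitespace alone keeps emoticons and hashtags intact but leaves punctuation attached to neighbouring words, so it is mainly useful as a baseline for comparison with the default tokenizer.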

docs(fileids=None)

Returns the full Tweet objects, as specified by the Twitter documentation on Tweets.

Returns: the given file(s) as a list of dictionaries deserialised from JSON.
Return type: list(dict)

strings(fileids=None)

Returns only the text content of Tweets in the file(s).

Returns: the given file(s) as a list of Tweets.
Return type: list(str)

tokenized(fileids=None)

Returns: the given file(s) as a list of Tweets, each represented as a list of words, screen names, hashtags, URLs and punctuation symbols.
Return type: list(list(str))
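The three access methods can be compared side by side on a small corpus. The sketch below is illustrative only: the temporary directory, the file name and the two Tweets are made up for the example, and the exact token boundaries depend on the tokenizer in use.

```python
import json
import os
import tempfile

from nltk.corpus.reader.twitter import TwitterCorpusReader

# Hypothetical two-Tweet corpus written as line-delimited JSON.
root = tempfile.mkdtemp()
sample = [
    {"id": 1, "text": "Hello #nltk"},
    {"id": 2, "text": "Reading Tweets with @nltk_org"},
]
with open(os.path.join(root, "tweets.json"), "w", encoding="utf8") as f:
    for tweet in sample:
        f.write(json.dumps(tweet) + "\n")

reader = TwitterCorpusReader(root, r".*\.json")

docs = list(reader.docs())   # full dictionaries, one per Tweet
texts = reader.strings()     # only the text of each Tweet
tokens = reader.tokenized()  # one list of tokens per Tweet

for doc, text, toks in zip(docs, texts, tokens):
    print(doc["id"], repr(text), toks)
```

Each method gives a progressively narrower view of the same data: whole objects, then text only, then text split into tokens.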
