nltk.tokenize.casual module

Twitter-aware tokenizer, designed to be flexible and easy to adapt to new domains and tasks. The basic logic is this:

  1. The tuple REGEXPS defines a list of regular expression strings.

  2. The REGEXPS strings are put, in order, into a compiled regular expression object called WORD_RE, under the TweetTokenizer class.

  3. The tokenization is done by WORD_RE.findall(s), where s is the user-supplied string, inside the tokenize() method of the class TweetTokenizer.

  4. When instantiating TweetTokenizer objects, there are several options:
    • preserve_case. By default, it is set to True. If it is set to False, then the tokenizer will lowercase everything except for emoticons.

    • reduce_len. By default, it is set to False. It specifies whether to replace repeated character sequences of length 3 or greater with sequences of length 3.

    • strip_handles. By default, it is set to False. It specifies whether to remove Twitter handles from the text before it is tokenized.

    • match_phone_numbers. By default, it is set to True. It indicates whether the tokenize method should look for phone numbers.
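
The logic above can be sketched with the standard-library re module. This is only an illustration under stated assumptions: the patterns are toy stand-ins for the much longer REGEXPS tuple, and the names TOY_REGEXPS, TOY_WORD_RE, TOY_EMOTICON_RE and toy_tokenize are hypothetical, not part of NLTK.

```python
import re

# Step 1: an ordered tuple of pattern strings (toy stand-ins for REGEXPS).
# Order matters: earlier alternatives win when several could match.
TOY_REGEXPS = (
    r"[:;=][\-]?[)(DPp]",  # a few emoticons such as :-) ;) :P
    r"\#\w+",              # hashtags
    r"@\w+",               # handles
    r"\w+(?:'\w+)?",       # words, optionally with an apostrophe
    r"\S",                 # any other non-whitespace character
)

# Step 2: join the strings, in order, into one compiled alternation.
TOY_WORD_RE = re.compile("(" + "|".join(TOY_REGEXPS) + ")", re.UNICODE)
TOY_EMOTICON_RE = re.compile(TOY_REGEXPS[0])


def toy_tokenize(text, preserve_case=True):
    # Step 3: findall returns the matched tokens from left to right.
    words = TOY_WORD_RE.findall(text)
    # Step 4 (preserve_case only): lowercase everything except emoticons.
    if not preserve_case:
        words = [w if TOY_EMOTICON_RE.search(w) else w.lower() for w in words]
    return words


print(toy_tokenize("So COOL #Yay :-P", preserve_case=False))
# prints ['so', 'cool', '#yay', ':-P']
```

Putting the emoticon pattern first keeps ":-P" from being split into separate punctuation and letter tokens, which is why the order of the tuple is significant.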

class nltk.tokenize.casual.TweetTokenizer

Bases: object

Tokenizer for tweets.

>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
>>> tknzr.tokenize(s0)
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3',
 'and', 'some', 'arrows', '<', '>', '->', '<--']

Examples using strip_handles and reduce_len parameters:

>>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
>>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
>>> tknzr.tokenize(s1)
[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']

__init__(preserve_case=True, reduce_len=False, strip_handles=False, match_phone_numbers=True)

Create a TweetTokenizer instance with settings for use in the tokenize method.

Parameters
  • preserve_case (bool) – Flag indicating whether to preserve the casing (capitalisation) of text used in the tokenize method. Defaults to True.

  • reduce_len (bool) – Flag indicating whether to replace repeated character sequences of length 3 or greater with sequences of length 3. Defaults to False.

  • strip_handles (bool) – Flag indicating whether to remove Twitter handles from the text passed to the tokenize method. Defaults to False.

  • match_phone_numbers (bool) – Flag indicating whether the tokenize method should look for phone numbers. Defaults to True.

tokenize(text: str) → List[str]

Tokenize the input text.

Parameters

text (str) – The string to tokenize.

Return type

list(str)

Returns

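a list of token strings. Whitespace is discarded, so joining the tokens does not in general reproduce the original string; with preserve_case=False the tokens are also lowercased, except for emoticons.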

property WORD_RE: _regex.Pattern

Core TweetTokenizer regex

property PHONE_WORD_RE: _regex.Pattern

Secondary core TweetTokenizer regex, the variant that also matches phone numbers

nltk.tokenize.casual.reduce_lengthening(text)

Replace repeated character sequences of length 3 or greater with sequences of length 3.
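
The behaviour described here can be reproduced with a regular-expression backreference. The following is a minimal sketch using the standard-library re module; the name reduce_lengthening_sketch is hypothetical, and the library's own implementation may differ in detail.

```python
import re


def reduce_lengthening_sketch(text):
    # (.) captures one character; \1{2,} requires at least two more copies,
    # so the whole match is a run of three or more identical characters.
    # The replacement keeps exactly three copies of that character.
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


print(reduce_lengthening_sketch("waaaaayyyy too much!!!!!!"))
# prints waaayyy too much!!!
```

Runs of one or two characters, such as the "oo" in "too", are left untouched because the pattern needs at least three in a row.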

nltk.tokenize.casual.remove_handles(text)

Remove Twitter username handles from text.
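
As a rough illustration of the idea, the sketch below strips handles with the standard-library re module. It rests on a simplifying assumption, namely that a handle is "@" followed by 1 to 15 letters, digits or underscores and not preceded by a character that suggests an e-mail address. HANDLE_SKETCH_RE and remove_handles_sketch are hypothetical names, and the pattern is not the one the library uses.

```python
import re

# Assumed, simplified handle pattern: "@" plus 1 to 15 word characters.
# The lookbehind rejects an "@" that directly follows a letter, digit or
# one of a few symbols, so "user@example.com" is left alone.
HANDLE_SKETCH_RE = re.compile(r"(?<![A-Za-z0-9_!@#$%&*])@[A-Za-z0-9_]{1,15}")


def remove_handles_sketch(text):
    # Substitute a space rather than the empty string so that the text on
    # either side of a removed handle is not glued together.
    return HANDLE_SKETCH_RE.sub(" ", text)


print(repr(remove_handles_sketch("@remy: This is too much")))
# prints ' : This is too much'
```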

nltk.tokenize.casual.casual_tokenize(text, preserve_case=True, reduce_len=False, strip_handles=False, match_phone_numbers=True)

Convenience function that wraps TweetTokenizer: it creates a tokenizer with the given options and returns the result of tokenizing text.
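
A short usage sketch, reusing the sample tweet from the TweetTokenizer examples. It assumes NLTK is installed and that the keyword arguments are forwarded unchanged to TweetTokenizer, as the matching signatures suggest.

```python
from nltk.tokenize.casual import TweetTokenizer, casual_tokenize

s1 = "@remy: This is waaaaayyyy too much for you!!!!!!"

# One-off call: the options are passed as keyword arguments.
tokens = casual_tokenize(s1, strip_handles=True, reduce_len=True)

# Explicit form: build the tokenizer once, then call tokenize().
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
same_tokens = tknzr.tokenize(s1)

print(tokens)
print(tokens == same_tokens)
```

When many strings share the same options, creating one TweetTokenizer and calling its tokenize method repeatedly avoids constructing a new tokenizer on every call.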