nltk.tokenize.simple module

Simple Tokenizers

These tokenizers divide strings into substrings using the string split() method. When tokenizing using a particular delimiter string, use the string split() method directly, as this is more efficient.

The simple tokenizers are not available as separate functions; instead, you should just use the string split() method directly:

>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> s.split() 
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
>>> s.split(' ') 
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
>>> s.split('\n') 
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']

The simple tokenizers are mainly useful because they follow the standard TokenizerI interface, and so can be used with any code that expects a tokenizer. For example, these tokenizers can be used to specify the tokenization conventions when building a CorpusReader.
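
For example, a PlaintextCorpusReader accepts a tokenizer through its word_tokenizer argument. A minimal sketch (the corpus root and fileid pattern below are placeholders, not real data):

>>> from nltk.corpus.reader import PlaintextCorpusReader
>>> from nltk.tokenize import LineTokenizer
>>> # '/path/to/corpus' and the fileid pattern are hypothetical
>>> reader = PlaintextCorpusReader('/path/to/corpus', r'.*\.txt',
...                                word_tokenizer=LineTokenizer())
>>> # reader.words() would then yield one token per line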

class nltk.tokenize.simple.CharTokenizer[source]

Bases: StringTokenizer

Tokenize a string into individual characters. If this functionality is ever required directly, simply iterate over the string: for char in string.
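
A minimal illustration (our example; the class is imported from the submodule directly, since it may not be re-exported by nltk.tokenize):

>>> from nltk.tokenize.simple import CharTokenizer
>>> CharTokenizer().tokenize('abc')
['a', 'b', 'c']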

span_tokenize(s)[source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type:

Iterator[Tuple[int, int]]
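
For example (continuing the illustration above, one span per character):

>>> list(CharTokenizer().span_tokenize('abc'))
[(0, 1), (1, 2), (2, 3)]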

tokenize(s)[source]

Return a tokenized copy of s.

Return type:

List[str]

class nltk.tokenize.simple.LineTokenizer[source]

Bases: TokenizerI

Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').

>>> from nltk.tokenize import LineTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> LineTokenizer(blanklines='keep').tokenize(s) 
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']
>>> # same as [l for l in s.split('\n') if l.strip()]:
>>> LineTokenizer(blanklines='discard').tokenize(s) 
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', 'Thanks.']
Parameters:

blanklines

Indicates how blank lines should be handled. Valid values are:

  • discard: strip blank lines out of the token list before returning it.
    A line is considered blank if it contains only whitespace characters.

  • keep: leave all blank lines in the token list.

  • discard-eof: if the string ends with a newline, then do not generate
    a corresponding token '' after that newline (see the example after this list).
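
A brief illustration of the keep and discard-eof options (a sketch following the behaviour described above):

>>> from nltk.tokenize import LineTokenizer
>>> LineTokenizer(blanklines='keep').tokenize('a\n\nb\n\n')
['a', '', 'b', '']
>>> LineTokenizer(blanklines='discard-eof').tokenize('a\n\nb\n\n')
['a', '', 'b']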

__init__(blanklines='discard')[source]
span_tokenize(s)[source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type:

Iterator[Tuple[int, int]]
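
For example (an illustration assuming the default blanklines='discard'):

>>> list(LineTokenizer().span_tokenize('a\nb\nc'))
[(0, 1), (2, 3), (4, 5)]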

tokenize(s)[source]

Return a tokenized copy of s.

Return type:

List[str]

class nltk.tokenize.simple.SpaceTokenizer[source]

Bases: StringTokenizer

Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').

>>> from nltk.tokenize import SpaceTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> SpaceTokenizer().tokenize(s) 
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
class nltk.tokenize.simple.TabTokenizer[source]

Bases: StringTokenizer

Tokenize a string using the tab character as a delimiter, the same as s.split('\t').

>>> from nltk.tokenize import TabTokenizer
>>> TabTokenizer().tokenize('a\tb c\n\t d')
['a', 'b c\n', ' d']
nltk.tokenize.simple.line_tokenize(text, blanklines='discard')[source]
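
A convenience wrapper, equivalent to LineTokenizer(blanklines).tokenize(text). For example:

>>> from nltk.tokenize import line_tokenize
>>> line_tokenize('one\n\ntwo\n')
['one', 'two']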