nltk.tokenize.simple module¶
Simple Tokenizers
These tokenizers divide strings into substrings using the string
split() method.
When tokenizing using a particular delimiter string, use
the string split() method directly, as this is more efficient.
The simple tokenizers are not available as separate functions;
instead, you should just use the string split() method directly:
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> s.split()
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
>>> s.split(' ')
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
>>> s.split('\n')
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']
The simple tokenizers are mainly useful because they follow the
standard TokenizerI interface, and so can be used with any code
that expects a tokenizer.  For example, these tokenizers can be used
to specify the tokenization conventions when building a CorpusReader.
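To illustrate the interface point above, here is a minimal plain-Python sketch of the `tokenize(s)` contract that `TokenizerI` defines (illustrative only, not NLTK's actual implementation; the real base class also provides `span_tokenize` and batch methods). Any object honouring this contract can be passed to code that expects a tokenizer:

```python
# Sketch of the TokenizerI-style contract: any object with a
# tokenize(s) -> list-of-strings method can be swapped into code
# written against the interface. (Hypothetical class for illustration.)
class WhitespaceSplitTokenizer:
    """Tokenize on arbitrary whitespace runs, like s.split()."""
    def tokenize(self, s):
        return s.split()

def count_tokens(text, tokenizer):
    # Works with any tokenizer honouring the contract, including
    # NLTK's LineTokenizer, SpaceTokenizer, etc.
    return len(tokenizer.tokenize(text))

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
print(count_tokens(s, WhitespaceSplitTokenizer()))  # 14
```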
class nltk.tokenize.simple.CharTokenizer[source]¶
Bases: StringTokenizer

Tokenize a string into individual characters. If this functionality is ever required directly, use for char in string.
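As the docstring suggests, character tokenization needs no tokenizer class in plain Python: a string is already an iterable of its characters, so list(s) or a simple loop yields the same result.

```python
# Character "tokenization" in plain Python: list(s) produces the same
# sequence of one-character strings that CharTokenizer would.
s = "abc $1"
chars = list(s)
print(chars)  # ['a', 'b', 'c', ' ', '$', '1']

# The explicit-loop form mentioned in the docstring:
collected = [char for char in s]
assert collected == chars
```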
class nltk.tokenize.simple.LineTokenizer[source]¶
Bases: TokenizerI

Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').

>>> from nltk.tokenize import LineTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> LineTokenizer(blanklines='keep').tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']
>>> # same as [l for l in s.split('\n') if l.strip()]:
>>> LineTokenizer(blanklines='discard').tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', 'Thanks.']

Parameters:
blanklines – Indicates how blank lines should be handled. Valid values are:
- discard: strip blank lines out of the token list before returning it.
  A line is considered blank if it contains only whitespace characters.
- keep: leave all blank lines in the token list.
- discard-eof: if the string ends with a newline, then do not generate
  a corresponding token '' after that newline.
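The three blanklines modes can be reproduced with str.split('\n') in plain Python. This is a sketch for comparison, not NLTK's implementation:

```python
s = "a\nb\n\nc\n"

# keep: every segment, including blanks and the '' after a trailing newline
keep = s.split('\n')
print(keep)     # ['a', 'b', '', 'c', '']

# discard: drop every line that is empty or whitespace-only
discard = [line for line in s.split('\n') if line.strip()]
print(discard)  # ['a', 'b', 'c']

# discard-eof: drop only the '' token produced by a trailing newline,
# keeping interior blank lines
lines = s.split('\n')
if lines and lines[-1] == '':
    lines.pop()
print(lines)    # ['a', 'b', '', 'c']
```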
class nltk.tokenize.simple.SpaceTokenizer[source]¶
Bases: StringTokenizer

Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').

>>> from nltk.tokenize import SpaceTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> SpaceTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
class nltk.tokenize.simple.TabTokenizer[source]¶
Bases: StringTokenizer

Tokenize a string using the tab character as a delimiter, the same as s.split('\t').

>>> from nltk.tokenize import TabTokenizer
>>> TabTokenizer().tokenize('a\tb c\n\t d')
['a', 'b c\n', ' d']
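A common use for tab splitting is reading tab-separated fields. For a single record, str.split('\t') matches the tokenizer's behavior; for whole TSV files, the standard-library csv module with delimiter='\t' is usually more robust because it also handles quoting (a sketch, not part of this module):

```python
import csv
import io

# Splitting one tab-delimited record, the same result TabTokenizer
# would give for this input:
row = "name\tprice\tqty"
print(row.split('\t'))  # ['name', 'price', 'qty']

# For multi-line TSV data, csv.reader with a tab delimiter:
data = io.StringIO("a\tb\nc\td\n")
rows = list(csv.reader(data, delimiter='\t'))
print(rows)             # [['a', 'b'], ['c', 'd']]
```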