nltk.tokenize.simple module¶
Simple Tokenizers
These tokenizers divide strings into substrings using the string split() method. When tokenizing with a particular delimiter string, prefer calling split() directly, as it is more efficient. The simple tokenizers are not available as separate functions; instead, you should just use the string split() method directly:
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> s.split()
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
>>> s.split(' ')
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
>>> s.split('\n')
['Good muffins cost $3.88', 'in New York. Please buy me',
'two of them.', '', 'Thanks.']
The simple tokenizers are mainly useful because they follow the
standard TokenizerI
interface, and so can be used with any code
that expects a tokenizer. For example, these tokenizers can be used
to specify the tokenization conventions when building a CorpusReader.
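The interface pattern can be illustrated with a minimal sketch (plain Python, not NLTK's actual class hierarchy — TokenizerI also defines span_tokenize and related helpers): any object exposing a tokenize(s) method that returns a list of strings can stand in wherever such a tokenizer is expected. CommaTokenizer and count_tokens below are hypothetical names invented for this example.

```python
# Minimal sketch of the duck-typed tokenizer interface: a single
# tokenize() method mapping a string to a list of token strings.
class CommaTokenizer:
    def tokenize(self, s):
        # Split on commas and strip surrounding whitespace.
        return [tok.strip() for tok in s.split(",")]

def count_tokens(text, tokenizer):
    # Works with any object providing tokenize(), whether it comes
    # from nltk.tokenize or is hand-rolled like the class above.
    return len(tokenizer.tokenize(text))

print(count_tokens("muffins, coffee, tea", CommaTokenizer()))  # 3
```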
- class nltk.tokenize.simple.CharTokenizer[source]¶
Bases: StringTokenizer
Tokenize a string into individual characters. If this functionality is ever required directly, use for char in string.
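The suggested idiom produces the same output as the tokenizer: one token per character, whitespace included. A quick plain-Python check (no NLTK needed):

```python
# Character tokenization is equivalent to iterating over the
# string, i.e. list(s): one token per character, including spaces.
s = "ab c"
tokens = [char for char in s]
print(tokens)  # ['a', 'b', ' ', 'c']
assert tokens == list(s)
```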
- class nltk.tokenize.simple.LineTokenizer[source]¶
Bases: TokenizerI
Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').
>>> from nltk.tokenize import LineTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> LineTokenizer(blanklines='keep').tokenize(s)
['Good muffins cost $3.88', 'in New York. Please buy me',
'two of them.', '', 'Thanks.']
>>> # same as [l for l in s.split('\n') if l.strip()]:
>>> LineTokenizer(blanklines='discard').tokenize(s)
['Good muffins cost $3.88', 'in New York. Please buy me',
'two of them.', 'Thanks.']
- Parameters:
blanklines – Indicates how blank lines should be handled. Valid values are:
- discard: strip blank lines out of the token list before returning it. A line is considered blank if it contains only whitespace characters.
- keep: leave all blank lines in the token list.
- discard-eof: if the string ends with a newline, then do not generate a corresponding token '' after that newline.
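The three blanklines modes can be approximated with plain str.split('\n'), as in this sketch (approximations of the documented behavior, not NLTK's implementation):

```python
s = "Good muffins cost $3.88\nin New York.\n\nThanks.\n"

# keep: every line, including blanks and the empty token produced
# by the trailing newline.
keep = s.split("\n")

# discard: drop lines containing only whitespace.
discard = [l for l in s.split("\n") if l.strip()]

# discard-eof: keep interior blank lines, but drop only the empty
# token after a final newline.
discard_eof = s.split("\n")
if s.endswith("\n"):
    discard_eof = discard_eof[:-1]

print(keep)         # ['Good muffins cost $3.88', 'in New York.', '', 'Thanks.', '']
print(discard)      # ['Good muffins cost $3.88', 'in New York.', 'Thanks.']
print(discard_eof)  # ['Good muffins cost $3.88', 'in New York.', '', 'Thanks.']
```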
- class nltk.tokenize.simple.SpaceTokenizer[source]¶
Bases: StringTokenizer
Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').
>>> from nltk.tokenize import SpaceTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> SpaceTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
- class nltk.tokenize.simple.TabTokenizer[source]¶
Bases: StringTokenizer
Tokenize a string using the tab character as a delimiter, the same as s.split('\t').
>>> from nltk.tokenize import TabTokenizer
>>> TabTokenizer().tokenize('a\tb c\n\t d')
['a', 'b c\n', ' d']
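As with the other simple tokenizers, the result matches the corresponding str.split call; a quick plain-Python check of the doctest above:

```python
# Tab tokenization is equivalent to s.split('\t'); note that
# newlines and spaces survive inside the tokens.
s = "a\tb c\n\t d"
tokens = s.split("\t")
print(tokens)  # ['a', 'b c\n', ' d']
```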