nltk.tokenize.api module

Tokenizer Interface

class nltk.tokenize.api.TokenizerI[source]

Bases: abc.ABC

A processing interface for tokenizing a string. Subclasses must define tokenize() or tokenize_sents() (or both).
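Since TokenizerI is an abstract base class, the usual pattern is to subclass it and implement tokenize(); tokenize_sents() then works automatically. The following is a minimal self-contained sketch of that contract (it re-declares a stripped-down interface rather than importing nltk, and the subclass name is illustrative):

```python
# Minimal sketch of the TokenizerI contract: subclasses implement
# tokenize(), and tokenize_sents() is derived from it.
from abc import ABC, abstractmethod


class TokenizerI(ABC):
    @abstractmethod
    def tokenize(self, s):
        """Return a tokenized copy of s."""

    def tokenize_sents(self, strings):
        # Default implementation: tokenize each string in turn.
        return [self.tokenize(s) for s in strings]


class SimpleWhitespaceTokenizer(TokenizerI):
    """Illustrative subclass: split on runs of whitespace."""

    def tokenize(self, s):
        return s.split()


tok = SimpleWhitespaceTokenizer()
tokens = tok.tokenize("Good muffins cost $3.88")
print(tokens)
```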

abstract tokenize(s)[source]

Return a tokenized copy of s.

Return type

list(str)

span_tokenize(s)[source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type

iter(tuple(int, int))
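The defining property of span_tokenize() is that each (start_i, end_i) pair slices the original string back into the corresponding token. A standalone sketch of that behavior for whitespace-separated tokens (not NLTK's actual implementation, which varies by tokenizer):

```python
# Illustrative span tokenizer: yield (start, end) offsets such that
# s[start:end] is the matching token.
import re


def span_tokenize(s):
    for m in re.finditer(r"\S+", s):
        yield m.start(), m.end()


s = "Good muffins cost $3.88"
spans = list(span_tokenize(s))
print(spans)
# Slicing with each span recovers the token itself.
print([s[start:end] for start, end in spans])
```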

tokenize_sents(strings)[source]

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]

Return type

list(list(str))
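The list comprehension above is the whole of the default behavior; a self-contained sketch (using a simple whitespace split as a stand-in for a real tokenize() method):

```python
# tokenize_sents() semantics: apply tokenize() to each input string,
# producing a list of token lists.
def tokenize(s):
    return s.split()


def tokenize_sents(strings):
    return [tokenize(s) for s in strings]


result = tokenize_sents(["Hello world.", "How are you?"])
print(result)
```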

span_tokenize_sents(strings)[source]

Apply self.span_tokenize() to each element of strings. I.e.:

return [self.span_tokenize(s) for s in strings]

Return type

iter(list(tuple(int, int)))
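Analogously, span_tokenize_sents() yields one list of offset pairs per input string. A sketch under the same whitespace-token assumption as above:

```python
# span_tokenize_sents() semantics: one list of (start, end) spans
# per input string.
import re


def span_tokenize(s):
    return ((m.start(), m.end()) for m in re.finditer(r"\S+", s))


def span_tokenize_sents(strings):
    for s in strings:
        yield list(span_tokenize(s))


sents = ["a bc", "de f"]
all_spans = list(span_tokenize_sents(sents))
print(all_spans)
```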

class nltk.tokenize.api.StringTokenizer[source]

Bases: nltk.tokenize.api.TokenizerI

A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses).

tokenize(s)[source]

Return a tokenized copy of s.

Return type

list(str)

span_tokenize(s)[source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type

iter(tuple(int, int))
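The split string itself is a class attribute fixed by each subclass (for example, NLTK's SpaceTokenizer splits on ' ' and TabTokenizer on '\t'). A self-contained sketch of that design, with tokenize() reduced to its essential str.split() call:

```python
# Sketch of the StringTokenizer pattern: the base class implements
# tokenize() in terms of a split string that subclasses define.
class StringTokenizer:
    _string = None  # defined in subclasses

    def tokenize(self, s):
        return s.split(self._string)


class SpaceTokenizer(StringTokenizer):
    """Split on single space characters (newlines and tabs are
    left inside tokens)."""
    _string = " "


tokens = SpaceTokenizer().tokenize("Good muffins\ncost $3.88")
print(tokens)
```

Note that, unlike a whitespace tokenizer, splitting on the single space character keeps the embedded newline inside the token 'muffins\ncost'.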