nltk.tokenize.api module
Tokenizer Interface
- class nltk.tokenize.api.StringTokenizer[source]
Bases: TokenizerI
A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses).
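The splitting-on-a-fixed-string idea can be sketched as follows. This is an illustrative stand-in, not NLTK's actual implementation: the class name and the `_string` attribute here are assumptions chosen for the sketch (a concrete subclass would fix the separator).

```python
class StringTokenizerSketch:
    """Illustrative sketch: subclasses fix the separator string."""

    _string = " "  # separator; a subclass would override this (assumption, not NLTK's API)

    def tokenize(self, s):
        # divide s into substrings by splitting on the separator
        return s.split(self._string)

    def span_tokenize(self, s):
        # yield (start, end) offsets of each substring in s
        start = 0
        for tok in s.split(self._string):
            yield (start, start + len(tok))
            start += len(tok) + len(self._string)
```

A subclass only needs to set the separator; both `tokenize()` and `span_tokenize()` then follow from it.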
- class nltk.tokenize.api.TokenizerI[source]
Bases: ABC
A processing interface for tokenizing a string. Subclasses must define tokenize() or tokenize_sents() (or both).
- span_tokenize(s: str) → Iterator[Tuple[int, int]][source]
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
- Parameters:
s (str)
- Return type:
Iterator[Tuple[int, int]]
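The span_tokenize() contract can be illustrated with a small standalone sketch. This uses `re.finditer` for whitespace-delimited tokens purely as an example; it is not NLTK's implementation, only a demonstration of the offset invariant stated above.

```python
import re

def span_tokenize(s):
    """Sketch of the span_tokenize contract: yield (start_i, end_i)
    offsets such that s[start_i:end_i] is the corresponding token."""
    for m in re.finditer(r"\S+", s):
        yield m.span()

s = "Good muffins cost $3.88"
spans = list(span_tokenize(s))
# slicing with each offset pair recovers the tokens themselves
tokens = [s[start:end] for start, end in spans]
```

Working with offsets rather than token strings preserves the alignment between tokens and the original text, which string-based tokenize() discards.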
- span_tokenize_sents(strings: List[str]) → Iterator[List[Tuple[int, int]]][source]
Apply self.span_tokenize() to each element of strings. I.e.:
return [self.span_tokenize(s) for s in strings]
- Parameters:
strings (List[str])
- Yield:
List[Tuple[int, int]]
- Return type:
Iterator[List[Tuple[int, int]]]
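The per-string application described above can be sketched as below. The whitespace-based `span_tokenize` helper is an assumption standing in for `self.span_tokenize()`; the generator yields one list of offsets per input string, matching the documented yield type.

```python
import re

def span_tokenize(s):
    # stand-in for self.span_tokenize(): re-based sketch, not NLTK's code
    for m in re.finditer(r"\S+", s):
        yield m.span()

def span_tokenize_sents(strings):
    # apply span_tokenize to each element of strings,
    # yielding one list of (start, end) offsets per input string
    for s in strings:
        yield list(span_tokenize(s))

docs = ["a b", "cd e"]
offsets = list(span_tokenize_sents(docs))
```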