nltk.tokenize.RegexpTokenizer
- class nltk.tokenize.RegexpTokenizer
Bases: TokenizerI
A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
- Parameters
pattern (str) – The pattern used to build this tokenizer. This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:…), instead.
gaps (bool) – True if this tokenizer’s pattern should be used to find separators between tokens; False if this tokenizer’s pattern should be used to find the tokens themselves.
discard_empty (bool) – True if any empty tokens ('') generated by the tokenizer should be discarded. Empty tokens can only be generated if gaps == True.
flags (int) – The regexp flags used to compile this tokenizer’s pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
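The effect of gaps and discard_empty can be illustrated with the standard re module alone. The following is a minimal sketch, not NLTK's implementation: the helper name regexp_tokenize_sketch is hypothetical, and it assumes the tokenizer behaves like re.findall when gaps is False and like re.split when gaps is True. Both of those functions return different results when the pattern contains capturing groups, which is a plausible reason for the restriction on capturing parentheses.

```python
import re

# Hypothetical helper (not part of NLTK): mimics the documented meaning of
# `gaps` and `discard_empty` using only the standard library.
def regexp_tokenize_sketch(text, pattern, gaps=False, discard_empty=True,
                           flags=re.UNICODE | re.MULTILINE | re.DOTALL):
    regexp = re.compile(pattern, flags)
    if gaps:
        # The pattern describes the separators, so split on it.
        tokens = regexp.split(text)
        if discard_empty:
            tokens = [tok for tok in tokens if tok]
        return tokens
    # The pattern describes the tokens themselves.
    return regexp.findall(text)

text = "Good muffins cost $3.88 in New York."

# Pattern matches tokens (gaps=False).
tokens_by_match = regexp_tokenize_sketch(text, r"\w+|\$[\d\.]+|\S+")
print(tokens_by_match)
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']

# Pattern matches separators (gaps=True).
tokens_by_gap = regexp_tokenize_sketch(text, r"\s+", gaps=True)
print(tokens_by_gap)
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']

# Empty tokens only appear when splitting on separators.
with_empties = regexp_tokenize_sketch(" a  b ", r" ", gaps=True,
                                      discard_empty=False)
print(with_empties)
# ['', 'a', '', 'b', '']
```

Note that the two modes disagree on the final period: matching tokens separates "York" from ".", while splitting on whitespace keeps "York." together.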
- span_tokenize(text)
Identify the tokens using integer offsets (start_i, end_i), where text[start_i:end_i] is the corresponding token.
- Return type
Iterator[Tuple[int, int]]
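The relationship between offsets and tokens can be sketched with re.finditer. This is an illustration under assumptions, not the library's code: span_tokenize_sketch is a hypothetical name, and the handling of gaps and discard_empty here is one straightforward reading of the parameter descriptions above.

```python
import re

# Hypothetical helper (not part of NLTK): yields (start, end) offsets so that
# text[start:end] is a token.
def span_tokenize_sketch(text, pattern, gaps=False, discard_empty=True):
    regexp = re.compile(pattern, re.UNICODE | re.MULTILINE | re.DOTALL)
    if not gaps:
        # Each match is a token; report where it sits in the text.
        for match in regexp.finditer(text):
            yield match.span()
        return
    # Each match is a separator; tokens are the stretches between matches.
    left = 0
    for match in regexp.finditer(text):
        right, next_left = match.span()
        if not discard_empty or right != left:
            yield left, right
        left = next_left
    if not discard_empty or left != len(text):
        yield left, len(text)

text = "Good muffins cost $3.88"

spans = list(span_tokenize_sketch(text, r"\w+|\$[\d\.]+|\S+"))
print(spans)
# [(0, 4), (5, 12), (13, 17), (18, 23)]

# Slicing the original text with each span recovers the tokens.
recovered = [text[start:end] for start, end in spans]
print(recovered)
# ['Good', 'muffins', 'cost', '$3.88']

# The same spans result from treating whitespace as the separator.
gap_spans = list(span_tokenize_sketch(text, r"\s+", gaps=True))
print(gap_spans == spans)
# True
```

Returning offsets rather than substrings keeps each token tied to its position in the original text, which is useful when annotations must be mapped back onto the source string.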
- span_tokenize_sents(strings: List[str]) → Iterator[List[Tuple[int, int]]]
Apply self.span_tokenize() to each element of strings, yielding one list of spans per string. Roughly: (list(self.span_tokenize(s)) for s in strings)
- Yield
List[Tuple[int, int]]
- Parameters
strings (List[str]) – The strings to tokenize.
- Return type
Iterator[List[Tuple[int, int]]]
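As a rough illustration of the per-string behaviour, the sketch below uses only the standard library. The name span_tokenize_sents_sketch is hypothetical, and the sketch covers only the case where the pattern matches tokens (gaps=False). It is written as a generator, consistent with the Iterator return type documented above.

```python
import re
from typing import Iterator, List, Tuple

# Hypothetical helper (not part of NLTK): yields one list of (start, end)
# spans for each input string.
def span_tokenize_sents_sketch(
    strings: List[str], pattern: str
) -> Iterator[List[Tuple[int, int]]]:
    regexp = re.compile(pattern, re.UNICODE | re.MULTILINE | re.DOTALL)
    for s in strings:
        yield [match.span() for match in regexp.finditer(s)]

sentences = ["Good muffins", "cost $3.88"]
span_iter = span_tokenize_sents_sketch(sentences, r"\w+|\$[\d\.]+|\S+")

# Nothing is tokenized until the iterator is consumed.
all_spans = list(span_iter)
print(all_spans)
# [[(0, 4), (5, 12)], [(0, 4), (5, 10)]]
```

The offsets in each inner list are relative to the start of their own string, not to a concatenation of all the strings.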