nltk.tokenize.RegexpTokenizer

class nltk.tokenize.RegexpTokenizer

Bases: TokenizerI

A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
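For example, on a short sentence this pattern picks out word runs, currency amounts, and any remaining non-whitespace (a minimal sketch; the output below is what the pattern should produce):

>>> tokenizer.tokenize("Good muffins cost $3.88 in New York.")
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']
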
Parameters
  • pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:…), instead.)

  • gaps (bool) – True if this tokenizer’s pattern should be used to find separators between tokens; False if this tokenizer’s pattern should be used to find the tokens themselves. (See the sketch after this parameter list.)

  • discard_empty (bool) – True if any empty tokens ‘’ generated by the tokenizer should be discarded. Empty tokens can only be generated if gaps == True.

  • flags (int) – The regexp flags used to compile this tokenizer’s pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
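
As a sketch of the gaps behavior: with gaps=True the pattern matches the separators rather than the tokens, so a whitespace pattern returns the whitespace-delimited chunks (gap_tokenizer is a hypothetical name, not part of the API):

>>> gap_tokenizer = RegexpTokenizer(r'\s+', gaps=True)  # hypothetical example
>>> gap_tokenizer.tokenize("Good muffins cost $3.88")
['Good', 'muffins', 'cost', '$3.88']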

__init__(pattern, gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL)
tokenize(text)

Return a tokenized copy of text.

Return type

List[str]

span_tokenize(text)

Identify the tokens using integer offsets (start_i, end_i), where text[start_i:end_i] is the corresponding token.

Return type

Iterator[Tuple[int, int]]
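
Continuing the doctest above, the spans for a short string should come out as follows (offsets computed from the pattern):

>>> list(tokenizer.span_tokenize("Good muffins cost $3.88"))
[(0, 4), (5, 12), (13, 17), (18, 23)]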

span_tokenize_sents(strings: List[str]) → Iterator[List[Tuple[int, int]]]

Apply self.span_tokenize() to each element of strings. I.e.:

for s in strings:
    yield list(self.span_tokenize(s))

Yields

List[Tuple[int, int]]

Parameters

strings (List[str]) – the strings to be span-tokenized

Return type

Iterator[List[Tuple[int, int]]]
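
For instance, reusing the tokenizer from the doctest above, each inner list holds the spans for one input string (a minimal sketch):

>>> list(tokenizer.span_tokenize_sents(["Good muffins", "cost $3.88"]))
[[(0, 4), (5, 12)], [(0, 4), (5, 10)]]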

tokenize_sents(strings: List[str]) → List[List[str]]

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]

Return type

List[List[str]]

Parameters

strings (List[str]) – the strings to be tokenized
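
For instance, reusing the tokenizer from the doctest above (a minimal sketch):

>>> tokenizer.tokenize_sents(["Good muffins", "cost $3.88"])
[['Good', 'muffins'], ['cost', '$3.88']]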