nltk.tokenize.word_tokenize

nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)

Return a tokenized copy of text, using NLTK’s recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).

Parameters
  • text (str) – text to split into words

  • language (str) – the name of the Punkt sentence-tokenizer model to use (for example, 'english')

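  • preserve_line (bool) – if True, skip sentence tokenization and word-tokenize the text as a single unit; if False (the default), split the text into sentences first and word-tokenize each sentence.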
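
The following is a minimal usage sketch of the parameters above, not part of the official reference. The helper names tokenize_single_line and tokenize_by_sentence are hypothetical wrappers introduced only for illustration. The sketch assumes NLTK is installed; the default path (preserve_line=False) also needs the Punkt model data, which may have to be fetched once with nltk.download (the resource name depends on the NLTK version).

```python
from nltk.tokenize import word_tokenize


def tokenize_single_line(text):
    """Hypothetical helper: word-tokenize without sentence splitting.

    With preserve_line=True the text is passed to the word tokenizer
    as one unit, so the Punkt sentence model is not consulted.
    """
    return word_tokenize(text, preserve_line=True)


def tokenize_by_sentence(text, language="english"):
    """Hypothetical helper: the default behaviour.

    The text is first split into sentences with the Punkt model for
    `language`, then each sentence is word-tokenized. This path
    requires the Punkt data to be available locally.
    """
    return word_tokenize(text, language=language, preserve_line=False)


# Punctuation is split off into separate tokens.
print(tokenize_single_line("Hello, world!"))
# → ['Hello', ',', 'world', '!']
```

One practical difference between the two modes: because only a sentence-final period is split off, preserve_line=True can leave periods in the middle of a multi-sentence string attached to the preceding word, whereas the default mode separates them after sentence splitting.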