nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)

Return a tokenized copy of text, using NLTK’s recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).

Parameters

text (str) – text to split into words
language (str) – the model name in the Punkt corpus
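preserve_line (bool) – if True, the text is kept as a single unit and is not split into sentences before word tokenization; if False (the default), the text is first split into sentences with the Punkt tokenizer for the given language.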
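
The parameters above can be illustrated with a short sketch. The helper name tokenize_single_line and the sample sentence are hypothetical and are not part of NLTK. The sketch assumes NLTK is installed; the default call (preserve_line=False) also needs the Punkt sentence model, which may have to be fetched separately with nltk.download (the resource name depends on the NLTK version), so that call is wrapped in a LookupError handler.

```python
from nltk.tokenize import word_tokenize


def tokenize_single_line(text):
    """Hypothetical helper: word-tokenize text without sentence splitting."""
    # preserve_line=True skips the sentence-tokenization step,
    # so the whole string is passed to the word tokenizer as one unit.
    return word_tokenize(text, preserve_line=True)


if __name__ == "__main__":
    sample = "Good muffins cost $3.88 in New York. Please buy me two of them."

    # No sentence splitting: only the word tokenizer runs.
    print(tokenize_single_line(sample))

    # Default behaviour: split into sentences first, then into words.
    # This path looks up the Punkt model for the given language.
    try:
        print(word_tokenize(sample, language="english"))
    except LookupError as err:
        print("Punkt data is not installed:", err)
```

With preserve_line=True, only the period that ends the whole string is guaranteed to be split off as its own token, because the word tokenizer sees no sentence boundaries inside the text.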