nltk.tokenize package
Submodules
- nltk.tokenize.api module
- nltk.tokenize.casual module
- nltk.tokenize.destructive module
  - MacIntyreContractions
  - NLTKWordTokenizer
    - NLTKWordTokenizer.CONTRACTIONS2
    - NLTKWordTokenizer.CONTRACTIONS3
    - NLTKWordTokenizer.CONVERT_PARENTHESES
    - NLTKWordTokenizer.DOUBLE_DASHES
    - NLTKWordTokenizer.ENDING_QUOTES
    - NLTKWordTokenizer.PARENS_BRACKETS
    - NLTKWordTokenizer.PUNCTUATION
    - NLTKWordTokenizer.STARTING_QUOTES
    - NLTKWordTokenizer.span_tokenize()
    - NLTKWordTokenizer.tokenize()
- nltk.tokenize.legality_principle module
- nltk.tokenize.mwe module
- nltk.tokenize.nist module
  - NISTTokenizer
    - NISTTokenizer.DASH_PRECEED_DIGIT
    - NISTTokenizer.INTERNATIONAL_REGEXES
    - NISTTokenizer.LANG_DEPENDENT_REGEXES
    - NISTTokenizer.NONASCII
    - NISTTokenizer.PERIOD_COMMA_FOLLOW
    - NISTTokenizer.PERIOD_COMMA_PRECEED
    - NISTTokenizer.PUNCT
    - NISTTokenizer.PUNCT_1
    - NISTTokenizer.PUNCT_2
    - NISTTokenizer.STRIP_EOL_HYPHEN
    - NISTTokenizer.STRIP_SKIP
    - NISTTokenizer.SYMBOLS
    - NISTTokenizer.international_tokenize()
    - NISTTokenizer.lang_independent_sub()
    - NISTTokenizer.number_regex
    - NISTTokenizer.punct_regex
    - NISTTokenizer.pup_number
    - NISTTokenizer.pup_punct
    - NISTTokenizer.pup_symbol
    - NISTTokenizer.symbol_regex
    - NISTTokenizer.tokenize()
- nltk.tokenize.punkt module
  - PunktBaseClass
  - PunktLanguageVars
  - PunktParameters
    - PunktParameters.__init__()
    - PunktParameters.abbrev_types
    - PunktParameters.add_ortho_context()
    - PunktParameters.clear_abbrevs()
    - PunktParameters.clear_collocations()
    - PunktParameters.clear_ortho_context()
    - PunktParameters.clear_sent_starters()
    - PunktParameters.collocations
    - PunktParameters.ortho_context
    - PunktParameters.sent_starters
  - PunktSentenceTokenizer
    - PunktSentenceTokenizer.PUNCTUATION
    - PunktSentenceTokenizer.__init__()
    - PunktSentenceTokenizer.debug_decisions()
    - PunktSentenceTokenizer.dump()
    - PunktSentenceTokenizer.sentences_from_text()
    - PunktSentenceTokenizer.sentences_from_text_legacy()
    - PunktSentenceTokenizer.sentences_from_tokens()
    - PunktSentenceTokenizer.span_tokenize()
    - PunktSentenceTokenizer.text_contains_sentbreak()
    - PunktSentenceTokenizer.tokenize()
    - PunktSentenceTokenizer.train()
  - PunktToken
    - PunktToken.__init__()
    - PunktToken.abbr
    - PunktToken.ellipsis
    - PunktToken.first_case
    - PunktToken.first_lower
    - PunktToken.first_upper
    - PunktToken.is_alpha
    - PunktToken.is_ellipsis
    - PunktToken.is_initial
    - PunktToken.is_non_punct
    - PunktToken.is_number
    - PunktToken.linestart
    - PunktToken.parastart
    - PunktToken.period_final
    - PunktToken.sentbreak
    - PunktToken.tok
    - PunktToken.type
    - PunktToken.type_no_period
    - PunktToken.type_no_sentperiod
  - PunktTokenizer
  - PunktTrainer
    - PunktTrainer.ABBREV
    - PunktTrainer.ABBREV_BACKOFF
    - PunktTrainer.COLLOCATION
    - PunktTrainer.IGNORE_ABBREV_PENALTY
    - PunktTrainer.INCLUDE_ABBREV_COLLOCS
    - PunktTrainer.INCLUDE_ALL_COLLOCS
    - PunktTrainer.MIN_COLLOC_FREQ
    - PunktTrainer.SENT_STARTER
    - PunktTrainer.__init__()
    - PunktTrainer.finalize_training()
    - PunktTrainer.find_abbrev_types()
    - PunktTrainer.freq_threshold()
    - PunktTrainer.get_params()
    - PunktTrainer.train()
    - PunktTrainer.train_tokens()
  - demo()
  - format_debug_decision()
  - load_punkt_params()
  - save_punkt_params()
- nltk.tokenize.regexp module
- nltk.tokenize.repp module
- nltk.tokenize.sexpr module
- nltk.tokenize.simple module
- nltk.tokenize.sonority_sequencing module
- nltk.tokenize.stanford module
- nltk.tokenize.stanford_segmenter module
- nltk.tokenize.texttiling module
- nltk.tokenize.toktok module
  - ToktokTokenizer
    - ToktokTokenizer.AMPERCENT
    - ToktokTokenizer.CLOSE_PUNCT
    - ToktokTokenizer.CLOSE_PUNCT_RE
    - ToktokTokenizer.COMMA_IN_NUM
    - ToktokTokenizer.CURRENCY_SYM
    - ToktokTokenizer.CURRENCY_SYM_RE
    - ToktokTokenizer.EN_EM_DASHES
    - ToktokTokenizer.FINAL_PERIOD_1
    - ToktokTokenizer.FINAL_PERIOD_2
    - ToktokTokenizer.FUNKY_PUNCT_1
    - ToktokTokenizer.FUNKY_PUNCT_2
    - ToktokTokenizer.LSTRIP
    - ToktokTokenizer.MULTI_COMMAS
    - ToktokTokenizer.MULTI_DASHES
    - ToktokTokenizer.MULTI_DOTS
    - ToktokTokenizer.NON_BREAKING
    - ToktokTokenizer.ONE_SPACE
    - ToktokTokenizer.OPEN_PUNCT
    - ToktokTokenizer.OPEN_PUNCT_RE
    - ToktokTokenizer.PIPE
    - ToktokTokenizer.PROB_SINGLE_QUOTES
    - ToktokTokenizer.RSTRIP
    - ToktokTokenizer.STUPID_QUOTES_1
    - ToktokTokenizer.STUPID_QUOTES_2
    - ToktokTokenizer.TAB
    - ToktokTokenizer.TOKTOK_REGEXES
    - ToktokTokenizer.URL_FOE_1
    - ToktokTokenizer.URL_FOE_2
    - ToktokTokenizer.URL_FOE_3
    - ToktokTokenizer.URL_FOE_4
    - ToktokTokenizer.tokenize()
- nltk.tokenize.treebank module
  - TreebankWordDetokenizer
    - TreebankWordDetokenizer.CONTRACTIONS2
    - TreebankWordDetokenizer.CONTRACTIONS3
    - TreebankWordDetokenizer.CONVERT_PARENTHESES
    - TreebankWordDetokenizer.DOUBLE_DASHES
    - TreebankWordDetokenizer.ENDING_QUOTES
    - TreebankWordDetokenizer.PARENS_BRACKETS
    - TreebankWordDetokenizer.PUNCTUATION
    - TreebankWordDetokenizer.STARTING_QUOTES
    - TreebankWordDetokenizer.detokenize()
    - TreebankWordDetokenizer.tokenize()
  - TreebankWordTokenizer
    - TreebankWordTokenizer.CONTRACTIONS2
    - TreebankWordTokenizer.CONTRACTIONS3
    - TreebankWordTokenizer.CONVERT_PARENTHESES
    - TreebankWordTokenizer.DOUBLE_DASHES
    - TreebankWordTokenizer.ENDING_QUOTES
    - TreebankWordTokenizer.PARENS_BRACKETS
    - TreebankWordTokenizer.PUNCTUATION
    - TreebankWordTokenizer.STARTING_QUOTES
    - TreebankWordTokenizer.span_tokenize()
    - TreebankWordTokenizer.tokenize()
- nltk.tokenize.util module
Module contents
NLTK Tokenizer Package
Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:
>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation:
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
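If the Punkt models mentioned above are not installed yet, they can be fetched with the NLTK downloader. The resource name differs between NLTK releases, so the call below is a sketch rather than a fixed recipe:
>>> import nltk
>>> nltk.download('punkt_tab')   # on older releases the resource is named 'punkt'
True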
We can also operate at the level of sentences, using the sentence tokenizer directly as follows:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).
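For example (a minimal sketch; the byte string is illustrative), decode bytes to str before passing them to a tokenizer:
>>> data = b"Good muffins cost $3.88"
>>> word_tokenize(data.decode("utf8"))
['Good', 'muffins', 'cost', '$', '3.88']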
NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)
>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
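Because these spans have the same semantics as string slices, the matching substrings can be read back directly from them, for example:
>>> [s[start:end] for start, end in WhitespaceTokenizer().span_tokenize(s)]
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please',
'buy', 'me', 'two', 'of', 'them.', 'Thanks.']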
There are numerous ways to tokenize text. If you need more control over tokenization, see the other methods provided in this package.
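For instance, a custom RegexpTokenizer (a sketch in the spirit of the NLTK book; the pattern here is illustrative, not a recommended default) can keep currency amounts together as single tokens:
>>> from nltk.tokenize import RegexpTokenizer
>>> RegexpTokenizer(r'\w+|\$[\d\.]+|\S+').tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']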
For further information, please see Chapter 3 of the NLTK book.
- nltk.tokenize.sent_tokenize(text, language='english')[source]
  Return a sentence-tokenized copy of text, using NLTK’s recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
  Parameters:
    - text – text to split into sentences
    - language – the model name in the Punkt corpus
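  As an illustrative sketch (assuming the Punkt model for the chosen language has been downloaded), a non-English model can be selected through the language argument:
  >>> sent_tokenize("Ceci est une phrase. En voici une autre.", language='french')
  ['Ceci est une phrase.', 'En voici une autre.']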
 
 
- nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)[source]
  Return a tokenized copy of text, using NLTK’s recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
  Parameters:
    - text (str) – text to split into words
    - language (str) – the model name in the Punkt corpus
    - preserve_line (bool) – a flag to decide whether to sentence tokenize the text or not
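  A minimal sketch of the preserve_line flag (the input string is illustrative and the exact output may vary across NLTK versions): with preserve_line=True the text is not sentence-tokenized first, so a sentence-final period in the middle of the string stays attached to the preceding word:
  >>> word_tokenize("Buy muffins. Thanks.", preserve_line=True)
  ['Buy', 'muffins.', 'Thanks', '.']
  >>> word_tokenize("Buy muffins. Thanks.")
  ['Buy', 'muffins', '.', 'Thanks', '.']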