nltk.tokenize package¶
Submodules¶
- nltk.tokenize.api module
- nltk.tokenize.casual module
- nltk.tokenize.destructive module
MacIntyreContractions, NLTKWordTokenizer (CONTRACTIONS2, CONTRACTIONS3, CONVERT_PARENTHESES, DOUBLE_DASHES, ENDING_QUOTES, PARENS_BRACKETS, PUNCTUATION, STARTING_QUOTES, span_tokenize(), tokenize())
- nltk.tokenize.legality_principle module
- nltk.tokenize.mwe module
- nltk.tokenize.nist module
NISTTokenizer (DASH_PRECEED_DIGIT, INTERNATIONAL_REGEXES, LANG_DEPENDENT_REGEXES, NONASCII, PERIOD_COMMA_FOLLOW, PERIOD_COMMA_PRECEED, PUNCT, PUNCT_1, PUNCT_2, STRIP_EOL_HYPHEN, STRIP_SKIP, SYMBOLS, international_tokenize(), lang_independent_sub(), number_regex, punct_regex, pup_number, pup_punct, pup_symbol, symbol_regex, tokenize())
- nltk.tokenize.punkt module
PunktBaseClass, PunktLanguageVars, PunktParameters (__init__(), abbrev_types, add_ortho_context(), clear_abbrevs(), clear_collocations(), clear_ortho_context(), clear_sent_starters(), collocations, ortho_context, sent_starters)
PunktSentenceTokenizer (PUNCTUATION, __init__(), debug_decisions(), dump(), sentences_from_text(), sentences_from_text_legacy(), sentences_from_tokens(), span_tokenize(), text_contains_sentbreak(), tokenize(), train())
PunktToken (__init__(), abbr, ellipsis, first_case, first_lower, first_upper, is_alpha, is_ellipsis, is_initial, is_non_punct, is_number, linestart, parastart, period_final, sentbreak, tok, type, type_no_period, type_no_sentperiod)
PunktTokenizer, PunktTrainer (ABBREV, ABBREV_BACKOFF, COLLOCATION, IGNORE_ABBREV_PENALTY, INCLUDE_ABBREV_COLLOCS, INCLUDE_ALL_COLLOCS, MIN_COLLOC_FREQ, SENT_STARTER, __init__(), finalize_training(), find_abbrev_types(), freq_threshold(), get_params(), train(), train_tokens())
demo(), format_debug_decision(), load_punkt_params(), save_punkt_params()
- nltk.tokenize.regexp module
- nltk.tokenize.repp module
- nltk.tokenize.sexpr module
- nltk.tokenize.simple module
- nltk.tokenize.sonority_sequencing module
- nltk.tokenize.stanford module
- nltk.tokenize.stanford_segmenter module
- nltk.tokenize.texttiling module
- nltk.tokenize.toktok module
ToktokTokenizer (AMPERCENT, CLOSE_PUNCT, CLOSE_PUNCT_RE, COMMA_IN_NUM, CURRENCY_SYM, CURRENCY_SYM_RE, EN_EM_DASHES, FINAL_PERIOD_1, FINAL_PERIOD_2, FUNKY_PUNCT_1, FUNKY_PUNCT_2, LSTRIP, MULTI_COMMAS, MULTI_DASHES, MULTI_DOTS, NON_BREAKING, ONE_SPACE, OPEN_PUNCT, OPEN_PUNCT_RE, PIPE, PROB_SINGLE_QUOTES, RSTRIP, STUPID_QUOTES_1, STUPID_QUOTES_2, TAB, TOKTOK_REGEXES, URL_FOE_1, URL_FOE_2, URL_FOE_3, URL_FOE_4, tokenize())
- nltk.tokenize.treebank module
TreebankWordDetokenizer (CONTRACTIONS2, CONTRACTIONS3, CONVERT_PARENTHESES, DOUBLE_DASHES, ENDING_QUOTES, PARENS_BRACKETS, PUNCTUATION, STARTING_QUOTES, detokenize(), tokenize())
TreebankWordTokenizer (CONTRACTIONS2, CONTRACTIONS3, CONVERT_PARENTHESES, DOUBLE_DASHES, ENDING_QUOTES, PARENS_BRACKETS, PUNCTUATION, STARTING_QUOTES, span_tokenize(), tokenize())
- nltk.tokenize.util module
Module contents¶
NLTK Tokenizer Package
Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:
>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation:
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
We can also operate at the level of sentences, using the sentence tokenizer directly as follows:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
Caution: when tokenizing a Unicode string, make sure you are not
using an encoded version of the string (it may be necessary to
decode it first, e.g. with s.decode("utf8")).
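For example, a byte string read from a file should be decoded before it is passed to a tokenizer (a minimal sketch; the byte string below is illustrative):
>>> raw = b'Good muffins cost $3.88'  # encoded bytes, e.g. read from a file
>>> word_tokenize(raw.decode("utf8"))
['Good', 'muffins', 'cost', '$', '3.88']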
NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)
>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
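Since each span is a (start, end) pair with string-slice semantics, it can be mapped back onto the original string to recover the corresponding substrings:
>>> [s[start:end] for (start, end) in WhitespaceTokenizer().span_tokenize(s)]
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please', 'buy',
'me', 'two', 'of', 'them.', 'Thanks.']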
There are numerous ways to tokenize text. If you need more control over tokenization, see the other methods provided in this package.
For further information, please see Chapter 3 of the NLTK book.
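For example, RegexpTokenizer (from the nltk.tokenize.regexp module) accepts a user-supplied pattern; the pattern below, which keeps currency amounts together, is just one illustrative choice:
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']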
- nltk.tokenize.sent_tokenize(text, language='english')[source]¶
Return a sentence-tokenized copy of text, using NLTK’s recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
Parameters:
text – text to split into sentences
language – the model name in the Punkt corpus
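A short usage sketch of the language parameter (assuming the corresponding Punkt model is installed; the sample sentences are illustrative):
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Wie geht es dir? Mir geht es gut.", language='german')
['Wie geht es dir?', 'Mir geht es gut.']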
- nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)[source]¶
Return a tokenized copy of text, using NLTK’s recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
Parameters:
text (str) – text to split into words
language (str) – the model name in the Punkt corpus
preserve_line (bool) – a flag to decide whether to sentence tokenize the text or not; if True, sentence tokenization is skipped and the text is tokenized as a single line
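A sketch of the preserve_line flag, reusing s from the examples above: with preserve_line=True the text is not split into sentences first, so sentence-internal final periods (as in 'York.') stay attached and only the string-final period is separated (output as produced by a recent NLTK version):
>>> word_tokenize(s, preserve_line=True)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please',
'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']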