nltk.tokenize package
Submodules
- nltk.tokenize.api module
- nltk.tokenize.casual module
- nltk.tokenize.destructive module
  - MacIntyreContractions
  - NLTKWordTokenizer
    - NLTKWordTokenizer.CONTRACTIONS2
    - NLTKWordTokenizer.CONTRACTIONS3
    - NLTKWordTokenizer.CONVERT_PARENTHESES
    - NLTKWordTokenizer.DOUBLE_DASHES
    - NLTKWordTokenizer.ENDING_QUOTES
    - NLTKWordTokenizer.PARENS_BRACKETS
    - NLTKWordTokenizer.PUNCTUATION
    - NLTKWordTokenizer.STARTING_QUOTES
    - NLTKWordTokenizer.span_tokenize()
    - NLTKWordTokenizer.tokenize()
- nltk.tokenize.legality_principle module
- nltk.tokenize.mwe module
- nltk.tokenize.nist module
  - NISTTokenizer
    - NISTTokenizer.DASH_PRECEED_DIGIT
    - NISTTokenizer.INTERNATIONAL_REGEXES
    - NISTTokenizer.LANG_DEPENDENT_REGEXES
    - NISTTokenizer.NONASCII
    - NISTTokenizer.PERIOD_COMMA_FOLLOW
    - NISTTokenizer.PERIOD_COMMA_PRECEED
    - NISTTokenizer.PUNCT
    - NISTTokenizer.PUNCT_1
    - NISTTokenizer.PUNCT_2
    - NISTTokenizer.STRIP_EOL_HYPHEN
    - NISTTokenizer.STRIP_SKIP
    - NISTTokenizer.SYMBOLS
    - NISTTokenizer.international_tokenize()
    - NISTTokenizer.lang_independent_sub()
    - NISTTokenizer.number_regex
    - NISTTokenizer.punct_regex
    - NISTTokenizer.pup_number
    - NISTTokenizer.pup_punct
    - NISTTokenizer.pup_symbol
    - NISTTokenizer.symbol_regex
    - NISTTokenizer.tokenize()
- nltk.tokenize.punkt module
  - PunktBaseClass
  - PunktLanguageVars
  - PunktParameters
    - PunktParameters.__init__()
    - PunktParameters.abbrev_types
    - PunktParameters.add_ortho_context()
    - PunktParameters.clear_abbrevs()
    - PunktParameters.clear_collocations()
    - PunktParameters.clear_ortho_context()
    - PunktParameters.clear_sent_starters()
    - PunktParameters.collocations
    - PunktParameters.ortho_context
    - PunktParameters.sent_starters
  - PunktSentenceTokenizer
    - PunktSentenceTokenizer.PUNCTUATION
    - PunktSentenceTokenizer.__init__()
    - PunktSentenceTokenizer.debug_decisions()
    - PunktSentenceTokenizer.dump()
    - PunktSentenceTokenizer.sentences_from_text()
    - PunktSentenceTokenizer.sentences_from_text_legacy()
    - PunktSentenceTokenizer.sentences_from_tokens()
    - PunktSentenceTokenizer.span_tokenize()
    - PunktSentenceTokenizer.text_contains_sentbreak()
    - PunktSentenceTokenizer.tokenize()
    - PunktSentenceTokenizer.train()
  - PunktToken
    - PunktToken.__init__()
    - PunktToken.abbr
    - PunktToken.ellipsis
    - PunktToken.first_case
    - PunktToken.first_lower
    - PunktToken.first_upper
    - PunktToken.is_alpha
    - PunktToken.is_ellipsis
    - PunktToken.is_initial
    - PunktToken.is_non_punct
    - PunktToken.is_number
    - PunktToken.linestart
    - PunktToken.parastart
    - PunktToken.period_final
    - PunktToken.sentbreak
    - PunktToken.tok
    - PunktToken.type
    - PunktToken.type_no_period
    - PunktToken.type_no_sentperiod
  - PunktTokenizer
  - PunktTrainer
    - PunktTrainer.ABBREV
    - PunktTrainer.ABBREV_BACKOFF
    - PunktTrainer.COLLOCATION
    - PunktTrainer.IGNORE_ABBREV_PENALTY
    - PunktTrainer.INCLUDE_ABBREV_COLLOCS
    - PunktTrainer.INCLUDE_ALL_COLLOCS
    - PunktTrainer.MIN_COLLOC_FREQ
    - PunktTrainer.SENT_STARTER
    - PunktTrainer.__init__()
    - PunktTrainer.finalize_training()
    - PunktTrainer.find_abbrev_types()
    - PunktTrainer.freq_threshold()
    - PunktTrainer.get_params()
    - PunktTrainer.train()
    - PunktTrainer.train_tokens()
  - demo()
  - format_debug_decision()
  - load_punkt_params()
  - save_punkt_params()
- nltk.tokenize.regexp module
- nltk.tokenize.repp module
- nltk.tokenize.sexpr module
- nltk.tokenize.simple module
- nltk.tokenize.sonority_sequencing module
- nltk.tokenize.stanford module
- nltk.tokenize.stanford_segmenter module
- nltk.tokenize.texttiling module
- nltk.tokenize.toktok module
  - ToktokTokenizer
    - ToktokTokenizer.AMPERCENT
    - ToktokTokenizer.CLOSE_PUNCT
    - ToktokTokenizer.CLOSE_PUNCT_RE
    - ToktokTokenizer.COMMA_IN_NUM
    - ToktokTokenizer.CURRENCY_SYM
    - ToktokTokenizer.CURRENCY_SYM_RE
    - ToktokTokenizer.EN_EM_DASHES
    - ToktokTokenizer.FINAL_PERIOD_1
    - ToktokTokenizer.FINAL_PERIOD_2
    - ToktokTokenizer.FUNKY_PUNCT_1
    - ToktokTokenizer.FUNKY_PUNCT_2
    - ToktokTokenizer.LSTRIP
    - ToktokTokenizer.MULTI_COMMAS
    - ToktokTokenizer.MULTI_DASHES
    - ToktokTokenizer.MULTI_DOTS
    - ToktokTokenizer.NON_BREAKING
    - ToktokTokenizer.ONE_SPACE
    - ToktokTokenizer.OPEN_PUNCT
    - ToktokTokenizer.OPEN_PUNCT_RE
    - ToktokTokenizer.PIPE
    - ToktokTokenizer.PROB_SINGLE_QUOTES
    - ToktokTokenizer.RSTRIP
    - ToktokTokenizer.STUPID_QUOTES_1
    - ToktokTokenizer.STUPID_QUOTES_2
    - ToktokTokenizer.TAB
    - ToktokTokenizer.TOKTOK_REGEXES
    - ToktokTokenizer.URL_FOE_1
    - ToktokTokenizer.URL_FOE_2
    - ToktokTokenizer.URL_FOE_3
    - ToktokTokenizer.URL_FOE_4
    - ToktokTokenizer.tokenize()
- nltk.tokenize.treebank module
  - TreebankWordDetokenizer
    - TreebankWordDetokenizer.CONTRACTIONS2
    - TreebankWordDetokenizer.CONTRACTIONS3
    - TreebankWordDetokenizer.CONVERT_PARENTHESES
    - TreebankWordDetokenizer.DOUBLE_DASHES
    - TreebankWordDetokenizer.ENDING_QUOTES
    - TreebankWordDetokenizer.PARENS_BRACKETS
    - TreebankWordDetokenizer.PUNCTUATION
    - TreebankWordDetokenizer.STARTING_QUOTES
    - TreebankWordDetokenizer.detokenize()
    - TreebankWordDetokenizer.tokenize()
  - TreebankWordTokenizer
    - TreebankWordTokenizer.CONTRACTIONS2
    - TreebankWordTokenizer.CONTRACTIONS3
    - TreebankWordTokenizer.CONVERT_PARENTHESES
    - TreebankWordTokenizer.DOUBLE_DASHES
    - TreebankWordTokenizer.ENDING_QUOTES
    - TreebankWordTokenizer.PARENS_BRACKETS
    - TreebankWordTokenizer.PUNCTUATION
    - TreebankWordTokenizer.STARTING_QUOTES
    - TreebankWordTokenizer.span_tokenize()
    - TreebankWordTokenizer.tokenize()
- nltk.tokenize.util module
Module contents
NLTK Tokenizer Package
Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:
>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation:
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
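As noted above, word_tokenize depends on the Punkt sentence tokenization models. If they are not installed yet, a minimal way to fetch them is via the NLTK downloader; this sketch assumes network access and a writable NLTK data directory, and recent NLTK releases name the resource punkt_tab rather than punkt:
>>> import nltk
>>> nltk.download('punkt')   # or nltk.download('punkt_tab') on recent NLTK versions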
We can also operate at the level of sentences, using the sentence tokenizer directly as follows:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).
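In Python 3 this means decoding a bytes object to str before passing it to a tokenizer. A minimal sketch (the bytes literal below is purely illustrative):
>>> raw = b'Good muffins cost $3.88'   # an encoded (bytes) version of the text
>>> word_tokenize(raw.decode('utf8'))  # decode first, then tokenize
['Good', 'muffins', 'cost', '$', '3.88']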
NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)
>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
There are numerous ways to tokenize text. If you need more control over tokenization, see the other methods provided in this package.
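For instance, a custom regular expression can be supplied to RegexpTokenizer; the pattern below is purely illustrative and keeps currency amounts such as $3.88 together as single tokens:
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']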
For further information, please see Chapter 3 of the NLTK book.
- nltk.tokenize.sent_tokenize(text, language='english')
  Return a sentence-tokenized copy of text, using NLTK’s recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
  Parameters:
    text – text to split into sentences
    language – the model name in the Punkt corpus
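  For example, a non-English model can be selected by name; the French strings below are illustrative and assume the French Punkt model is available locally:
  >>> from nltk.tokenize import sent_tokenize
  >>> sent_tokenize("Ceci est une phrase. En voici une autre.", language='french')
  ['Ceci est une phrase.', 'En voici une autre.']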
- nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)
  Return a tokenized copy of text, using NLTK’s recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
  Parameters:
    text (str) – text to split into words
    language (str) – the model name in the Punkt corpus
    preserve_line (bool) – if True, skip the sentence-tokenization step and tokenize the text as a single line; if False (the default), sentence-tokenize the text first.
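  For example, with preserve_line=True the text is word-tokenized as-is, without sentence splitting, so interior sentence-final periods are typically left attached to the preceding word. A sketch (exact output may vary slightly across NLTK versions):
  >>> from nltk.tokenize import word_tokenize
  >>> word_tokenize("Good muffins cost $3.88 in New York. Please buy me two of them.", preserve_line=True)
  ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.',
  'Please', 'buy', 'me', 'two', 'of', 'them', '.']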