nltk.test.unit.test_tokenize module

Unit tests for nltk.tokenize. See also nltk/test/tokenize.doctest

nltk.test.unit.test_tokenize.load_stanford_segmenter()[source]
class nltk.test.unit.test_tokenize.TestTokenize[source]

Bases: object

test_tweet_tokenizer()[source]

Test TweetTokenizer using words with special and accented characters.
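A minimal usage sketch (the example string is illustrative, not the test fixture): TweetTokenizer keeps emoticons, hashtags and accented words intact.

    from nltk.tokenize import TweetTokenizer

    tknzr = TweetTokenizer()
    # Emoticons, the hashtag and the accented words come out as single tokens.
    print(tknzr.tokenize("This is a cooool #dummysmiley: :-) :-P <3 and café naïve"))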

test_tweet_tokenizer_expanded(test_input: str, expecteds: Tuple[List[str], List[str]])[source]

Test match_phone_numbers in TweetTokenizer.

Note that TweetTokenizer is also passed the following for these tests (see the usage sketch after the parameter list):
  • strip_handles=True

  • reduce_len=True

Parameters
  • test_input (str) – The input string to tokenize using TweetTokenizer.

  • expecteds (Tuple[List[str], List[str]]) – A 2-tuple of tokenized sentences. The first of the two tokenized lists is the expected output of tokenization with match_phone_numbers=True. The second of the two tokenized lists is the expected output of tokenization with match_phone_numbers=False.
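A minimal sketch of the two configurations this test compares (the example string is illustrative, not one of the test fixtures):

    from nltk.tokenize import TweetTokenizer

    text = "@someone call meeeee at 123-456-7890 :-)"

    with_phone = TweetTokenizer(strip_handles=True, reduce_len=True, match_phone_numbers=True)
    without_phone = TweetTokenizer(strip_handles=True, reduce_len=True, match_phone_numbers=False)

    # match_phone_numbers controls whether phone-number-like spans are matched
    # as a single token; strip_handles drops "@someone" and reduce_len shortens
    # "meeeee" to "meee".
    print(with_phone.tokenize(text))
    print(without_phone.tokenize(text))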

test_sonority_sequencing_syllable_tokenizer()[source]

Test the SyllableTokenizer (Sonority Sequencing Principle).

test_syllable_tokenizer_numbers()[source]

Test SyllableTokenizer on input containing numbers.
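A minimal sketch of the tokenizer under test: SyllableTokenizer splits a single word into syllables according to the Sonority Sequencing Principle.

    from nltk.tokenize import SyllableTokenizer

    ssp = SyllableTokenizer()
    # Splits one word into syllables, e.g. ['jus', 'ti', 'fi', 'ca', 'tion'].
    print(ssp.tokenize("justification"))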

test_legality_principle_syllable_tokenizer()[source]

Test the LegalitySyllableTokenizer.
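A minimal sketch, assuming the NLTK "words" corpus is available (nltk.download("words")): LegalitySyllableTokenizer derives its set of legal onsets from a tokenized corpus passed to the constructor.

    from nltk.corpus import words
    from nltk.tokenize import LegalitySyllableTokenizer

    lp = LegalitySyllableTokenizer(words.words())
    print(lp.tokenize("wonderful"))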

test_stanford_segmenter_arabic()[source]

Test the Stanford Word Segmenter for Arabic (default config)

test_stanford_segmenter_chinese()[source]

Test the Stanford Word Segmenter for Chinese (default config)
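A minimal sketch, assuming the Stanford Word Segmenter is installed and discoverable on the local machine; the test suite skips these tests when load_stanford_segmenter() fails.

    from nltk.tokenize.stanford_segmenter import StanfordSegmenter

    seg = StanfordSegmenter()
    # default_config("ar") or default_config("zh") selects the bundled Arabic
    # or Chinese model from the Stanford segmenter distribution.
    seg.default_config("zh")
    print(seg.segment("这是斯坦福中文分词器测试"))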

test_phone_tokenizer()[source]

Test a string that resembles a phone number but contains a newline

test_emoji_tokenizer()[source]

Test a string that contains Emoji ZWJ Sequences and skin tone modifiers

test_pad_asterisk()[source]

Test padding of asterisk for word tokenization.

test_pad_dotdot()[source]

Test padding of multi-dot sequences ("..", "...", etc.) for word tokenization.
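A minimal sketch of the padding behaviour the two tests above cover (requires the Punkt sentence model): word_tokenize is expected to split asterisks and multi-dot sequences off as separate tokens.

    from nltk import word_tokenize

    # "*" and ".." should come out as their own tokens rather than staying
    # glued to the neighbouring words.
    print(word_tokenize("This is a, *weird sentence with *asterisks in it."))
    print(word_tokenize("Why did dinner..fail?"))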

test_remove_handle()[source]

Test remove_handles() from casual.py with specially crafted edge cases
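A minimal sketch of the helper under test; the public function in nltk.tokenize.casual is remove_handles, which replaces Twitter-style @username handles with whitespace.

    from nltk.tokenize.casual import remove_handles

    print(remove_handles("@NLTK_org thanks, @some_user!"))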

test_treebank_span_tokenizer()[source]

Test TreebankWordTokenizer.span_tokenize function
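A minimal sketch: span_tokenize yields (start, end) character offsets into the original string rather than the token strings themselves.

    from nltk.tokenize import TreebankWordTokenizer

    text = "Good muffins cost $3.88 in New York."
    spans = list(TreebankWordTokenizer().span_tokenize(text))
    print(spans)
    print([text[start:end] for start, end in spans])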

test_word_tokenize()[source]

Test word_tokenize function
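A minimal sketch of the function under test: word_tokenize splits the text into sentences with the Punkt model and then applies the Treebank word tokenizer to each sentence (the Punkt model must be downloaded first).

    from nltk import word_tokenize

    # Contractions are split, e.g. "They'll" -> "They", "'ll".
    print(word_tokenize("They'll save and invest more."))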

test_punkt_pair_iter()[source]
test_punkt_pair_iter_handles_stop_iteration_exception()[source]
test_punkt_tokenize_words_handles_stop_iteration_exception()[source]
test_punkt_tokenize_custom_lang_vars()[source]
test_punkt_tokenize_no_custom_lang_vars()[source]
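A minimal sketch of the pattern the custom-lang-vars tests exercise (the subclass name and the extra terminator are illustrative): PunktSentenceTokenizer accepts a PunktLanguageVars instance that redefines the sentence-ending characters.

    from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer

    class DandaLanguageVars(PunktLanguageVars):
        # Treat the Devanagari danda as an additional sentence terminator.
        sent_end_chars = (".", "?", "!", "\u0964")

    tokenizer = PunktSentenceTokenizer(lang_vars=DandaLanguageVars())
    print(tokenizer.tokenize("Hello world\u0964 Hello world again\u0964"))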
punkt_debug_decisions(input_text, n_sents, n_splits, lang_vars=None)[source]
test_punkt_debug_decisions_custom_end()[source]
test_sent_tokenize(sentences: str, expected: List[str])[source]
Parameters
  • sentences (str) – The input text to split into sentences.

  • expected (List[str]) – The expected list of sentences.
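A minimal usage sketch of the function under test (requires the Punkt sentence model):

    from nltk import sent_tokenize

    text = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks."
    print(sent_tokenize(text))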