nltk.tokenize.mwe module

Multi-Word Expression Tokenizer

A MWETokenizer takes a string which has already been divided into tokens and retokenizes it, merging multi-word expressions into single tokens, using a lexicon of MWEs:

>>> from nltk.tokenize import MWETokenizer
>>> tokenizer = MWETokenizer([('a', 'little'), ('a', 'little', 'bit'), ('a', 'lot')])
>>> tokenizer.add_mwe(('in', 'spite', 'of'))
>>> tokenizer.tokenize('Testing testing testing one two three'.split())
['Testing', 'testing', 'testing', 'one', 'two', 'three']
>>> tokenizer.tokenize('This is a test in spite'.split())
['This', 'is', 'a', 'test', 'in', 'spite']
>>> tokenizer.tokenize('In a little or a little bit or a lot in spite of'.split())
['In', 'a_little', 'or', 'a_little_bit', 'or', 'a_lot', 'in_spite_of']
class nltk.tokenize.mwe.MWETokenizer[source]

Bases: nltk.tokenize.api.TokenizerI

A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.

__init__(mwes=None, separator='_')[source]

Initialize the multi-word tokenizer with a list of expressions and a separator.

Parameters
  • mwes (list(list(str))) – A sequence of multi-word expressions to be merged, where each MWE is a sequence of strings.

  • separator (str) – String that should be inserted between words in a multi-word expression token. (Default is ‘_’)


add_mwe(mwe)[source]

Add a multi-word expression to the lexicon (stored as a word trie).

We use util.Trie to represent the trie. Its form is a dict of dicts. The key True marks the end of a valid MWE.

Parameters

mwe (tuple(str) or list(str)) – The multi-word expression we’re adding into the word trie


>>> tokenizer = MWETokenizer()
>>> tokenizer.add_mwe(('a', 'b'))
>>> tokenizer.add_mwe(('a', 'b', 'c'))
>>> tokenizer.add_mwe(('a', 'x'))
>>> expected = {'a': {'x': {True: None}, 'b': {True: None, 'c': {True: None}}}}
>>> tokenizer._mwes == expected
True
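For readers without NLTK at hand, the insertion logic can be reproduced with plain dicts. This is a minimal sketch of what add_mwe does internally, not the library's actual code (the real implementation uses nltk.util.Trie):

```python
def add_mwe(trie, mwe):
    """Insert one multi-word expression into a dict-of-dicts trie.

    Each token maps to a nested dict; the key True marks the end of a
    complete MWE (mirroring nltk.util.Trie's convention).
    """
    node = trie
    for token in mwe:
        node = node.setdefault(token, {})
    node[True] = None  # end-of-expression marker


trie = {}
add_mwe(trie, ('a', 'b'))
add_mwe(trie, ('a', 'b', 'c'))
add_mwe(trie, ('a', 'x'))
print(trie)
# {'a': {'b': {True: None, 'c': {True: None}}, 'x': {True: None}}}
```

Because True can never collide with a string token, no separate end-of-word sentinel is needed.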

tokenize(text)[source]

Parameters

text (list(str)) – A list containing tokenized text

Returns

A list of the tokenized text with multi-word expressions merged together

Return type

list(str)

>>> tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator='+')
>>> tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
['An', "hors+d'oeuvre", 'tonight,', 'sir?']
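The merging itself amounts to a greedy longest-match walk over the word trie. The sketch below (plain Python over the dict-of-dicts trie described above; an approximation, not NLTK's exact code) shows why 'a little bit' wins over the shorter prefix 'a little':

```python
def mwe_tokenize(text, trie, separator='_'):
    """Greedy longest-match merge of MWEs in a token list (sketch)."""
    result, i = [], 0
    while i < len(text):
        node, j, end = trie, i, -1
        # Walk the trie as far as the input allows, remembering the
        # last position where a complete MWE ended.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if True in node:
                end = j
        if end > i:
            # Merge the longest complete MWE found at position i.
            result.append(separator.join(text[i:end]))
            i = end
        else:
            result.append(text[i])
            i += 1
    return result


trie = {'a': {'little': {True: None, 'bit': {True: None}}}}
print(mwe_tokenize('In a little bit'.split(), trie))
# ['In', 'a_little_bit']
```

Note that matching is exact and token-based: 'In' is not merged because the lexicon only contains lowercase 'a'.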