Language Model Vocabulary
- class nltk.lm.vocabulary.Vocabulary
Stores language model vocabulary.
Satisfies two common language modeling requirements for a vocabulary:
- When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
- Adds a special “unknown” token which unseen words are mapped to.
>>> words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']
>>> from nltk.lm import Vocabulary
>>> vocab = Vocabulary(words, unk_cutoff=2)
Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary.
>>> vocab['c']
3
>>> 'c' in vocab
True
>>> vocab['d']
2
>>> 'd' in vocab
True
Tokens with frequency counts less than the cutoff value will be considered not part of the vocabulary even though their entries in the count dictionary are preserved.
>>> vocab['b']
1
>>> 'b' in vocab
False
>>> vocab['aliens']
0
>>> 'aliens' in vocab
False
Keeping the count entries for seen words allows us to change the cutoff value without having to recalculate the counts.
>>> vocab2 = Vocabulary(vocab.counts, unk_cutoff=1)
>>> "b" in vocab2
True
The cutoff value influences not only membership checking but also the result of getting the size of the vocabulary using the built-in len. Note that while the number of keys in the vocabulary’s counter stays the same, the items in the vocabulary differ depending on the cutoff. We use sorted to demonstrate because it keeps the order consistent.
>>> sorted(vocab2.counts)
['-', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab2)
['-', '<UNK>', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab.counts)
['-', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab)
['<UNK>', 'a', 'c', 'd']
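To make the counter/vocabulary distinction concrete, here is a small sketch (rebuilding the same `vocab` and `vocab2` objects as above) showing how `len` honours the cutoff while the underlying counter does not:

```python
from nltk.lm import Vocabulary

words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']
vocab = Vocabulary(words, unk_cutoff=2)
vocab2 = Vocabulary(vocab.counts, unk_cutoff=1)

# The counter always has one entry per distinct seen token.
print(len(vocab.counts))   # 6 distinct tokens were seen
# The vocabulary's own length honours the cutoff (plus the "<UNK>" token).
print(len(vocab))          # 4: '<UNK>', 'a', 'c', 'd'
print(len(vocab2))         # 7: every seen token plus '<UNK>'
```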
In addition to the items it is populated with, the vocabulary stores a special token that stands in for so-called “unknown” items. By default it’s “<UNK>”.
>>> "<UNK>" in vocab
True
We can look up words in a vocabulary using its lookup method. “Unseen” words (with counts less than the cutoff) are looked up as the unknown label. If given one word (a string) as input, this method will return a string.
>>> vocab.lookup("a")
'a'
>>> vocab.lookup("aliens")
'<UNK>'
If given a sequence, it will return a tuple of the looked-up words.
>>> vocab.lookup(["p", 'a', 'r', 'd', 'b', 'c'])
('<UNK>', 'a', '<UNK>', 'd', '<UNK>', 'c')
It’s possible to update the counts after the vocabulary has been created. In general, the interface is the same as that of collections.Counter.
>>> vocab['b']
1
>>> vocab.update(["b", "b", "c"])
>>> vocab['b']
3
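Because membership is recomputed from the counts, updating them can move a word across the cutoff. A quick sketch, rebuilding the same `vocab` as above:

```python
from nltk.lm import Vocabulary

words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']
vocab = Vocabulary(words, unk_cutoff=2)

print('b' in vocab)        # False: count 1 is below the cutoff of 2
vocab.update(["b", "b", "c"])
print('b' in vocab)        # True: count is now 3
print(vocab.lookup('b'))   # 'b', no longer mapped to '<UNK>'
```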
- __init__(counts=None, unk_cutoff=1, unk_label='<UNK>')
Create a new Vocabulary.
counts – Optional iterable or collections.Counter instance to pre-seed the Vocabulary. If it is an iterable, counts are calculated from it.
unk_cutoff (int) – Words that occur less frequently than this value are not considered part of the vocabulary.
unk_label – Label for marking words not part of vocabulary.
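As a sketch of these parameters: counts accepts either raw tokens or a pre-built collections.Counter, and unk_label replaces the default “<UNK>” (the label '<oov>' below is just an illustrative choice):

```python
from collections import Counter
from nltk.lm import Vocabulary

# Pre-seeding with a Counter skips the counting step.
counts = Counter(['a', 'a', 'a', 'b'])
vocab = Vocabulary(counts, unk_cutoff=2, unk_label='<oov>')

print(vocab['a'])         # 3
print(vocab.lookup('b'))  # '<oov>': count 1 is below the cutoff
```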
- property cutoff
Items with count below this value are not considered part of vocabulary.
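A minimal sketch of reading this property, assuming a constructor call like the ones above:

```python
from nltk.lm import Vocabulary

vocab = Vocabulary(['a', 'a', 'b'], unk_cutoff=2)
print(vocab.cutoff)   # 2, the unk_cutoff passed to __init__
print('b' in vocab)   # False: its count of 1 is below vocab.cutoff
```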
- update(*counter_args, **counter_kwargs)
Update vocabulary counts.
Wraps collections.Counter.update method.
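Since it wraps collections.Counter.update, the method accepts the same argument shapes: an iterable of tokens, a mapping of token counts, or keyword arguments. A sketch:

```python
from collections import Counter
from nltk.lm import Vocabulary

vocab = Vocabulary(['a', 'a'], unk_cutoff=2)

vocab.update(['b', 'b'])         # iterable of tokens
vocab.update(Counter({'c': 2}))  # mapping of token -> count
vocab.update(d=2)                # keyword arguments, Counter-style

# Every token now has count 2, meeting the cutoff.
print(sorted(w for w in ('a', 'b', 'c', 'd') if w in vocab))
# ['a', 'b', 'c', 'd']
```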
- lookup(words)
Look up one or more words in the vocabulary.
If passed one word as a string, will return that word or self.unk_label. Otherwise, will assume it was passed a sequence of words, look each of them up, and return a tuple of the looked-up words.
words (Iterable(str) or str) – Word(s) to look up.
- Return type
tuple(str) or str
- Raises
TypeError – for types other than strings or iterables.
>>> from nltk.lm import Vocabulary
>>> vocab = Vocabulary(["a", "b", "c", "a", "b"], unk_cutoff=2)
>>> vocab.lookup("a")
'a'
>>> vocab.lookup("aliens")
'<UNK>'
>>> vocab.lookup(["a", "b", "c", ["x", "b"]])
('a', 'b', '<UNK>', ('<UNK>', 'b'))