Sample usage for gensim
Demonstrate word embedding using Gensim
>>> from nltk.test.gensim_fixt import setup_module
>>> setup_module()
We demonstrate three functions:
- Train word embeddings using the Brown Corpus;
- Load the pre-trained model and perform simple tasks; and
- Prune the pre-trained binary model.
>>> import gensim
Train the model
Here we train a word embedding using the Brown Corpus:
>>> from nltk.corpus import brown
>>> train_set = brown.sents()[:10000]
>>> model = gensim.models.Word2Vec(train_set)
Training the model might take some time, so once it is trained it can be saved as follows:
>>> model.save('brown.embedding')
>>> new_model = gensim.models.Word2Vec.load('brown.embedding')
The model maps each word in its vocabulary to an embedding vector, so we can easily get the vector representation of a word.
>>> len(new_model.wv['university'])
100
Gensim already provides several supporting functions for manipulating word embeddings. For example, to compute the cosine similarity between two words:
>>> new_model.wv.similarity('university', 'school') > 0.3
True
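The same score can also be computed directly from the raw vectors. This is only a sketch, assuming NumPy is installed (it is a gensim dependency); similarity() is simply the cosine of the angle between the two vectors:

import numpy
v1 = new_model.wv['university']
v2 = new_model.wv['school']
cosine = numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))
# cosine should agree with new_model.wv.similarity('university', 'school')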
Using the pre-trained model
NLTK includes a pre-trained model that is a pruned subset of a model trained on 100 billion words from the Google News dataset. The full model is available from https://code.google.com/p/word2vec/ (about 3 GB).
>>> from nltk.data import find
>>> word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
>>> model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)
We pruned the model to only include the most common words (~44k words).
>>> len(model.vocab)
43981
Each word is represented by a vector in a 300-dimensional space:
>>> len(model['university'])
300
Finding the top n words most similar to a target word is simple. The result is a list of n (word, similarity score) pairs.
>>> model.most_similar(positive=['university'], topn=3)
[('universities', 0.70039...), ('faculty', 0.67809...), ('undergraduate', 0.65870...)]
Finding the word that does not belong in a list is also supported, although implementing it yourself would be simple.
>>> model.doesnt_match('breakfast cereal dinner lunch'.split())
'cereal'
Mikolov et al. (2013) showed that word embeddings capture many syntactic and semantic regularities. For example, the vector ‘King - Man + Woman’ is close to ‘Queen’, and ‘Germany - Berlin + Paris’ is close to ‘France’.
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.71181...)]
>>> model.most_similar(positive=['Paris', 'Germany'], negative=['Berlin'], topn=1)
[('France', 0.78840...)]
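The same analogy can be reproduced with explicit vector arithmetic. The sketch below uses KeyedVectors.similar_by_vector to look up the nearest neighbours of the composed vector; unlike most_similar, it does not filter out the query words, so 'king' itself typically comes first:

vec = model['king'] - model['man'] + model['woman']
# Nearest neighbours of the composed vector; most_similar() excludes the
# input words, which is why 'queen' appears at the top of its results.
print(model.similar_by_vector(vec, topn=5))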
We can visualize the word embeddings using t-SNE (https://lvdmaaten.github.io/tsne/). For this demonstration, we visualize the first 1000 words.
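A minimal sketch of such a plot, assuming scikit-learn (for its TSNE implementation) and matplotlib are installed; neither is required by NLTK, and the attribute that lists the vocabulary differs between gensim versions:

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# First 1000 words of the vocabulary (use model.index_to_key in gensim >= 4.0).
words = model.index2word[:1000]
vectors = np.array([model[w] for w in words])

# Project the 300-dimensional vectors down to 2 dimensions with t-SNE.
coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=2)
for word, (x, y) in zip(words[:100], coords[:100]):  # label only a subset for readability
    plt.annotate(word, (x, y), fontsize=6)
plt.show()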
Prune the trained binary model
Below is supporting code to extract part of the binary model (GoogleNews-vectors-negative300.bin.gz) from https://code.google.com/p/word2vec/. Code along these lines was used to produce the word2vec_sample model.
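This is only a sketch: it assumes the full binary model has been downloaded locally, and it keeps the words that also occur in the Brown Corpus as a proxy for the most common words; the exact selection used to build word2vec_sample may differ.

import gensim
from nltk.corpus import brown

# Load the full binary model (about 3 GB, so this needs plenty of memory).
full_model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

# Keep only the words that also occur in the Brown Corpus.
words = set(brown.words())
kept = [w for w in words if w in full_model]  # 'in' tests vocabulary membership

# Write the pruned vectors in the word2vec text format:
# the first line is '<vocabulary size> <dimensions>', then one word per line.
with open('pruned.word2vec.txt', 'w') as out:
    out.write('{} {}\n'.format(len(kept), full_model.vector_size))
    for word in kept:
        vector = ' '.join(str(value) for value in full_model[word])
        out.write('{} {}\n'.format(word, vector))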