Earlier I wrote a Kaggle notebook about semantic search using word vectors. The gist of it is that if you represent words as meaningful vectors, you can take the cosine similarity between a pair of words to see how semantically similar they are.
A classic party trick with word vectors is to find the difference between two word vectors and add it to another word to get a fourth word with the same relationship:
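For intuition, here's a minimal NumPy sketch of both tricks, assuming you already have a `vectors` dict mapping words to NumPy arrays (the helper names here are mine, not from the notebook):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1 means the words
    # appear in similar contexts, close to 0 means they're unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(vectors, a, b, c):
    # Solve "a is to b as c is to ?": take the difference b - a, add it to c,
    # and return the nearest word (excluding the three inputs).
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: cosine_similarity(v, target)
                  for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

# With good pre-trained vectors, analogy(vectors, "man", "king", "woman")
# tends to return "queen".
```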
This content, and the graphics, were from the first course in deeplearning.ai's Natural Language Processing specialization. The course's discussion of the word vectors themselves was fairly wishy-washy at that point, supplying a set of pre-trained vectors that had been "obtained through a machine learning algorithm."
A month later I made it further in the course and actually learned how to train word embeddings from scratch, or at least learned one way to do it. It's surprisingly straightforward but used by models as robust as Google's word2vec, so I thought I'd write a quick blog post about it.
For a computer to understand a word, it needs to represent it quantitatively in some way. A really straightforward way is to make a vector with as many dimensions as there are unique words in your vocabulary, then encode each word as a vector of all zeroes except for a single one in the slot corresponding to that word. This is called one-hot encoding.
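As a toy illustration (my own example, not from the course), here's one-hot encoding for a five-word vocabulary:

```python
import numpy as np

vocab = ["i", "love", "natural", "language", "processing"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # All zeroes except for a single 1 in the slot for this word.
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("language"))  # [0. 0. 0. 1. 0.]
```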
While straightforward to encode, these vectors aren't very useful by themselves. They're super high-dimensional and don't encode anything about the meaning of the words.
A word embedding or word vector is a way to represent words as vectors that are lower-dimensional and actually reflect their meaning. How might we do this?
"You shall know a word by the company it keeps." J.R. Firth, English linguist. via the NLP course
The continuous bag-of-words (CBOW) model represents each unique word in a corpus by the words that tend to appear around it.
Specifically, every time a word appears in the corpus, CBOW looks at the window of words immediately surrounding it (a few words on either side) and learns to predict the center word from that context.
To turn the corpus into training data, we can generate a pair of vectors for every possible window in the corpus. The first vector is simply a one-hot encoding of the center word. The second vector is the average of the one-hot encodings of all other words in its window.
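Reusing the toy `one_hot` helper from above, the windowing step might look like this sketch (the half-window of two words on each side is an arbitrary choice of mine):

```python
def training_pairs(tokens, C=2):
    # For each center word, pair the average of the one-hot encodings of the
    # C words on either side with the one-hot encoding of the center word.
    for i in range(C, len(tokens) - C):
        context_words = tokens[i - C:i] + tokens[i + 1:i + C + 1]
        context = np.mean([one_hot(w) for w in context_words], axis=0)
        center = one_hot(tokens[i])
        yield context, center

corpus = "i love natural language processing".split()
pairs = list(training_pairs(corpus))  # one (context, center) pair for this tiny corpus
```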
Now feed the second vector (the average of the bag of words) into the following neural network:
It's a really straightforward neural network. The first layer has as many units as there are words in the corpus vocabulary (call this number $V$).
The input layer is multiplied by a weight matrix $W_1$ (plus a bias and an activation) to give a hidden layer with far fewer units, $N$, which will be the dimension of our embeddings.
Finally, the hidden layer is multiplied by a weight matrix $W_2$ (again plus a bias) to give an output layer with $V$ units, which a softmax turns into a probability for every word in the vocabulary.
The output layer will be compared to the one-hot encoded center word corresponding to the input vector using a loss function. Ultimately, then, this is an easy-to-train three-layer neural network with no frills to it.
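To make that concrete, here's a rough NumPy sketch of one forward pass and the loss, continuing the toy example above; the ReLU hidden activation, softmax output, embedding size, and initialization are my choices for illustration and may differ from the course's exact setup:

```python
rng = np.random.default_rng(0)
V, N = len(vocab), 50                      # vocabulary size, embedding dimension
W1, b1 = 0.01 * rng.normal(size=(N, V)), np.zeros(N)
W2, b2 = 0.01 * rng.normal(size=(V, N)), np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x):
    # x is the averaged bag-of-words vector (length V).
    h = np.maximum(0, W1 @ x + b1)         # hidden layer, length N
    return softmax(W2 @ h + b2)            # predicted distribution over the vocab

def cross_entropy(y_hat, y):
    # y is the one-hot encoding of the true center word.
    return -np.sum(y * np.log(y_hat + 1e-12))
```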
Unlike with a normal neural network, though, we ultimately don't care about the network itself as much as the weights that it learns.
Remember that our original goal was to find a lower-dimensional vector that encodes the context that a given word appears in, and thereby its meaning.
Realize that each column of $W_1$ has exactly $N$ entries and corresponds to exactly one word in the vocabulary: a lower-dimensional vector for that word, shaped by the contexts it showed up in during training. Each row of $W_2$ has the same property.
Create a new matrix $W_3 = \frac{1}{2}(W_1 + W_2^T)$ by averaging $W_1$ with the transpose of $W_2$, so that both sets of learned weights contribute.
We can set each word's embedding to be the corresponding column of $W_3$, and we have our word vectors.
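Continuing the sketch above, the extraction step is just a couple of lines:

```python
# Average W1 with the transpose of W2 so both sets of learned weights
# contribute; each column of the result is one word's embedding.
W3 = 0.5 * (W1 + W2.T)                     # shape (N, V)
embeddings = {word: W3[:, i] for word, i in word_to_index.items()}
```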
And that's it! I haven't implemented CBOW or any other word embedding model myself yet, but just learning it made the idea of word embeddings a lot more concrete and intuitive.