How to create word embeddings using Continuous Bag-of-Words (CBOW)

Samson Zhang
Jul 18, 2022 · 5 min read

Earlier I wrote a Kaggle notebook about semantic search using word vectors. The gist of it is that if you represent words as meaningful vectors, you can take the cosine similarity between a pair of words to see how semantically similar they are.
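
Here's a minimal sketch of that idea in numpy, with a few made-up 3-dimensional vectors standing in for real learned embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional vectors, made up purely for illustration;
# real embeddings are learned and typically have 50-300 dimensions.
vectors = {
    "happy":  np.array([0.9, 0.1, 0.3]),
    "joyful": np.array([0.8, 0.2, 0.4]),
    "sad":    np.array([-0.7, 0.2, 0.1]),
}

print(cosine_similarity(vectors["happy"], vectors["joyful"]))  # close to 1
print(cosine_similarity(vectors["happy"], vectors["sad"]))     # much lower
```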



A classic party trick with word vectors is to find the difference between two word vectors and add it to another word to get a fourth word with the same relationship:
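
The same trick as a sketch, with tiny hand-picked vectors chosen so the classic king - man + woman ≈ queen analogy works out (real embeddings are learned, not hand-picked):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 2-dimensional vectors constructed so that king - man + woman lands on queen.
vectors = {
    "king":  np.array([0.8, 0.9]),
    "queen": np.array([0.8, 0.1]),
    "man":   np.array([0.2, 0.9]),
    "woman": np.array([0.2, 0.1]),
    "apple": np.array([0.1, 0.5]),
}

query = vectors["king"] - vectors["man"] + vectors["woman"]

# Find the closest word to the query vector, skipping the words we started with.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
print(max(candidates, key=lambda w: cosine_similarity(query, candidates[w])))  # "queen"
```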



This content and the graphics came from the first course in deeplearning.ai's Natural Language Processing specialization. The discussion of word vectors themselves was fairly wishy-washy at that point, supplying a set of pre-trained vectors that were "obtained through a machine learning algorithm."

A month later I made it further in the course and actually learned how to train word embeddings from scratch, or at least learned one way to do it. It's surprisingly straightforward but used by models as robust as Google's word2vec, so I thought I'd write a quick blog post about it.

What are word embeddings?

For a computer to understand a word, it needs to represent the word quantitatively in some way. A really straightforward way is to make a vector with as many dimensions as you have unique words in your vocabulary, then encode each word as a vector of all zeroes except for a single one in the slot corresponding to that word. This is called one-hot encoding.
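
For example, here's what one-hot encoding looks like over a tiny made-up vocabulary:

```python
import numpy as np

# A tiny made-up vocabulary; a real corpus has tens of thousands of unique words.
vocab = ["am", "because", "happy", "i", "learning"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # All zeroes except for a single 1 in the slot corresponding to the word.
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("happy"))  # [0. 0. 1. 0. 0.]
```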



While straightforward to encode, these vectors aren't very useful by themselves. They're super high-dimensional and don't encode anything about the meaning of the words.

A word embedding or word vector is a way to represent words as vectors that are lower-dimensional and actually reflect their meaning. How might we do this?

Continuous Bag-of-Words

"You shall know a word by the company it keeps." J.R. Firth, English linguist. via the NLP course

The continuous bag-of-words model represents each unique word in a corpus by the words that tend to appear around it.

Specifically, for every time that a word appears in a corpus, CBOW looks at the $c$ words that appear before and after it. In the example below, $c = 2$, and "I", "am", "because" and "I" are the four words in the "bag of words" corresponding to "happy".



Training CBOW

To turn the corpus into training data, we can generate a pair of vectors for every possible window in the corpus. The first vector is simply a one-hot encoding of the center word. The second vector is the average of the one-hot encodings of all other words in its window.
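
As a sketch of that preprocessing step, again with a toy sentence (the helper names here are mine, not from the course):

```python
import numpy as np

tokens = "i am happy because i am learning".split()
vocab = sorted(set(tokens))
word_to_index = {w: i for i, w in enumerate(vocab)}
v = len(vocab)

def one_hot(word):
    vec = np.zeros(v)
    vec[word_to_index[word]] = 1.0
    return vec

def training_pairs(tokens, c=2):
    # For each full window: x = average of the context words' one-hot encodings,
    #                       y = one-hot encoding of the center word.
    for i in range(c, len(tokens) - c):
        context = tokens[i - c:i] + tokens[i + 1:i + 1 + c]
        x = np.mean([one_hot(w) for w in context], axis=0)
        y = one_hot(tokens[i])
        yield x, y

for x, y in training_pairs(tokens):
    print(np.round(x, 2), y)
```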



Now feed the second vector (the average of the bag of words) into the following neural network:



It's a really straightforward neural network. The first layer has as many units as there are words in the corpus vocabulary (call this number $v$), and simply takes on the value of the input bag-of-words vector.

The input layer is multiplied by a weight matrix $W_1$ of dimensions $n \times v$, a bias vector $b_1$ of dimensions $n \times 1$ is added, and the result is put through a ReLU function to get the values of a hidden layer, which has $n$ units.

Finally, the hidden layer is multiplied by a weight matrix $W_2$ of dimensions $v \times n$, a bias vector $b_2$ of dimensions $v \times 1$ is added, and the result is put through a softmax function to get the values of the output layer, which again has $v$ units.

The output layer is compared to the one-hot encoded center word corresponding to the input vector using a loss function. Ultimately, then, this is an easy-to-train three-layer neural network with no frills to it.
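
Here's a compact numpy sketch of that forward pass, using the dimensions described above and cross-entropy as one common choice of loss (the toy sizes and random parameters are just for illustration):

```python
import numpy as np

v, n = 5, 3                    # vocabulary size and embedding dimension (toy values)
rng = np.random.default_rng(0)

# Parameters with the dimensions described above.
W1 = rng.normal(size=(n, v))   # n x v
b1 = rng.normal(size=(n, 1))   # n x 1
W2 = rng.normal(size=(v, n))   # v x n
b2 = rng.normal(size=(v, 1))   # v x 1

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())    # subtract the max for numerical stability
    return e / e.sum()

def forward(x):
    # x is the v x 1 averaged bag-of-words vector.
    h = relu(W1 @ x + b1)          # hidden layer, n x 1
    y_hat = softmax(W2 @ h + b2)   # output layer, v x 1
    return y_hat

x = np.full((v, 1), 1 / v)         # stand-in input: a uniform bag-of-words average
y = np.zeros((v, 1)); y[2] = 1.0   # one-hot encoded center word

y_hat = forward(x)
loss = -float(np.sum(y * np.log(y_hat)))   # cross-entropy, a common choice of loss
print(loss)
```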

Unlike a normal neural network, though, we ultimately don't care about the network itself as much as the weights that it trains.

Remember that our original goal was to find a lower-dimensional vector that encodes the context that a given word appears in, and thereby its meaning.

Realize that each column of $W_1$ and each row of $W_2$ is an $n$-dimensional vector corresponding to a word in the corpus vocabulary.

Create a new matrix $W_3 = \frac{W_1 + W_2^T}{2}$. Now each column of $W_3$ is a single $n \times 1$ vector that contains all the information from our neural network about the context that the corresponding word appears in.

We can set $n$ to whatever we want. If we set it to a number smaller than $v$, then we'll have achieved our goal of creating lower-dimensional word embeddings.
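
As a final sketch, extracting an embedding is just a column lookup into $W_3$; the toy shapes and untrained random weights below are stand-ins for what training would produce:

```python
import numpy as np

# Toy shapes matching the description above: W1 is n x v, W2 is v x n.
v, n = 5, 3
rng = np.random.default_rng(0)
W1 = rng.normal(size=(n, v))
W2 = rng.normal(size=(v, n))
word_to_index = {"am": 0, "because": 1, "happy": 2, "i": 3, "learning": 4}

# Average the two learned weight matrices into one n x v embedding matrix.
W3 = (W1 + W2.T) / 2

def embedding(word):
    # Each column of W3 is the n-dimensional embedding of one vocabulary word.
    return W3[:, word_to_index[word]]

print(embedding("happy"))   # an n-dimensional vector (untrained here, so random)
```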

And that's it! I haven't implemented CBOW or any other word embedding model myself yet, but just learning it made the idea of word embeddings a lot more concrete and intuitive.


