All content here is adapted from the CS224N course notes. Please see CS224N for the full details!
1. Intro
How can we predict a center word from the surrounding context in terms of word vectors?
One approach is to treat {"The", "cat", "over", "the", "puddle"} as a context and, from these words, predict or generate the center word "jumped".
We define the following notation (a small NumPy sketch of these objects follows the list):
- $x^{(c)}$ : The input one-hot vectors, i.e., the context
- $y^{(c)}$ : The output, which we just call $y$; it is the one-hot vector of the known center word
- $w_i$: Word i from vocabulary V
- $\nu \in \mathbb{R}^{n \times \lvert V\lvert}$ : Input word matrix
- ${v}_i$: i-th column of $\nu$ , the input vector representation of word $w_i$
- $u \in \mathbb{R}^{\lvert V\lvert\times n}$ : Output word matrix
- $u_i$: i-th row of $u$, the output vector representation of word $w_i$
- Note that we do in fact learn two vectors for every word $w_i$
- $n$ is an arbitrary size that defines the dimension of our embedding space
- ${v}_i$: input word vector, when the word is in the context
- $u_i$: output word vector, when the word is in the center
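To make the notation concrete, here is a minimal NumPy sketch. The toy vocabulary, the embedding size $n = 3$, and the random initialization are assumptions for illustration only: the columns of the input matrix are the vectors $v_i$, the rows of the output matrix are the vectors $u_i$, and multiplying the input matrix by a one-hot vector simply selects a column.

```python
import numpy as np

# Toy vocabulary and embedding size (assumed for illustration only).
vocab = ["the", "cat", "jumped", "over", "puddle"]
V_size = len(vocab)      # |V|
n = 3                    # embedding dimension

rng = np.random.default_rng(0)
V_in = rng.normal(size=(n, V_size))    # input word matrix (nu above), shape n x |V|
U_out = rng.normal(size=(V_size, n))   # output word matrix (u above), shape |V| x n

# One-hot vector for word i; multiplying by the input matrix just selects
# column i, i.e. the input vector v_i of word w_i.
i = vocab.index("cat")
x = np.zeros(V_size)
x[i] = 1.0
v_i = V_in @ x
assert np.allclose(v_i, V_in[:, i])

u_i = U_out[i]                         # output vector u_i is row i of the output matrix
```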
2. Steps
- Generate our one-hot word vectors for the input context of size $m$: $(x^{(c-m)}, \dots, x^{(c-1)}, x^{(c+1)}, \dots, x^{(c+m)}) \in \mathbb{R}^{\lvert V\lvert}$
- Get the embedded word vectors for the context: $v_{c-m} = \nu x^{(c-m)},\ v_{c-m+1} = \nu x^{(c-m+1)},\ \dots,\ v_{c+m} = \nu x^{(c+m)} \in \mathbb{R}^n$
- Average these vectors to get $\hat{v} = \dfrac{v_{c-m}+v_{c-m+1}+\dots+v_{c+m}}{2m} \in \mathbb{R}^n$
- Generate a score vector $z = u\hat{v} \in \mathbb{R}^{\lvert V\lvert}$. Since the dot product of similar vectors is higher, training pushes similar words close to each other in order to achieve a high score.
- Turn the scores into probabilities $\hat{y} = \text{softmax}(z) \in \mathbb{R}^{\lvert V\lvert}$, whose $i$-th component is \[\hat{y}_i = \dfrac{ e^{z_i} }{ \displaystyle\sum_{k=1}^{\lvert V\lvert}e^{z_k} }\]
- Exponentiating makes every entry positive, and dividing by $\displaystyle\sum_{k=1}^{\lvert V\lvert}e^{z_k}$ normalizes the vector $(\textstyle\sum_{k=1}^{\lvert V\lvert}\hat y_k=1)$ so that it gives a probability distribution.
- We desire our probabilities generated, $\hat{y}\in \mathbb{R}^{\lvert V\lvert}$, to match the true probabilities, $y\in \mathbb{R}^{\lvert V\lvert}$, which also happens to be the one-hot vector of the actual word.
- We use a popular choice of distance/loss measure, cross entropy $H(\hat{y},y)$:
$H(\hat{y}, y) = -\displaystyle\sum_{j=1}^{\lvert V\lvert}y_j\log(\hat{y}_j)$
- Because $y$ is a one-hot vector, the above loss simplifies to simply:
$H(\hat{y}, y) = -y_c\log(\hat{y}_c)$
- In this formulation, $c$ is the index where the correct word's one-hot vector is 1. We can now consider the case where our prediction was perfect, so that $\hat{y}_c = 1$. We can then calculate $H(\hat{y}, y) = -1\log(1) = 0$. Thus, for a perfect prediction, we face no penalty or loss.
- We use stochastic gradient descent to update all relevant word vectors $u_c$ and $v_j$ (see the sketch after this list).
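Putting the steps together, below is a minimal NumPy sketch of one CBOW training step on the running example. The toy vocabulary, the use of the whole remaining sentence as the context, and the learning rate are assumptions for illustration; the softmax/cross-entropy gradient ($\hat{y} - y$) is filled in here because the bullets above do not spell out the update. This is a sketch of the computation, not the optimized word2vec implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed): vocabulary, embedding size n, and parameter matrices.
vocab = ["the", "cat", "jumped", "over", "puddle"]
word2idx = {w: i for i, w in enumerate(vocab)}
V_size, n = len(vocab), 3
V_in = rng.normal(scale=0.1, size=(n, V_size))   # input word matrix; columns are v_i
U_out = rng.normal(scale=0.1, size=(V_size, n))  # output word matrix; rows are u_i

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Steps 1-2: one-hot context words -> their input vectors (columns of the input matrix).
context = ["the", "cat", "over", "the", "puddle"]   # context of the center word "jumped"
center = word2idx["jumped"]
ctx_idx = [word2idx[w] for w in context]

# Step 3: average the context vectors (here over however many context words we have).
v_hat = V_in[:, ctx_idx].mean(axis=1)               # shape (n,)

# Steps 4-5: score vector z = U v_hat, then softmax to get y_hat.
z = U_out @ v_hat                                   # shape (|V|,)
y_hat = softmax(z)

# Cross-entropy against the one-hot target reduces to -log(y_hat[c]).
loss = -np.log(y_hat[center])
print(f"loss before update: {loss:.4f}")

# One SGD step. dL/dz = y_hat - y is the standard softmax + cross-entropy gradient.
y = np.zeros(V_size)
y[center] = 1.0
dz = y_hat - y
grad_U = np.outer(dz, v_hat)          # dL/dU,     shape (|V|, n)
grad_v_hat = U_out.T @ dz             # dL/dv_hat, shape (n,)

lr = 0.1                              # learning rate (assumed)
U_out -= lr * grad_U
for j in ctx_idx:                     # each context vector gets its share of the averaged gradient
    V_in[:, j] -= lr * grad_v_hat / len(ctx_idx)
```

Repeating this update over many (context, center) pairs drives $\hat{y}$ toward the one-hot target $y$, i.e., the cross-entropy loss toward zero.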
Reference. cs224n-2019-notes01-wordvecs1