All content here is adapted from CS224N. Please see CS224N for the details!
1. Intro
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP.
2. Previous Methods
- Count-based methods rely on matrix factorization (e.g. Latent Semantic Analysis (LSA), Hyperspace Analogue to Language (HAL)).
- They effectively leverage global statistical information, but they are primarily used to capture word similarities and do poorly on tasks such as word analogy, indicating a sub-optimal vector space structure.
- Shallow window-based methods (e.g. the skip-gram and CBOW models) learn word embeddings by making predictions in local context windows.
- They have the capacity to capture complex linguistic patterns beyond word similarity, but fail to make use of the global co-occurrence statistics.
- GloVe consists of a weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics.
- It uses global statistics to predict the probability of word $j$ appearing in the context of word $i$ with a least squares objective.
3. Co-occurrence Matrix
- $X$: word-word co-occurrence matrix
- $X_{ij}$: the number of times word $j$ occurs in the context of word $i$
- $X_i = \sum_k X_{ik}$: the number of times any word $k$ appears in the context of word $i$
- $P_{ij} = P(w_j \mid w_i) = \dfrac{X_{ij}}{X_i}$: the probability of word $j$ appearing in the context of word $i$
- $V \in \mathbb{R}^{n \times \lvert V \rvert}$ and $U \in \mathbb{R}^{\lvert V \rvert \times n}$, where $n$ is an arbitrary size which defines the size of our embedding space.
- $V$ is the input word matrix: the $i$-th column of $V$ is the $n$-dimensional embedded vector for word $w_i$ when it is an input to the model. We denote this $n \times 1$ vector as $v_i$.
- $U$ is the output word matrix: the $j$-th row of $U$ is the $n$-dimensional embedded vector for word $w_j$ when it is an output of the model. We denote this row of $U$ as $u_j$.
- Note that we do in fact learn two vectors for every word $w_i$ (i.e. an input word vector $v_i$ and an output word vector $u_i$).
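To make these definitions concrete, here is a minimal sketch of how $X$, $X_i$, and $P_{ij}$ could be built. The toy corpus, window size, and variable names are my own illustrative assumptions, not from CS224N:

```python
import numpy as np

# Toy corpus and symmetric context window (illustrative choices).
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: k for k, w in enumerate(vocab)}

# X[i, j]: number of times word j occurs in the context of word i.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for c, center in enumerate(sent):
        lo, hi = max(0, c - window), min(len(sent), c + window + 1)
        for o in range(lo, hi):
            if o != c:
                X[idx[center], idx[sent[o]]] += 1

# X_i = sum_k X_ik (row sums); P_ij = X_ij / X_i.
X_i = X.sum(axis=1, keepdims=True)
P = np.divide(X, X_i, out=np.zeros_like(X), where=X_i > 0)
print(P[idx["i"], idx["like"]])  # P(like | i)
```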
4. Least Squares Objective
$\checkmark$ Explanation in CS224N
- Softmax: computes the probability that word $j$ appears in the context of word $i$:
\[Q_{ij} = \dfrac{\exp(\overrightarrow{u_j}^T \overrightarrow{v_i})}{\displaystyle\sum_{w=1}^W \exp(\overrightarrow{u_w}^T \overrightarrow{v_i})}\]
- Global cross-entropy loss:
\[J = -\displaystyle\sum_{i \in corpus} \displaystyle\sum_{j \in context(i)} \log Q_{ij}\]
Since the same words $i$ and $j$ can appear multiple times in the corpus, it is more efficient to first group together the same values for $i$ and $j$:
\[J = -\displaystyle\sum_{i=1}^W \displaystyle\sum_{j=1}^W X_{ij}\log Q_{ij}\]
where the co-occurrence frequency is given by the co-occurrence matrix $X$.
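A dense toy computation of $Q_{ij}$ and this grouped loss might look as follows; the sizes and random initialization are purely illustrative assumptions. Note that every row of $Q$ requires a sum over the whole vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
W, n = 5, 10                                  # toy vocabulary and embedding sizes
U = rng.normal(size=(W, n))                   # output vectors u_j as rows
V = rng.normal(size=(n, W))                   # input vectors v_i as columns
X = rng.integers(0, 5, size=(W, W)).astype(float)  # toy co-occurrence counts

scores = (U @ V).T                            # scores[i, j] = u_j^T v_i
# Softmax over the full vocabulary for each row i: the normalization
# term sums W exponentials, which is what makes this objective expensive.
Q = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Grouped cross-entropy loss: J = -sum_ij X_ij log Q_ij.
J = -(X * np.log(Q)).sum()
print(J)
```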
Drawback: cross-entropy loss requires the distribution $Q$ to be properly normalized, which involves an expensive summation over the entire vocabulary.
Instead, we use a least squares objective in which the normalization factors in $P$ and $Q$ are discarded:
\[\hat{J} = \displaystyle\sum_{i=1}^W \displaystyle\sum_{j=1}^W X_{i}(\hat{P}_{ij}-\hat{Q}_{ij})^2\]
where
\[\hat{P}_{ij} = X_{ij} \quad \text{and} \quad \hat{Q}_{ij}=\exp(\overrightarrow{u_j}^T \overrightarrow{v_i})\]
are the unnormalized distributions.
Problem: $X_{ij}$ often takes on very large values, which makes the optimization difficult.
Effective change: minimize the squared error of the logarithms of $\hat{P}$ and $\hat{Q}$ instead:
\[\hat{J} = \displaystyle\sum_{i=1}^W\displaystyle\sum_{j=1}^W X_i(\log\hat{P}_{ij} - \log\hat{Q}_{ij})^2 = \displaystyle\sum_{i=1}^W\displaystyle\sum_{j=1}^W X_i(\overrightarrow{u_j}^T \overrightarrow{v_i}-\log X_{ij})^2\]
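Continuing the same toy setup as above, here is a sketch of this log-based objective. Skipping entries with $X_{ij} = 0$, where the logarithm is undefined, is my own handling for the sketch, not something the notes specify:

```python
import numpy as np

rng = np.random.default_rng(0)
W, n = 5, 10
U = rng.normal(size=(W, n))                   # output vectors u_j as rows
V = rng.normal(size=(n, W))                   # input vectors v_i as columns
X = rng.integers(0, 5, size=(W, W)).astype(float)

scores = (U @ V).T                            # scores[i, j] = u_j^T v_i
X_i = X.sum(axis=1)                           # weighting factor X_i

# J_hat = sum_ij X_i (u_j^T v_i - log X_ij)^2, restricted to X_ij > 0
# because log(0) is undefined (an assumption for this sketch).
ii, jj = np.nonzero(X)
J_hat = (X_i[ii] * (scores[ii, jj] - np.log(X[ii, jj])) ** 2).sum()
print(J_hat)
```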
- Another observation: the weighting factor $X_i$ is not guaranteed to be optimal. Instead, we introduce a more general weighting function $f$, which we are free to take to depend on the context word as well:
\[\hat{J} = \displaystyle\sum_{i=1}^W\displaystyle\sum_{j=1}^W f(X_{ij})(\overrightarrow{u_j}^T \overrightarrow{v_i}-\log X_{ij})^2\]
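For $f$, the GloVe paper proposes $f(x) = (x/x_{\max})^{\alpha}$ for $x < x_{\max}$ and $f(x) = 1$ otherwise, reporting that $x_{\max} = 100$ and $\alpha = 3/4$ work well. Below is a sketch of the resulting objective under the same toy setup; the paper's bias terms are omitted here, matching the formula above:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    # Weighting from the GloVe paper: grows as x^alpha, capped at 1,
    # and gives f(0) = 0 so unseen pairs contribute nothing.
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

rng = np.random.default_rng(0)
W, n = 5, 10
U = rng.normal(size=(W, n))                   # output vectors u_j as rows
V = rng.normal(size=(n, W))                   # input vectors v_i as columns
X = rng.integers(0, 5, size=(W, W)).astype(float)

scores = (U @ V).T                            # scores[i, j] = u_j^T v_i
# Final objective: sum over X_ij > 0 of f(X_ij) (u_j^T v_i - log X_ij)^2.
ii, jj = np.nonzero(X)
J = (f(X[ii, jj]) * (scores[ii, jj] - np.log(X[ii, jj])) ** 2).sum()
print(J)
```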
$\checkmark$ Explanation in Korean
- Link to https://wikidocs.net/22885
5. Conclusion
- GloVe consistently outperforms word2vec on the word analogy task, given the same corpus, vocabulary, window size, and training time.
- It achieves better results faster, and also obtains the best results irrespective of speed.