Skip-gram
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Intro
Create a model such that, given the center word “jumped”, it can predict or generate the surrounding words “The”, “cat”, “over”, “the”, “puddle”.
→ Predicts the distribution (probability) of context words from a center w...
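Below is a minimal sketch of this scoring step, assuming a made-up toy vocabulary and randomly initialized embedding matrices (none of this is CS224N assignment code): the center word's vector is dotted with every context vector and softmaxed into a distribution over the vocabulary.

```python
import numpy as np

# Hypothetical toy vocabulary and randomly initialized embeddings.
vocab = ["the", "cat", "jumped", "over", "puddle"]
word2id = {w: i for i, w in enumerate(vocab)}
dim = 8
rng = np.random.default_rng(0)
center_vecs = rng.normal(size=(len(vocab), dim))   # v_c: center-word vectors
context_vecs = rng.normal(size=(len(vocab), dim))  # u_o: context-word vectors

def skipgram_context_probs(center_word):
    """P(o | c) = softmax(u_o . v_c), computed over the whole vocabulary."""
    v_c = center_vecs[word2id[center_word]]
    scores = context_vecs @ v_c                 # one score per vocabulary word
    exp = np.exp(scores - scores.max())         # numerically stable softmax
    return exp / exp.sum()

# Given the center word "jumped", score every word as a possible context word.
probs = skipgram_context_probs("jumped")
for w, p in zip(vocab, probs):
    print(f"P({w!r} | 'jumped') = {p:.3f}")
```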
Negative Sampling
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Intro
It is inefficient to sample everything from the whole word dictionary.
→ Instead, sample only a small, local subset of words
Problem: The summation over $\lvert V \rvert$ is computationally huge! Any update we do or evaluation of the objective ...
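A rough sketch of the fix, with hypothetical embedding matrices and a uniform noise distribution (word2vec actually uses a smoothed unigram distribution): each update touches only the positive pair plus $K$ sampled negatives, so it costs $K+1$ dot products instead of $\lvert V \rvert$.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, K = 10_000, 8, 5                       # vocab size, embedding dim, #negatives
center_vecs = rng.normal(size=(V, dim))        # hypothetical, untrained embeddings
context_vecs = rng.normal(size=(V, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center_id, context_id, noise_dist):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c) over K sampled negatives."""
    v_c = center_vecs[center_id]
    u_o = context_vecs[context_id]
    neg_ids = rng.choice(V, size=K, p=noise_dist)        # sample K "noise" words
    pos_term = -np.log(sigmoid(u_o @ v_c))
    neg_term = -np.log(sigmoid(-context_vecs[neg_ids] @ v_c)).sum()
    return pos_term + neg_term                           # no sum over the full vocabulary

uniform = np.full(V, 1.0 / V)                            # stand-in noise distribution
print(neg_sampling_loss(center_id=3, context_id=7, noise_dist=uniform))
```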
Hierarchical softmax
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Intro
In practice, hierarchical softmax tends to be better for infrequent words, while negative sampling works better for frequent words and lower-dimensional vectors.
Hierarchical softmax uses a binary tree to represent all words in the vocabulary.
2...
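For reference, the standard formulation (from Mikolov et al., 2013): the probability of a word is a product of sigmoid decisions taken along its root-to-leaf path in the binary tree, so one evaluation costs $O(\log \lvert V \rvert)$ instead of $O(\lvert V \rvert)$.

\[P(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\left( s_j \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right)\]

Here $n(w, j)$ is the $j$-th node on the path from the root to the leaf $w$, $L(w)$ is the length of that path, and $s_j = +1$ if $n(w, j+1)$ is the left child of $n(w, j)$ and $-1$ otherwise.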
GloVe (Global Vectors for Word Representation)
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Intro
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014
2. Previous Method
Count-based methods that rely on matrix factorization (e.g., Latent Semantic Analysis (LSA), Hyperspace Analogue to Language (HAL))
Effectively leverage gl...
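For reference, the weighted least-squares objective GloVe minimizes over the global co-occurrence counts $X_{ij}$ (Pennington et al., 2014):

\[J = \sum_{i,j=1}^{\lvert V \rvert} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}\]

where $f$ is a weighting function that limits the influence of very frequent co-occurrences.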
Evaluation of Word Vectors
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Intrinsic and Extrinsic Evaluation
$\checkmark$ Intrinsic evaluation
Evaluation on a specific, intermediate task
Fast to compute performance
Helps understand subsystem
Needs positive correlation with real task to determine usefulness
To train a ...
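As a concrete intrinsic evaluation, here is a sketch of the word-analogy task (man : woman :: king : ?). The `vectors` dict and the commented-out `load_glove` loader are hypothetical placeholders for pretrained word vectors.

```python
import numpy as np

def analogy(vectors, a, b, c):
    """Answer 'a : b :: c : ?' by maximizing cosine similarity to (b - a + c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, float("-inf")
    for word, vec in vectors.items():
        if word in (a, b, c):                  # exclude the query words themselves
            continue
        sim = (vec @ target) / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim

# Hypothetical usage with pretrained vectors:
# vectors = load_glove("glove.6B.300d.txt")          # {word: np.ndarray}
# print(analogy(vectors, "man", "woman", "king"))    # ideally ("queen", ...)
```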
Continuous Bag-of-Words (CBOW)
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Intro
How can we predict a center word from the surrounding context in terms of word vectors?
One approach: if we treat {“The”, “cat”, “over”, “the”, “puddle”} as the context, then from these words the model should be able to predict or generate the center word “jum...
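A minimal sketch of this idea with made-up, untrained embeddings (not the CS224N assignment code): average the context word vectors, then score the average against every word as a candidate center word.

```python
import numpy as np

vocab = ["the", "cat", "jumped", "over", "puddle"]
word2id = {w: i for i, w in enumerate(vocab)}
dim = 8
rng = np.random.default_rng(0)
input_vecs = rng.normal(size=(len(vocab), dim))    # v_w: context (input) vectors
output_vecs = rng.normal(size=(len(vocab), dim))   # u_w: center (output) vectors

def cbow_center_probs(context_words):
    """P(center | context) = softmax(U . mean of the context vectors)."""
    h = np.mean([input_vecs[word2id[w]] for w in context_words], axis=0)
    scores = output_vecs @ h
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

probs = cbow_center_probs(["the", "cat", "over", "the", "puddle"])
print(vocab[int(np.argmax(probs))])   # most likely center word under these toy vectors
```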
CS224N W5. Self-Attention and Transformer
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Issue with recurrent models
$\checkmark$ Linear interaction distance
RNNs take O(sequence length) steps for distant word pairs to interact.
Reference: Stanford CS224n, 2021. Before the word 'was', the information from 'chef' has gone through O(sequenc...
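A minimal NumPy sketch of single-head scaled dot-product self-attention (random placeholder weight matrices, not the lecture's Transformer code): every position attends to every other position in a single step, so the maximum interaction distance is O(1) rather than O(sequence length).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
X = rng.normal(size=(seq_len, d_model))            # one vector per token position

# Placeholder projection matrices (these are learned in a real Transformer).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

def self_attention(X):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_model)            # (seq_len, seq_len): all pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
    return weights @ V                             # each output mixes all positions

print(self_attention(X).shape)                     # (6, 16)
```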
CS224N W4. Machine Translation, Sequence to Sequence, and Attention
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Statistical Machine Translation, SMT (1990s-2010s)
“Learn a probabilistic model from data”
We want to find the best English sentence y, given a French sentence x.
\[\operatorname{argmax}_y P(y \mid x)\]
Use Bayes Rule to break this down into two components to be learned sep...
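The decomposition being referred to: since $P(x)$ does not depend on $y$, Bayes' rule splits the search into two separately learned components,

\[\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y P(x \mid y)\, P(y)\]

where $P(x \mid y)$ is the translation model (learned from parallel data) and $P(y)$ is the language model (learned from monolingual English data).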