Skip-gram
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Intro
Create a model such that, given the center word “jumped”, it can predict or generate the surrounding words “The”, “cat”, “over”, “the”, “puddle”.
→ Predicts the distribution (probability) of context words from a center w...
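Below is a minimal sketch of this scoring step, assuming a made-up toy vocabulary and randomly initialized embedding matrices (none of this is CS224N assignment code): the center word's vector is dotted with every context vector and softmaxed into a distribution over the vocabulary.

```python
import numpy as np

# Hypothetical toy vocabulary and randomly initialized embeddings.
vocab = ["the", "cat", "jumped", "over", "puddle"]
word2id = {w: i for i, w in enumerate(vocab)}
dim = 8
rng = np.random.default_rng(0)
center_vecs = rng.normal(size=(len(vocab), dim))   # v_c: center-word vectors
context_vecs = rng.normal(size=(len(vocab), dim))  # u_o: context-word vectors

def skipgram_context_probs(center_word):
    """P(o | c) = softmax(u_o . v_c), computed over the whole vocabulary."""
    v_c = center_vecs[word2id[center_word]]
    scores = context_vecs @ v_c                 # one score per vocabulary word
    exp = np.exp(scores - scores.max())         # numerically stable softmax
    return exp / exp.sum()

# Given the center word "jumped", score every word as a possible context word.
probs = skipgram_context_probs("jumped")
for w, p in zip(vocab, probs):
    print(f"P({w!r} | 'jumped') = {p:.3f}")
```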
Negative Sampling
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Intro
It is inefficient to sample everything from the whole word dictionary.
→ Instead, sample only a small, local subset of words
Problem: The summation over $\lvert V \rvert$ is computationally huge! Any update we do or evaluation of the objective ...
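A rough sketch of the fix, with hypothetical embedding matrices and a uniform noise distribution (word2vec actually uses a smoothed unigram distribution): each update touches only the positive pair plus $K$ sampled negatives, so it costs $K+1$ dot products instead of $\lvert V \rvert$.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, K = 10_000, 8, 5                       # vocab size, embedding dim, #negatives
center_vecs = rng.normal(size=(V, dim))        # hypothetical, untrained embeddings
context_vecs = rng.normal(size=(V, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center_id, context_id, noise_dist):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c) over K sampled negatives."""
    v_c = center_vecs[center_id]
    u_o = context_vecs[context_id]
    neg_ids = rng.choice(V, size=K, p=noise_dist)        # sample K "noise" words
    pos_term = -np.log(sigmoid(u_o @ v_c))
    neg_term = -np.log(sigmoid(-context_vecs[neg_ids] @ v_c)).sum()
    return pos_term + neg_term                           # no sum over the full vocabulary

uniform = np.full(V, 1.0 / V)                            # stand-in noise distribution
print(neg_sampling_loss(center_id=3, context_id=7, noise_dist=uniform))
```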
Hierarchical softmax
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Intro
In practice, hierarchical softmax tends to be better for infrequent words, while negative sampling works better for frequent words and lower-dimensional vectors.
Hierarchical softmax uses a binary tree to represent all words in the vocabulary.
2...
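For reference, the standard formulation (from Mikolov et al., 2013): the probability of a word is a product of sigmoid decisions taken along its root-to-leaf path in the binary tree, so one evaluation costs $O(\log \lvert V \rvert)$ instead of $O(\lvert V \rvert)$.

\[P(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\left( s_j \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right)\]

Here $n(w, j)$ is the $j$-th node on the path from the root to the leaf $w$, $L(w)$ is the length of that path, and $s_j = +1$ if $n(w, j+1)$ is the left child of $n(w, j)$ and $-1$ otherwise.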
GloVe (Global Vectors for Word Representation)
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Intro
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014
2. Previous Method
Count-based methods that rely on matrix factorization (e.g., Latent Semantic Analysis (LSA), Hyperspace Analogue to Language (HAL))
Effectively leverage gl...
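For reference, the weighted least-squares objective GloVe minimizes over the global co-occurrence counts $X_{ij}$ (Pennington et al., 2014):

\[J = \sum_{i,j=1}^{\lvert V \rvert} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}\]

where $f$ is a weighting function that limits the influence of very frequent co-occurrences.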
Evaluation of Word Vectors
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Intrinsic and Extrinsic Evaluation
$\checkmark$ Intrinsic evaluation
Evaluation on a specific, intermediate task
Fast to compute performance
Helps understand subsystem
Needs positive correlation with real task to determine usefulness
To train a ...
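As a concrete intrinsic evaluation, here is a sketch of the word-analogy task (man : woman :: king : ?). The `vectors` dict and the commented-out `load_glove` loader are hypothetical placeholders for pretrained word vectors.

```python
import numpy as np

def analogy(vectors, a, b, c):
    """Answer 'a : b :: c : ?' by maximizing cosine similarity to (b - a + c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, float("-inf")
    for word, vec in vectors.items():
        if word in (a, b, c):                  # exclude the query words themselves
            continue
        sim = (vec @ target) / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim

# Hypothetical usage with pretrained vectors:
# vectors = load_glove("glove.6B.300d.txt")          # {word: np.ndarray}
# print(analogy(vectors, "man", "woman", "king"))    # ideally ("queen", ...)
```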
Continuous Bag-of-Words (CBOW)
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Intro
How can we predict a center word from the surrounding context in terms of word vectors?
One approach: if we treat {“The”, “cat”, “over”, “the”, “puddle”} as the context, then from these words the model should be able to predict or generate the center word “jum...
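A minimal sketch of this idea with made-up, untrained embeddings (not the CS224N assignment code): average the context word vectors, then score the average against every word as a candidate center word.

```python
import numpy as np

vocab = ["the", "cat", "jumped", "over", "puddle"]
word2id = {w: i for i, w in enumerate(vocab)}
dim = 8
rng = np.random.default_rng(0)
input_vecs = rng.normal(size=(len(vocab), dim))    # v_w: context (input) vectors
output_vecs = rng.normal(size=(len(vocab), dim))   # u_w: center (output) vectors

def cbow_center_probs(context_words):
    """P(center | context) = softmax(U . mean of the context vectors)."""
    h = np.mean([input_vecs[word2id[w]] for w in context_words], axis=0)
    scores = output_vecs @ h
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

probs = cbow_center_probs(["the", "cat", "over", "the", "puddle"])
print(vocab[int(np.argmax(probs))])   # most likely center word under these toy vectors
```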
CS224N W5. Self-Attention and Transformer
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Issue with recurrent models
$\checkmark$ Linear interaction distance
RNNs take O(sequence length) steps for distant word pairs to interact.
Reference: Stanford CS224n, 2021. Before the word 'was', the information from 'chef' has gone through O(sequenc...
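A minimal NumPy sketch of single-head scaled dot-product self-attention (random placeholder weight matrices, not the lecture's Transformer code): every position attends to every other position in a single step, so the maximum interaction distance is O(1) rather than O(sequence length).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
X = rng.normal(size=(seq_len, d_model))            # one vector per token position

# Placeholder projection matrices (these are learned in a real Transformer).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

def self_attention(X):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_model)            # (seq_len, seq_len): all pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
    return weights @ V                             # each output mixes all positions

print(self_attention(X).shape)                     # (6, 16)
```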
CS224N W4. Machine Translation, Sequence to Sequence, and Attention
All content is arranged from the CS224N materials. Please see CS224N for the details!
1. Statistical Machine Translation, SMT (1990s-2010s)
“Learn a probabilistic model from data”
We want to find the best English sentence y, given a French sentence x.
\[\operatorname{argmax}_y P(y \mid x)\]
Use Bayes Rule to break this down into two components to be learned sep...
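The decomposition being referred to: since $P(x)$ does not depend on $y$, Bayes' rule splits the search into two separately learned components,

\[\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y P(x \mid y)\, P(y)\]

where $P(x \mid y)$ is the translation model (learned from parallel data) and $P(y)$ is the language model (learned from monolingual English data).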