Hierarchical softmax
All content is adapted from the CS224N materials. Please see CS224N for details!
1. Intro
In practice, hierarchical softmax tends to be better for infrequent words, while negative sampling works better for frequent words and lower-dimensional vectors.
Hierarchical softmax uses a binary tree to represent all words in the vocabulary.
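As a rough sketch (my own illustration, not from the lecture notes), the probability of a word under hierarchical softmax is a product of sigmoid decisions along the path from the root of the tree to that word's leaf; `path_nodes`, `path_directions`, and `node_vecs` below are hypothetical names for the tree parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_word_probability(context_vec, path_nodes, path_directions, node_vecs):
    """P(word | context) as a product of binary decisions along the tree path.

    path_nodes      : indices of inner nodes from the root to the word's leaf
    path_directions : +1 / -1 for branching left / right at each inner node
    node_vecs       : one parameter vector per inner node of the tree
    """
    prob = 1.0
    for node, direction in zip(path_nodes, path_directions):
        prob *= sigmoid(direction * np.dot(node_vecs[node], context_vec))
    return prob
```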
2...
GloVe (Global Vectors for Word Representation)
All content is adapted from the CS224N materials. Please see CS224N for details!
1. Intro
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014
2. Previous Method
Count-based methods that rely on matrix factorization (e.g. Latent Semantic Analysis (LSA), Hyperspace Analogue to Language (HAL))
Effectively leverage gl...
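A minimal sketch of the count-based recipe these methods share: build a word-word co-occurrence matrix, then factorize it with a truncated SVD (the LSA-style step). The toy corpus and window size below are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy corpus and window size; the real methods (LSA, HAL)
# build these counts from large corpora.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
window = 1

# Word-word co-occurrence counts within the window.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

# Factorize the count matrix with a truncated SVD;
# rows of U * S serve as low-dimensional word vectors.
U, S, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * S[:k]
```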
Evaluation of Word Vectors
All content is adapted from the CS224N materials. Please see CS224N for details!
1. Intrinsic and Extrinsic Evaluation
$\checkmark$ Intrinsic evaluation
Evaluation on a specific, intermediate task
Fast to compute performance
Helps understand subsystem
Needs positive correlation with real task to determine usefulness
To train a ...
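One common intrinsic evaluation is the word-analogy task (a : b :: c : ?), scored with cosine similarity. The sketch below is only an illustration and assumes a trained vector matrix `vectors` and a `vocab` dictionary mapping each word to its row.

```python
import numpy as np

def analogy(a, b, c, vectors, vocab):
    """Intrinsic evaluation by analogy: a : b :: c : ?  (e.g. man : woman :: king : ?).

    vectors : (|V|, d) matrix of trained word vectors
    vocab   : dict mapping word -> row index in `vectors`
    """
    query = vectors[vocab[b]] - vectors[vocab[a]] + vectors[vocab[c]]
    # Cosine similarity of the query against every word vector.
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query) + 1e-8)
    # Exclude the three query words themselves from the candidates.
    for w in (a, b, c):
        sims[vocab[w]] = -np.inf
    inv = {i: w for w, i in vocab.items()}
    return inv[int(np.argmax(sims))]
```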
Continuous Bag-of-Words (CBOW)
All content is adapted from the CS224N materials. Please see CS224N for details!
1. Intro
How can we predict a center word from the surrounding context in terms of word vectors?
One approach: if we treat {"The", "cat", "over", "the", "puddle"} as the context, then from these words the model should be able to predict or generate the center word "jum...
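A minimal sketch of that idea, assuming input and output embedding matrices `V_in` and `U_out` (hypothetical names): average the context word vectors, then softmax over the vocabulary to score candidate center words.

```python
import numpy as np

def cbow_predict(context_ids, V_in, U_out):
    """CBOW sketch: average the context word vectors, then score every
    vocabulary word as a candidate center word.

    context_ids : indices of the context words (e.g. "The", "cat", "over", "the", "puddle")
    V_in        : (|V|, d) input (context) embedding matrix
    U_out       : (|V|, d) output (center) embedding matrix
    """
    v_hat = V_in[context_ids].mean(axis=0)          # averaged context vector
    scores = U_out @ v_hat                          # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                            # softmax over the vocabulary
    return probs                                    # P(center word | context)
```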
CS224N W5. Self attention and Transformer
All content is adapted from the CS224N materials. Please see CS224N for details!
1. Issue with recurrent models
$\checkmark$ Linear interaction distance
RNNs take O(sequence length) steps for distant word pairs to interact.
Reference: Stanford CS224n, 2021. Before the word 'was', the information about 'chef' has gone through O(sequenc...
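For contrast, here is a minimal numpy sketch of single-head self-attention, in which every pair of positions interacts in a single step regardless of distance; `Wq`, `Wk`, and `Wv` are assumed to be already-learned projection matrices.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention sketch: unlike an RNN, every pair of
    positions interacts in one step, regardless of how far apart they are.

    X          : (seq_len, d) input word representations
    Wq, Wk, Wv : (d, d) query / key / value projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (seq_len, seq_len) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over positions
    return weights @ V                                 # each output attends to all inputs
```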
CS224N W4. Machine Translation Sequence to Sequence And Attention
All content is adapted from the CS224N materials. Please see CS224N for details!
1. Statistical Machine Translation, SMT (1990s-2010s)
“Learn a probabilistic model from data”
We want to find the best English sentence y, given a French sentence x.
\[\operatorname{argmax}_y P(y \mid x)\]
Use Bayes Rule to break this down into two components to be learned sep...
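The standard decomposition separates a translation model from a language model:

\[\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y P(x \mid y)\,P(y)\]

where $P(x \mid y)$ models how words and phrases translate and $P(y)$ models fluent English.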
CS224N W3. RNN, Bi-RNN, GRU, and LSTM in dependency parsing
All content is adapted from the CS224N materials. Please see CS224N for details!
1. Language Model
Language Modeling is the task of predicting what word comes next.
the students opened their [——] → books? laptops? exams? minds?
A system that does this is called a Language Model. More formally: given a sequence of words $x^{(1)}, x^{(2)},...
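In the standard formulation, the model computes the probability distribution of the next word given the words so far:

\[P(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)})\]

where $x^{(t+1)}$ can be any word in the vocabulary.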
CS224N W2. Neural Networks and What Is Dependency Parsing
All content is adapted from the CS224N materials. Please see CS224N for details!
1. Named Entity Recognition (NER)
Task: Find and Classify names in text
Example
Reference: Stanford CS224n, 2021
Usages
Tracking mentions of particular entities in documents
For question answerin...