This course aims to teach:
- an understanding of the effective modern deep learning methods used in NLP
  - the basics, RNNs, attention, etc.
- a big-picture understanding of human languages and the difficulties in understanding and producing them
- an understanding of, and the ability to build, NLP systems (in PyTorch) for tasks such as word meaning, dependency parsing, machine translation, and question answering
Final project:
- default: question answering on the SQuAD dataset
Human language is one of humanity's most important inventions. Knowledge can be represented in language, and this is a large part of what makes human beings intelligent. Through speaking and writing, humans form something like a networked computer, with language as the communication medium.
Human language is also highly compressive: a short sentence, backed by a huge latent knowledge graph in the listener's head, can easily construct a complicated visual scene in the mind.
A central challenge of NLP is representing the meaning of a word:
- problems with resources like WordNet: they miss nuance and new meanings; they are subjective and require human labor to create and adapt; and they do not lend themselves to computation.
- traditional NLP (up to ~2012) treats words as discrete symbols, i.e., one-hot vectors. But the set of English words is effectively unbounded, and discrete symbols encode no relationships between words (the vectors are orthogonal, in mathematical terms; see the short sketch after this list).
  - a hand-built table of word similarities to compensate would be far too large and expensive to maintain.
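A minimal sketch of the orthogonality point, with a made-up three-word vocabulary (the words and the PyTorch usage here are purely illustrative):

```python
import torch

# Toy vocabulary (made up); "motel" and "hotel" are clearly related words,
# but one-hot vectors give them zero similarity.
vocab = ["motel", "hotel", "airplane"]
one_hot = torch.eye(len(vocab))          # each row is a one-hot word vector

motel = one_hot[vocab.index("motel")]
hotel = one_hot[vocab.index("hotel")]
print(torch.dot(motel, hotel))           # tensor(0.): any two distinct words are orthogonal
```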
Solutions:
- re-encode words as dense vectors in which similarity resides: similar words get similar vectors.
These problems lead to distributional semantics: a word's meaning is given by the words that frequently appear close by (its contexts).
You shall know a word by the company it keeps.
Context matters.
Compared to one-hot vectors, dense vectors also have practical benefits.
Word2vec is a framework for learning word vectors.
Objective:
$$
\begin{split}
\max \ J'(\theta) &= \prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} p(w_{t+j} \mid w_t; \theta) \\
\min \ J(\theta) &= -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log p(w_{t+j} \mid w_t; \theta) \\
p(o \mid c) &= \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}
\end{split}
$$
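A minimal PyTorch sketch of the softmax probability $p(o \mid c)$ above, with arbitrary vocabulary size and dimension (the numbers and variable names are placeholders, not the course's reference implementation):

```python
import torch
import torch.nn.functional as F

# Arbitrary sizes for illustration.
V_size, d = 10_000, 100
U = torch.randn(V_size, d, requires_grad=True)  # "outside" (context) vectors u_w
V = torch.randn(V_size, d, requires_grad=True)  # "center" vectors v_c

def log_p(o: int, c: int) -> torch.Tensor:
    """log p(o | c) = u_o^T v_c - log sum_w exp(u_w^T v_c)."""
    scores = U @ V[c]                     # u_w^T v_c for every word w in the vocabulary
    return F.log_softmax(scores, dim=0)[o]

# One (center, outside) pair contributes -log p(o | c) to J(theta).
loss = -log_p(o=42, c=7)
loss.backward()                           # gradients flow into both U and V
```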
[Explain the notebook and word vector visualization]
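A rough sketch of the kind of word vector visualization such a notebook shows, assuming pretrained GloVe vectors via gensim's downloader ("glove-wiki-gigaword-100" is an assumed model name from gensim-data, not necessarily what the course notebook uses):

```python
import gensim.downloader as api
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Pretrained 100-d GloVe vectors (downloaded on first use).
wv = api.load("glove-wiki-gigaword-100")

words = ["king", "queen", "man", "woman", "paris", "france", "berlin", "germany"]
vectors = [wv[w] for w in words]

# Project the 100-d vectors down to 2-D for plotting.
xy = PCA(n_components=2).fit_transform(vectors)

plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y))
plt.show()
```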
Word2vec parameters and computations
- dealing with high-frequency words
- the gradient in each window is sparse, since at most 2m + 1 distinct words appear in it
Solution: only update the word vectors that actually appear.
Either you need sparse matrix update operations to only update certain rows of full embedding matrices U and V, or you need to keep around a hash for word vectors.
The word2vec paper uses lots of tricks like this to make training practical.
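One way to get such sparse row updates in PyTorch is `nn.Embedding(..., sparse=True)` with an optimizer that accepts sparse gradients; this is a sketch of the idea with a toy loss, not the original word2vec implementation:

```python
import torch
import torch.nn as nn

V_size, d = 50_000, 100
center_emb = nn.Embedding(V_size, d, sparse=True)   # V matrix (center vectors)
outside_emb = nn.Embedding(V_size, d, sparse=True)  # U matrix (outside vectors)

# SparseAdam is one of the optimizers that supports sparse gradients.
opt = torch.optim.SparseAdam(
    list(center_emb.parameters()) + list(outside_emb.parameters()), lr=1e-3
)

centers = torch.tensor([7, 7, 123])    # center word ids in one batch (illustrative)
outsides = torch.tensor([42, 99, 8])   # observed outside word ids

scores = (center_emb(centers) * outside_emb(outsides)).sum(dim=1)  # u_o^T v_c per pair
loss = -nn.functional.logsigmoid(scores).mean()  # toy loss just to drive backward()
loss.backward()   # gradients are sparse: only rows 7, 123, 42, 99, 8 get touched
opt.step()
```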
**Problems with simple co-occurrence vectors:**
- Increase in size with vocabulary
- Very high dimensional
- Subsequent classification models have sparsity issues, so models are less robust
**Solution: Low-dimensional vectors**
- Idea: store "most" of the important information in a fixed, small number of dimensions: a dense vector
- Usually 25-1000 dimensions, similar to word2vec
- How to reduce the dimensionality?
SVD of the co-occurrence matrix X: factorize X into $U \Sigma V^T$, where U and V have orthonormal columns and $\Sigma$ is diagonal; keep only the top-k singular values to get a low-dimensional approximation of X.
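A minimal numpy sketch of this on a made-up toy corpus with a window of 1 (the corpus, window size, and k are arbitrary choices for illustration):

```python
import numpy as np

# Toy corpus and window size, chosen only for illustration.
corpus = "i like deep learning . i like nlp . i enjoy flying .".split()
window = 1
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Build the symmetric co-occurrence matrix X.
X = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            X[idx[w], idx[corpus[j]]] += 1

# Truncated SVD: keep the top-k singular directions as dense word vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]   # each row is a k-dimensional word vector
print(word_vectors[idx["nlp"]])
```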
Hacks to X: scaling the counts in the cells can help a lot.
- Problem: function words are too frequent, so syntax has too much impact. Some fixes:
  - min(X, t), with t $\approx$ 100
  - ignore them all
- Ramped windows that count closer words more
- Use Pearson correlations instead of counts, then set negative values to 0
- Etc.
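A small sketch of two of these fixes: clamping counts at t, and one possible ramped-window weighting (the exact weighting scheme is an assumption, not prescribed by the list above):

```python
import numpy as np

def clamp_counts(X: np.ndarray, t: float = 100.0) -> np.ndarray:
    """min(X, t): cap raw counts so overly frequent function words dominate less."""
    return np.minimum(X, t)

def ramped_weight(distance: int, window: int) -> float:
    """Ramped window: closer words count more; weight (window - distance + 1) / window."""
    return (window - distance + 1) / window

# With a window of 4, the adjacent word gets weight 1.0 and the farthest 0.25.
print([ramped_weight(d, 4) for d in range(1, 5)])  # [1.0, 0.75, 0.5, 0.25]
```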
Can we combine the two (count-based methods and direct prediction)?
GloVe does.
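The GloVe objective (from the GloVe paper) fits dot products of word vectors to log co-occurrence counts, where $w_i$, $\tilde{w}_j$ are center and context vectors, $b_i$, $\tilde{b}_j$ are biases, $X_{ij}$ is the co-occurrence count, and $f$ is a weighting function that caps the influence of very frequent pairs:

$$ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 $$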
Features:
- Fast training
- Scalable to huge corpora
- Good performance even with a small corpus and small vectors
Question: how do we choose a good dimensionality for word vectors?
~300 is usually good enough.
Paper: [On the Dimensionality of Word Embedding]