autoscale: true
#[fit] Pipelines
#[fit] Tabular Data
#[fit] Embeddings
Why might you consider two text messages similar?
We denote by $$\vec V(d)$$ the vector derived from document $$d$$, with one component for each term in the dictionary.

The similarity between two documents $$d_1$$ and $$d_2$$ is then the cosine of the angle between their vectors:

$$ sim(d_1, d_2) = \frac{\vec V(d_1) \cdot \vec V(d_2)}{|\vec V(d_1)| \, |\vec V(d_2)|} $$
[.footer: images from http://nlp.stanford.edu/IR-book/]
A collection can then be represented as a term-document matrix: for example, term counts in various "period" novels.
Consider the query *q = jealous gossip*. This query turns into the unit vector $$\vec v(q)$$, with weight $$1/\sqrt{2} \approx 0.707$$ on each of *jealous* and *gossip* and zero on every other term.
Cosine similarity is the dot product of unit vectors.
Wuthering Heights is the top-scoring document for this query with a score of 0.509.
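As a concrete illustration of the scoring mechanics, here is a minimal NumPy sketch. The term counts are made up for illustration (they will not reproduce the 0.509 figure above); only the procedure matches the text.

```python
import numpy as np

# Hypothetical term-count vectors for three novels over the terms
# (affection, jealous, gossip). Made-up numbers, not the IR-book figures.
docs = {
    "Sense and Sensibility": np.array([115., 10., 2.]),
    "Pride and Prejudice":   np.array([58.,  7.,  0.]),
    "Wuthering Heights":     np.array([20., 11.,  6.]),
}

def unit(v):
    return v / np.linalg.norm(v)

# The query "jealous gossip" puts equal weight on jealous and gossip.
q = unit(np.array([0., 1., 1.]))           # ~ (0, 0.707, 0.707)

for name, d in docs.items():
    score = unit(d) @ q                    # cosine similarity = dot of unit vectors
    print(f"{name}: {score:.3f}")
```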
[.footer: Images by Jay Alammar]
[.autoscale: false]
- Observe a bunch of people
- Infer personality traits from them
- The vector of traits is called an Embedding
- Who is more similar to Jay?
- Use Cosine Similarity of the vectors
Example:
Rossmann Kaggle Competition. Rossmann is a 3000-store European drugstore chain. The idea is to predict sales 6 weeks in advance.
Consider `store_id` as an example. This is a categorical predictor, i.e., its values come from a finite set.
We usually one-hot encode this: a single store is a length 3000 bit-vector with one bit flipped on.
- The 3000 stores have commonalities, but the one-hot encoding does not represent this
- Indeed the dot product (cosine similarity) of any two 1-hot vectors must be 0 (see the sketch after this list)
- Would be useful to learn a lower-dimensional embedding for the purpose of sales prediction.
- These store "personalities" could then be used in other models (different from the model used to learn the embedding) for sales prediction
- The embedding can be also used for other tasks, such as employee turnover prediction
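A quick NumPy sketch of the two points above; the embedding matrix here is random, standing in for one learned during training:

```python
import numpy as np

n_stores, emb_dim = 3000, 50

# One-hot encodings: any two distinct stores have a dot product (and hence
# cosine similarity) of exactly 0, so no notion of "similar stores" is captured.
a, b = np.zeros(n_stores), np.zeros(n_stores)
a[17], b[42] = 1.0, 1.0
print(a @ b)                                   # 0.0

# A learned embedding maps each store to a dense 50-d vector. Random numbers
# stand in for trained weights here, just to show the mechanics.
rng = np.random.default_rng(0)
W = rng.normal(size=(n_stores, emb_dim))       # row k = embedding of store k
cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos(W[17], W[42]))                       # generally nonzero; after training,
                                               # similar stores end up close together
```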
[.footer: Image from Guo and Berkhahn]
- Normally you would do a linear or MLP regression with sales as the target, and both continuous and categorical features
- The game is to replace the 1-hot encoded categorical features by "lower-width" embedding features, for each categorical predictor
- This is equivalent to considering a neural network with the output of an additional Embedding Layer concatenated in
- The Embedding layer is simply a linear transformation (no activation): the 1-hot vector times a learned weight matrix
A 1-hot vector for a categorical variable taking one of $$N$$ possible values is $$\bar x \in \{0,1\}^N$$, with a single 1 at the position of the observed category.

Then an embedding of width (dimension) $$d$$ is given by a weight matrix $$W \in \mathbb{R}^{N \times d}$$: the embedded representation is $$W^T \bar x$$, which is just the row of $$W$$ corresponding to the observed category.
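In code, this multiplication is just a row lookup. A minimal sketch, with a made-up 6-category variable and a width-3 embedding:

```python
import numpy as np

N, d = 6, 3                                  # 6 categories, embedding width 3
rng = np.random.default_rng(1)
W = rng.normal(size=(N, d))                  # the weights we will fit by SGD

x = np.zeros(N)
x[4] = 1.0                                   # 1-hot vector for category 4

print(W.T @ x)                               # linear map of the 1-hot vector...
print(W[4])                                  # ...is exactly row 4 of W: a lookup
```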
But how do we find these weights? We fit for them with the rest of the weights in the MLP!
[.code-highlight: all] [.code-highlight: 6] [.code-highlight: 12] [.code-highlight: 13-17] [.code-highlight: all]
```python
def build_keras_model():
    input_cat = []
    output_embeddings = []
    for k in cat_vars+nacols_cat: #categoricals plus NA booleans
        input_1d = Input(shape=(1,))
        output_1d = Embedding(input_cardinality[k], embedding_cardinality[k], name='{}_embedding'.format(k))(input_1d)
        output = Reshape(target_shape=(embedding_cardinality[k],))(output_1d)
        input_cat.append(input_1d)
        output_embeddings.append(output)
    main_input = Input(shape=(len(cont_vars),), name='main_input')
    output_model = Concatenate()([main_input, *output_embeddings])
    output_model = Dense(1000, kernel_initializer="uniform")(output_model)
    output_model = Activation('relu')(output_model)
    output_model = Dense(500, kernel_initializer="uniform")(output_model)
    output_model = Activation('relu')(output_model)
    output_model = Dense(1)(output_model)
    kmodel = KerasModel(
        inputs=[*input_cat, main_input],
        outputs=output_model
    )
    kmodel.compile(loss='mean_squared_error', optimizer='adam')
    return kmodel
```
```python
def fitmodel(kmodel, Xtr, ytr, Xval, yval, epochs, bs):
    h = kmodel.fit(Xtr, ytr, validation_data=(Xval, yval),
                   epochs=epochs, batch_size=bs)
    return h
```
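For completeness, here is a sketch of the imports and call pattern the two functions above assume. The variables `cat_vars`, `nacols_cat`, `input_cardinality`, `embedding_cardinality`, and `cont_vars` are defined elsewhere in the pipeline; the commented call is illustrative only.

```python
# Imports assumed by build_keras_model / fitmodel (tf.keras shown; standalone
# keras uses the same layer and model names).
from tensorflow.keras.layers import (Input, Embedding, Reshape, Concatenate,
                                     Dense, Activation)
from tensorflow.keras.models import Model as KerasModel

# The model has one input per categorical column plus one for the continuous
# block, so the training data is passed as a matching list of arrays, e.g.
# Xtr = [store_idx, day_of_week_idx, ..., continuous_matrix]
# kmodel = build_keras_model()
# history = fitmodel(kmodel, Xtr, ytr, Xval, yval, epochs=10, bs=128)
```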
We want to make a recommendation (on, say, a 1-5 scale) of an item $$m$$ for a user $$u$$.
We can write this as:

$$ Y_{um} \approx Y_{um}^{baseline}, \quad \mathrm{where} \quad Y_{um}^{baseline} = \mu + \bar \theta \cdot I_{u} + \bar \gamma \cdot I_{m} $$

where the unknown parameters $$\bar \theta$$ and $$\bar \gamma$$ hold the per-user and per-item bias deviations from the global mean rating $$\mu$$, and $$I_{u}$$, $$I_{m}$$ are 1-hot indicator vectors for user $$u$$ and item $$m$$.
Remember this is a sparse problem: most users have not rated most items, so we will want to regularize. Thus we want to minimize the loss:

$$ \sum_{(u,m)\,\mathrm{observed}} \left( Y_{um} - \mu - \bar \theta \cdot I_{u} - \bar \gamma \cdot I_{m} \right)^2 + \lambda \left( \|\bar \theta\|^2 + \|\bar \gamma\|^2 \right) $$
This is a Ridge Regression, or SGD with weight decay.
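A minimal sketch of fitting just this baseline by SGD with weight decay, on a handful of made-up (user, item, rating) triples; `lam` plays the role of $$\lambda$$ above:

```python
import numpy as np

# Made-up observed ratings: (user index, item index, rating on a 1-5 scale)
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0), (2, 2, 2.0)]
n_users, n_items = 3, 3

mu = np.mean([r for _, _, r in ratings])   # global mean rating
theta = np.zeros(n_users)                  # per-user biases
gamma = np.zeros(n_items)                  # per-item biases
lr, lam = 0.05, 0.1                        # learning rate, ridge strength

for _ in range(200):
    for u, m, y in ratings:
        err = y - (mu + theta[u] + gamma[m])
        # squared-error gradient step with weight decay (the ridge penalty)
        theta[u] += lr * (err - lam * theta[u])
        gamma[m] += lr * (err - lam * gamma[m])

print(theta, gamma)                        # learned user and item biases
```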
The baseline is not enough, though. Let's add a term capturing the interaction between user $$u$$ and item $$m$$.
Associate with each item $$m$$ a vector $$\bar q_m$$ of latent factors, and with each user $$u$$ a vector $$\bar p_u$$ of the same width.

Then we model the residuals as:

$$ Y_{um} - Y_{um}^{baseline} \approx \bar q_m \cdot \bar p_u $$

with the dot product capturing the user's overall interest in the item's characteristics.
So, we want:

$$ Y_{um} = \mu + \bar \theta \cdot I_{u} + \bar \gamma \cdot I_{m} + \bar q_m \cdot \bar p_u $$
To solve this we simply minimize the regularized risk of the entire regression, i.e.

$$ \sum_{(u,m)\,\mathrm{observed}} \left( Y_{um} - \mu - \bar \theta \cdot I_{u} - \bar \gamma \cdot I_{m} - \bar q_m \cdot \bar p_u \right)^2 + \lambda \left( \|\bar \theta\|^2 + \|\bar \gamma\|^2 + \sum_m \|\bar q_m\|^2 + \sum_u \|\bar p_u\|^2 \right) $$
We have seen this idea of mapping to a lower-dimensional latent space before (think PCA, SVD, matrix factorization). So why are we giving it another name?
- it is usually a map to a lower-dimensional space
- traditionally we have done linear dimensional reduction through PCA or SVD and truncation, but sparsity can throw a spanner into the works
- we train the weights of the embedding regression using SGD, along with the weights of the downstream task (here fitting the rating)
- the embedding can be used for alternate tasks, such as finding the similarity of users.
See how Spotify does all this...
```python
def embedding_input(emb_name, n_items, n_fact=20, l2regularizer=1e-4):
    inp = Input(shape=(1,), dtype='int64', name=emb_name)
    return inp, Embedding(n_items, n_fact, input_length=1, embeddings_regularizer=l2(l2regularizer))(inp)

usr_inp, usr_emb = embedding_input('user_in', n_users, n_fact=50, l2regularizer=1e-4)
mov_inp, mov_emb = embedding_input('movie_in', n_movies, n_fact=50, l2regularizer=1e-4)

def create_bias(inp, n_items):
    x = Embedding(n_items, 1, input_length=1)(inp)
    return Flatten()(x)

usr_bias = create_bias(usr_inp, n_users)
mov_bias = create_bias(mov_inp, n_movies)

def build_dp_bias_recommender(u_in, m_in, u_emb, m_emb, u_bias, m_bias):
    x = dot([u_emb, m_emb], axes=(2,2))
    x = Flatten()(x)
    x = add([x, u_bias])
    x = add([x, m_bias])
    bias_model = Model([u_in, m_in], x)
    bias_model.compile(Adam(0.001), loss='mse')
    return bias_model

bias_model = build_dp_bias_recommender(usr_inp, mov_inp, usr_emb, mov_emb, usr_bias, mov_bias)
```
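The imports the recommender snippet assumes, plus a sketch of how it would be fit; the id arrays and ratings named in the comment are placeholders:

```python
# Imports assumed by the recommender snippet above (tf.keras shown).
from tensorflow.keras.layers import Input, Embedding, Flatten, dot, add
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

# Training data: integer user ids, integer movie ids, and observed ratings, e.g.
# bias_model.fit([user_ids, movie_ids], ratings, epochs=5, batch_size=64)
```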
[.footer: images in this section from Illustrated word2vec]
- The vocabulary $$V$$ of a corpus (a large swath of text) can have 10,000 or more words: a 1-hot encoding is huge, and moreover similarities between words cannot be established
- we map words to a lower-dimensional latent space of size $$L$$ by training on some downstream task; we hope that the embeddings learnt are useful for other tasks as well.
See how man->boy parallels woman->girl, and the similarity of king and queen, for example. These are lower-dimensional GloVe embedding vectors.
We need to choose a downstream task. We could choose Language Modeling: predict the next word. We'll start with random "weights" for the embeddings and other parameters and run SGD. A trained model+embeddings would look like this:
How do we set up a training set?
Why not look both ways? This leads to the Skip-Gram and CBOW architectures...
Choose a window size (here 4) and construct a dataset by sliding the window across the text.
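A small sketch of this dataset construction on a toy sentence, taking 2 words on each side of the target (4 context words in all, matching the window above):

```python
# Build (target word, neighbouring word) pairs by sliding a window across.
sentence = "thou shalt not make a machine in the likeness of a human mind".split()
half_window = 2                       # words taken on each side of the target

pairs = []
for i, target in enumerate(sentence):
    lo, hi = max(0, i - half_window), min(len(sentence), i + half_window + 1)
    for j in range(lo, hi):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:5])
# [('thou', 'shalt'), ('thou', 'not'), ('shalt', 'thou'), ('shalt', 'not'), ('shalt', 'make')]
```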
- pre-trained word2vec and other embeddings (such as GloVe) are used everywhere in NLP today
- the ideas have been used elsewhere as well: AirBnB and Anghami model sequences of listings and songs using word2vec-like techniques
- Alibaba and Facebook use word2vec and graph embeddings for recommendations and social network analysis.