Ehsan edited this page Dec 1, 2018 · 5 revisions

Word Embedding Benchmark

Background

These days, there is growing interest in combining machine learning methods with NLP applications. One of the first steps in many solutions is to convert words to vectors. The simplest approach is to use one-hot vectors; word embedding methods use various techniques to map such long vectors into lower-dimensional spaces. In this project we are trying to build a standard benchmark for this task, focusing on the Persian language. There are many methods out there, and they are quite sensitive to their parameters, so for anyone who wants to work in this area, having a benchmark is very important.

Literature Review

One of the first attempts to perform dimensionality reduction was Latent Semantic Analysis. In recent years (2013 onward), new embedding methods have been introduced and widely adopted. Here we list the most famous methods used in the NLP field.

Word2Vec

This is the most well-known and widely used word embedding method. It is based on a series of articles by Mikolov et al. that are well accepted in the community. It comes in two models: Continuous Bag-of-Words (CBOW) and Skip-gram.

Glove

The GloVe method does not use a neural network architecture. Instead, it defines an objective function over global word co-occurrence counts and minimizes it with standard optimization methods. This is also one of the most widely used methods.
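
The GloVe objective is a weighted least-squares loss over co-occurrence counts. A simplified numpy illustration (toy data and plain gradient descent, not the reference implementation, which uses AdaGrad):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                                   # toy vocabulary size and embedding dimension

# Toy co-occurrence counts X[i, j] between word i and context word j.
X = rng.integers(1, 20, size=(V, V)).astype(float)

W = rng.normal(scale=0.1, size=(V, d))        # word vectors
W_tilde = rng.normal(scale=0.1, size=(V, d))  # context vectors
b = np.zeros(V)                               # word biases
b_tilde = np.zeros(V)                         # context biases

def f(x, x_max=100.0, alpha=0.75):
    # Weighting function that caps the influence of very frequent co-occurrences.
    return np.minimum((x / x_max) ** alpha, 1.0)

def loss():
    # J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    return np.sum(f(X) * (pred - np.log(X)) ** 2)

loss_before = loss()
# One gradient-descent step on the word vectors W.
err = f(X) * (W @ W_tilde.T + b[:, None] + b_tilde[None, :] - np.log(X))
W -= 0.01 * (2 * err @ W_tilde)
loss_after = loss()
print(loss_before, loss_after)
```

Each step nudges the dot products w_i · w~_j toward log X_ij, so the final vectors encode co-occurrence statistics directly.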

FastText

Elmo

Evaluation

Regardless of the method used to find the word embeddings, the evaluation methods are the same: the output is simply a set of vectors, so models can be compared fairly no matter how they were trained. One of the objectives of this project is to build a common framework for evaluating the output of the various models.

We will use the following evaluations on the models:

Analogy

In the analogy task we are given three words and must find a fourth that completes the relationship. For example: man is to woman as king is to queen.

Given a task of the form "A is to B as C is to ?", we compute the vector B - A + C and find the word whose vector is closest to it. "Closest" can be defined with several metrics:

  • Cosine Distance
  • Euclidean Distance
  • ....

In this project we are going to implement all of these methods, as well as build analogy datasets.
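
A sketch of the lookup step, supporting both metrics above (the embedding table and its values are toy data, not a trained model):

```python
import numpy as np

# Toy embedding table; real vectors would come from a trained model.
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([0.0, 2.0]),
}

def solve_analogy(a, b, c, metric="cosine"):
    """Answer 'a is to b as c is to ?' via the vector b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    best, best_score = None, None
    for word, vec in emb.items():
        if word in (a, b, c):          # the query words themselves are excluded
            continue
        if metric == "cosine":
            score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
            better = best_score is None or score > best_score
        else:                          # Euclidean distance: smaller is closer
            score = np.linalg.norm(target - vec)
            better = best_score is None or score < best_score
        if better:
            best, best_score = word, score
    return best

print(solve_analogy("man", "woman", "king"))              # -> queen
print(solve_analogy("man", "woman", "king", "euclidean")) # -> queen
```

Excluding the three query words from the search is the standard convention, since B - A + C is often closest to one of the inputs themselves.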

Word Similarity
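
Word-similarity benchmarks typically score a model by the Spearman rank correlation between the model's cosine similarities and human similarity judgments. A sketch with toy data (the word pairs, gold scores, and vectors below are hypothetical stand-ins for a real dataset and model):

```python
import numpy as np
from scipy.stats import spearmanr

# Toy embedding table.
emb = {
    "cat": np.array([1.0, 0.2]),
    "dog": np.array([0.9, 0.3]),
    "car": np.array([0.1, 1.0]),
    "bus": np.array([0.2, 0.9]),
}

# (word1, word2, human_score) -- a toy stand-in for a similarity dataset.
pairs = [("cat", "dog", 9.0), ("car", "bus", 8.5),
         ("cat", "car", 1.5), ("dog", "bus", 2.0)]

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [s for _, _, s in pairs]

# Spearman correlation compares rankings, so the absolute scale of the
# human scores versus the cosine values does not matter.
rho, _ = spearmanr(model_scores, human_scores)
print(round(rho, 3))
```

Because only the ranking matters, this metric is robust to the arbitrary scale of the human annotation scores.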

Named Entity Recognition

This one is considerably more challenging to evaluate, as other factors come into the picture.

Project Items

In this project we are going to perform the following activities. Before detailing the tasks, please note that this project is carried out in three stages:

  1. Project Assignment: This stage contains the basic steps you need to complete to get the assignment score.
  2. Course Project: As Dr. Shamsfard has agreed, you can choose to continue working on this project as your course project.
  3. Research Project: You can also decide to continue with this project in order to publish a paper or article.

The first stage is mandatory for the assignment, but you can continue working on the project afterwards. Below, we categorize the tasks by the stage they belong to. We have the following categories and subcategories:

  • Corpus: prepare, analyse, and cleanse the currently available Persian corpora, mostly news-crawl and weblog-crawl corpora.
  • Test Data Preparation: prepare test data for word embedding models, covering the analogy, similarity, and NER tasks.
  • Evaluation: write standard code that runs the tests over models on our test data. In this category we are also looking for tools to visualize various aspects of the embeddings.
  • Model Creation: start from some out-of-the-box tools to build models on top of our corpora and evaluate them on our data.
| Stage | Category | Task |
| --- | --- | --- |
| Assignment | Corpus | Prepare and collect the Persian news corpus |
| Assignment | Corpus | Prepare and collect the Persian weblog corpus |
| Assignment | Corpus | Prepare and collect the Persian Wikipedia corpus |
| Project | Corpus | Evaluate each corpus on the number of unique tokens and sentences, and remove major duplicates if possible |
| Research | Corpus | Define metrics for corpora and, where possible, evaluate each corpus against them |
| Assignment | Test Data | Prepare a test dataset for analogy |
| Assignment | Test Data | Prepare a test dataset for similarity based on WordNet |
| Assignment | Test Data | Prepare a test dataset for Named Entity Recognition (NER) |
| Assignment | Evaluation | Write evaluation code for analogy |
| Assignment | Evaluation | Write evaluation code for word similarity |
| Assignment | Evaluation | Write evaluation code for NER |
| Project | Evaluation | Write visualization code to present the embeddings in a more human-readable form; build datasets to compare the proximity of embeddings for certain sets of words |
| Assignment | Model | Create sample models with out-of-the-box tools on the project corpora |
| Project | Model | Fine-tune the method parameters to get the best results from the models |
| Research | Model | Write a report/article to establish a baseline for Persian word embedding |
| Research | Model | Study the impact of parameters on the quality of the word embeddings |