Home
These days, there is great interest in combining machine learning methods with NLP applications. One of the first steps in many solutions is to convert words to vectors. A trivial solution is to use one-hot vectors; word embedding methods use various techniques to map such long vectors into lower dimensions. In this project we are trying to build a standard benchmark for this task, focusing on the Persian language. There are many methods out there, and they are quite sensitive to parameters, so having a benchmark is very important for anyone who wants to work in this area.
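To make the dimensionality problem concrete, here is a minimal sketch (the toy vocabulary is illustrative) of a one-hot encoding, whose length grows with the vocabulary, in contrast to a dense embedding of fixed size:

```python
# Toy vocabulary; a real Persian corpus would have hundreds of thousands of types.
vocab = ["کتاب", "قلم", "مدرسه", "دانش"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector: as long as the vocabulary, with a single 1."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("قلم"))  # [0, 1, 0, 0] -- dimension equals vocabulary size
# An embedding instead maps each word to a short dense vector (e.g. 100-300
# floats) whose dimension is fixed regardless of vocabulary size.
```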
One of the first attempts to perform dimensionality reduction was Latent Semantic Analysis. In recent years (2013 onward), new embedding methods have been introduced and widely adopted. Here we list the most famous methods used in the NLP field.
Word2Vec is the most well-known and widely used word embedding method. It is based on a series of articles by Mikolov et al. that are well accepted in the community. It comes in two models:
- Skip-Gram with Negative Sampling (SGNS)
- Continuous Bag of Words (CBOW)

You can find very good resources about the method here:
- Beginner Introduction to Word Embedding
- Tensor Flow Description of Word2Vec
- Mikolov article on Skip-Gram
- Mikolov article on CBOW
GloVe does not use a neural network architecture. Instead, it defines an objective function and uses standard optimization methods to minimize it. This is also one of the most widely used methods.
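The objective GloVe minimizes is a weighted least-squares fit of word-vector dot products to the log co-occurrence counts $X_{ij}$:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

Here $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that down-weights very rare and very frequent co-occurrences.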
Regardless of the method used to find the word embeddings, the evaluation methods are the same: the output is simply a set of vectors, so whatever method produced it, it can be evaluated fairly. One of the objectives of this project is to build a common framework for evaluating the output of the various models.
We will use the following evaluations on the models:
In the analogy task we are given three words and must find a fourth that completes the relationship. For example: man is to woman as king is to queen.
The analogy task can be solved in various ways. Given a task "A is to B as C is to ?", you calculate the vector B - A + C and find the word whose vector is closest to it. "Closest" can be measured with multiple methods:
- Cosine Distance
- Euclidean Distance
- ....
In this project we are going to implement all of these methods, as well as build analogy datasets.
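The analogy computation can be sketched with plain NumPy. The embeddings below are toy values chosen by hand, not a trained model:

```python
import numpy as np

# Toy embeddings (illustrative values only).
emb = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([0.9, 0.1, 0.9]),
    "queen": np.array([0.9, 1.1, 0.9]),
}

def analogy(a, b, c, metric="cosine"):
    """Solve 'a is to b as c is to ?' by finding the word closest to b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    best, best_score = None, None
    for word, vec in emb.items():
        if word in (a, b, c):          # exclude the query words themselves
            continue
        if metric == "cosine":
            score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
            better = best_score is None or score > best_score
        else:                          # Euclidean distance: smaller is better
            score = np.linalg.norm(target - vec)
            better = best_score is None or score < best_score
        if better:
            best, best_score = word, score
    return best

print(analogy("man", "woman", "king"))  # expected: queen
```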
This task is considerably more challenging to evaluate, as other factors come into the picture.
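Word-similarity evaluation, for instance, is commonly scored by the rank correlation between the model's similarities and human judgments. A minimal sketch, using made-up scores and a Spearman implementation that assumes no tied values:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation, assuming no ties (Pearson on the ranks)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(np.dot(rx, ry) / (np.linalg.norm(rx) * np.linalg.norm(ry)))

# Hypothetical human similarity judgments for four word pairs (0-10 scale)...
human = [9.0, 7.5, 3.0, 1.0]
# ...and the cosine similarities a model assigned to the same pairs.
model = [0.85, 0.70, 0.30, 0.05]

print(round(spearman(human, model), 3))  # 1.0 -- identical rankings
```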
In this project we are going to perform the following activities. Before we detail the tasks, please note that this project is performed in three stages:
- Project Assignment: This contains the basic steps you need to complete to get the assignment score.
- Course Project: As Dr. Shamsfard has agreed, you can choose to continue working on this project as your course project.
- Research Project: You can also decide to continue with this project in order to publish a paper or article.
The first stage is mandatory for the assignment, but you can continue working on the project beyond it. We categorize the tasks below based on the stages they belong to. We have the following categories and subcategories:
- Corpus: Prepare, analyse, and cleanse currently available Persian corpora, mostly news-crawl and weblog-crawl corpora.
- Test Data Preparation: Prepare test data for word embedding models, covering the analogy, similarity, and NER tasks.
- Evaluation: Write standard code that runs the tests on models over our test data. In this category we are also looking for tools to visualize many aspects of the embeddings.
- Model Creation: Start from out-of-the-box tools to build models on top of our corpora and evaluate them on our data.
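For the corpus cleansing step, removing the most obvious duplicates can start from exact-match hashing of normalized sentences. A minimal sketch, where the normalization is illustrative rather than a full Persian pipeline:

```python
import hashlib

def normalize(sentence):
    """Rough normalization: collapse whitespace and unify the Arabic 'ي'/'ك'
    characters to their Persian 'ی'/'ک' forms (a common Persian cleanup step)."""
    s = " ".join(sentence.split())
    return s.replace("ي", "ی").replace("ك", "ک")

def deduplicate(sentences):
    """Keep only the first occurrence of each normalized sentence."""
    seen, kept = set(), []
    for s in sentences:
        h = hashlib.sha1(normalize(s).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(s)
    return kept

corpus = ["این یک جمله است", "این  یک جمله است", "جملهٔ دیگری است"]
print(len(deduplicate(corpus)))  # 2 -- the extra-whitespace variant is removed
```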
| Stage | Category | Task |
|---|---|---|
| Assignment | Corpus | Prepare and collect the Persian news corpus |
| Assignment | Corpus | Prepare and collect the Persian weblog corpus |
| Assignment | Corpus | Prepare and collect the Persian Wikipedia corpus |
| Project | Corpus | Evaluate each corpus on the number of unique tokens and sentences, and remove major duplicates if possible |
| Research | Corpus | Create metrics for corpora and, if possible, evaluate each corpus against them |
| Assignment | Test Data | Prepare a test dataset for analogy |
| Assignment | Test Data | Prepare a test dataset for similarity based on WordNet |
| Assignment | Test Data | Prepare a test dataset for Named Entity Recognition (NER) |
| Assignment | Evaluation | Write evaluation code for analogy |
| Assignment | Evaluation | Write evaluation code for word similarity |
| Assignment | Evaluation | Write evaluation code for NER |
| Project | Evaluation | Write visualization code to present the embeddings in a more human-readable form; build datasets to compare the proximity of embeddings of certain sets of words |
| Assignment | Model | Create sample models with out-of-the-box tools on the project corpora |
| Project | Model | Fine-tune the method parameters to get the best results from the models |
| Research | Model | Write a report/article establishing a baseline for Persian word embedding |
| Research | Model | Study the impact of parameters on the quality of the word embeddings |