Comparison of company names and search for similar companies in our database
Our solution consists of several main parts:
- Classifier - solves a classification task: decide whether two names refer to the same firm
- Recommendation - solves a recommendation task: suggest the top-n most similar company names for a given company name
Each part has a strict interface and is independent of the others.
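As a sketch of what such a shared interface might look like (the class and method names below are illustrative, not the actual ones in `src/`):

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class CompanyMatcher(ABC):
    """Hypothetical common interface that both parts could implement."""

    @abstractmethod
    def is_same_company(self, name_a: str, name_b: str) -> bool:
        """Classifier part: decide whether two names refer to one firm."""

    @abstractmethod
    def recommend(self, name: str, n: int = 5) -> List[Tuple[str, float]]:
        """Recommendation part: return top-n (candidate, score) pairs."""


class ExactMatcher(CompanyMatcher):
    """Trivial reference implementation for demonstration only."""

    def __init__(self, known_names: List[str]):
        self.known_names = known_names

    def is_same_company(self, name_a: str, name_b: str) -> bool:
        return name_a.strip().lower() == name_b.strip().lower()

    def recommend(self, name: str, n: int = 5) -> List[Tuple[str, float]]:
        # Score by exact match only; the real models score by embeddings.
        scored = [(c, 1.0 if self.is_same_company(name, c) else 0.0)
                  for c in self.known_names]
        return sorted(scored, key=lambda x: x[1], reverse=True)[:n]
```

Any of the three models (BERT, Sentence Transformers, FastText) can then be swapped in behind the same two methods.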
This repository presents three methods for solving the problem:
- Using Bert
- Using Sentence Transformers
- Using FastText
.
├── data
├── notebooks <- Jupyter notebooks
├── README.md <- The top-level README for developers using this project
├── requirements.txt <- The requirements file for reproducing the analysis environment
├── weights <- Empty folder for saving results
├── src
│ ├── bert <- Folder that contains bert solution
│ ├── fasttext <- Folder that contains fasttext solution
│ ├── sentence_bert <- Folder that contains the sentence transformers solution
│ ├── utils
└── tutorial.ipynb <- Demonstration work
To track the results of experiments, we used MLflow, an open-source platform for managing the machine learning lifecycle. We used the architecture shown in the picture below.
The following tasks were solved with the help of MLflow:
Classification task
| Model | F1 Macro Score |
|---|---|
| Bert | 0.97 |
| Sentence Bert | 0.61 |
| FastText | 0.87 |
Sentence Bert scores worse than Bert because of how the model is used: Sentence Bert builds embeddings and the decision is based on the cosine distance between them, so the names of different companies that share words end up with similar embeddings.
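A small illustration of this effect, using toy bag-of-words vectors in place of real Sentence Bert embeddings (the helper functions below are illustrative, not part of this repo):

```python
import math
from collections import Counter


def embed(name: str) -> Counter:
    """Toy bag-of-words 'embedding'; stands in for a Sentence Bert vector."""
    return Counter(name.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Two names for the same firm vs. two different firms sharing words:
# both pairs share two of three words, so cosine cannot tell them apart.
same = cosine(embed("Acme Trading Ltd"), embed("Acme Trading Limited"))
diff = cosine(embed("Acme Trading Ltd"), embed("Global Trading Ltd"))
```

Here `same` and `diff` come out identical (about 0.67), which is exactly the failure mode: a cosine threshold cannot separate a true duplicate from a different firm with overlapping words.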
We tested three different models:
- bert
- sentence transformer
- FastText
- Save artifacts

You can combine them however you like. Be careful with experiments and check the results.
To demonstrate the results of the project, you can use tutorial.ipynb. Before using it, install the project dependencies:
pip install -r requirements.txt
After installing the dependencies, run the following commands from the root folder of the repository:
# Linux command
chmod +x load_data.sh
./load_data.sh
Link to the directory with all the weights used in this work.