Comparison of company names and search for similar companies in our database
Our solution consists of several main parts:
- Classifier - solves a classification task: decide whether two names refer to the same firm
- Recommendation - solves a recommendation task: suggest the top-n most similar company names for a given company name
Each part has a strict interface and is independent of the others.
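As a sketch of what such a shared interface might look like (the class and method names below are illustrative, not the actual ones in `src/`):

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class CompanyMatcher(ABC):
    """Hypothetical common interface that both parts could implement."""

    @abstractmethod
    def is_same_company(self, name_a: str, name_b: str) -> bool:
        """Classifier part: decide whether two names refer to one firm."""

    @abstractmethod
    def recommend(self, name: str, n: int = 5) -> List[Tuple[str, float]]:
        """Recommendation part: return top-n (candidate, score) pairs."""


class ExactMatcher(CompanyMatcher):
    """Trivial reference implementation for demonstration only."""

    def __init__(self, known_names: List[str]):
        self.known_names = known_names

    def is_same_company(self, name_a: str, name_b: str) -> bool:
        return name_a.strip().lower() == name_b.strip().lower()

    def recommend(self, name: str, n: int = 5) -> List[Tuple[str, float]]:
        # Score by exact match only; the real models score by embeddings.
        scored = [(c, 1.0 if self.is_same_company(name, c) else 0.0)
                  for c in self.known_names]
        return sorted(scored, key=lambda x: x[1], reverse=True)[:n]
```

Any of the three models (BERT, Sentence Transformers, FastText) can then be swapped in behind the same two methods.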
This repository presents three methods for solving the problem:
- Using Bert
- Using Sentence Transformers
- Using FastText
.
├── data
├── notebooks <- Jupyter notebooks
├── README.md <- The top-level README for developers using this project
├── requirements.txt <- The requirements file for reproducing the analysis environment
├── weights <- Empty folder for saving results
├── src
│ ├── bert <- Folder that contains bert solution
│ ├── fasttext <- Folder that contains fasttext solution
│ ├── sentence_bert <- Folder that contains the sentence transformers solution
│ ├── utils
└── tutorial.ipynb <- Demonstration work
To track the results of experiments, we used MLflow, an open-source platform for managing the machine learning lifecycle. We used the architecture shown in the picture below.
The following tasks were solved with the help of MLflow:
Classification task
| Model | F1 Macro Score |
|---|---|
| Bert | 0.97 |
| Sentence Bert | 0.61 |
| FastText | 0.87 |
Sentence Bert scores worse than Bert because of how the model is used: Sentence Bert builds embeddings and the decision is based on the cosine distance between them, so the names of different companies that share words end up with similar embeddings.
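A small illustration of this effect, using toy bag-of-words vectors in place of real Sentence Bert embeddings (the helper functions below are illustrative, not part of this repo):

```python
import math
from collections import Counter


def embed(name: str) -> Counter:
    """Toy bag-of-words 'embedding'; stands in for a Sentence Bert vector."""
    return Counter(name.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Two names for the same firm vs. two different firms sharing words:
# both pairs share two of three words, so cosine cannot tell them apart.
same = cosine(embed("Acme Trading Ltd"), embed("Acme Trading Limited"))
diff = cosine(embed("Acme Trading Ltd"), embed("Global Trading Ltd"))
```

Here `same` and `diff` come out identical (about 0.67), which is exactly the failure mode: a cosine threshold cannot separate a true duplicate from a different firm with overlapping words.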
We tested three different models:
- bert
- sentence transformer
- FastText
- Save artifacts

You can combine them however you like. Be careful with experiments and check the results.
To demonstrate the results of the project, you can use tutorial.ipynb. Before using it, install the project dependencies:
pip install -r requirements.txt
After installing the dependencies, run the following commands from the root folder of the repository:
# Linux command
chmod +x load_data.sh
./load_data.sh
Link to the directory with all the weights used in this work.