This repository is a proof of concept for a semantic vector search engine for Digikala products. It combines Elasticsearch, a powerful open-source search and analytics engine, with sentence-transformers, a Python framework for state-of-the-art sentence, text, and image embeddings.
The search engine works by converting product titles into 1024-dimensional vectors using the `intfloat/multilingual-e5-large` transformer model and indexing those vectors in Elasticsearch. When a search query arrives, it is converted into a vector in the same embedding space, the closest matching vectors in the index are retrieved via k-nearest-neighbor (kNN) search under cosine similarity, and the corresponding products are returned as the search results.
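As an illustration, here is a minimal sketch of the query side. The index name `products` and the vector field name `title_vector` are assumptions for this example, not names taken from the repository; note that E5 models are trained with a `query: ` prefix on the search side.

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")
es = Elasticsearch("http://localhost:9200")  # assumption: a local cluster

def semantic_search(query: str, k: int = 10):
    # E5 models expect a "query: " prefix on search queries.
    vector = model.encode("query: " + query, normalize_embeddings=True)
    return es.search(
        index="products",  # assumed index name
        knn={
            "field": "title_vector",  # assumed vector field name
            "query_vector": vector.tolist(),
            "k": k,
            "num_candidates": 100,
        },
    )

for hit in semantic_search("گوشی سامسونگ")["hits"]["hits"]:  # "Samsung phone"
    print(hit["_score"], hit["_source"]["title_fa"])
```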
- Clone the repository:

  ```bash
  git clone https://github.com/ArmanJR/Digikala-Vector-Search
  cd Digikala-Vector-Search
  ```

- Create a virtual environment and install the dependencies:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```
- Open `indexData.ipynb` and set the environment variables:
  - `ELASTIC_ENDPOINT`: Local or remote Elasticsearch endpoint (a free trial cluster is available on Elastic Cloud)
  - `ELASTIC_USERNAME`: Username for the Elasticsearch cluster
  - `ELASTIC_PASSWORD`: Password for the Elasticsearch cluster
  - `ELASTIC_INDEX`: Index name for the products
  - `DIGIKALA_DATASET_PATH`: Path to the products dataset, which is not included in the git repository. Download it from Kaggle: https://www.kaggle.com/datasets/radeai/digikala-comments-and-products
  - `CUSTOM_DATASET_PATH`: Path to a custom dataset of examples and edge cases to merge with the original dataset. It must match the format of the Digikala dataset (`id,title_fa,Rate,Rate_cnt,Category1,Category2,Brand,Price,Seller,Is_Fake,min_price_last_month,sub_category`)
  - `SAMPLE_COUNT`: Number of samples to index from the full dataset (warning: since each vector has 1024 dimensions, a high value will consume a lot of memory and time; keep it low for testing)
  - `RANDOM_STATE`: Random seed for sampling the dataset
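  For reference, a minimal sketch of how the notebook is assumed to consume these variables; the `title_vector` field name and the mapping below are illustrative assumptions, not the repository's exact code:

  ```python
  import os
  from elasticsearch import Elasticsearch

  # Connect to the cluster using the environment variables listed above.
  es = Elasticsearch(
      os.environ["ELASTIC_ENDPOINT"],
      basic_auth=(os.environ["ELASTIC_USERNAME"], os.environ["ELASTIC_PASSWORD"]),
  )

  # A 1024-dim dense_vector field enables cosine kNN search (field name is hypothetical).
  es.indices.create(
      index=os.environ["ELASTIC_INDEX"],
      mappings={
          "properties": {
              "title_fa": {"type": "text"},
              "title_vector": {
                  "type": "dense_vector",
                  "dims": 1024,
                  "index": True,
                  "similarity": "cosine",
              },
          }
      },
  )
  ```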
- Run the notebook `indexData.ipynb` step by step to index the data (a sketch of the core indexing loop appears after this list)
- Open `searchApp.py` and set its environment variables
- Run the search app:

  ```bash
  streamlit run searchApp.py
  ```
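To make the indexing step concrete, here is a minimal sketch of what the notebook's core loop might look like. The column name `title_fa` comes from the dataset format above, while `title_vector` and the overall structure are assumptions rather than the notebook's exact code; E5 models expect a `passage: ` prefix on indexed documents.

```python
import os
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")
es = Elasticsearch(
    os.environ["ELASTIC_ENDPOINT"],
    basic_auth=(os.environ["ELASTIC_USERNAME"], os.environ["ELASTIC_PASSWORD"]),
)

# Load the dataset and sample it with a fixed seed for reproducibility.
df = pd.read_csv(os.environ["DIGIKALA_DATASET_PATH"])
sample = df.sample(
    n=int(os.environ["SAMPLE_COUNT"]),
    random_state=int(os.environ["RANDOM_STATE"]),
)

# E5 models are trained with a "passage: " prefix on the document side.
embeddings = model.encode(
    ["passage: " + str(title) for title in sample["title_fa"]],
    normalize_embeddings=True,
    show_progress_bar=True,
)

# Bulk-index each title together with its 1024-dim embedding.
actions = (
    {
        "_index": os.environ["ELASTIC_INDEX"],
        "_source": {"title_fa": title, "title_vector": emb.tolist()},
    }
    for title, emb in zip(sample["title_fa"], embeddings)
)
bulk(es, actions)
```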
- Use a more capable yet lightweight transformer model (preferably one fine-tuned on Persian content) for better embeddings
- Include images in embeddings for multimodal search
- Implement a feedback loop to improve search results over time
License: MIT