GitHub - MianWang123/Information-Retrieval: Given a set of Amazon Reviews, the task is to find all the similar reviews & reviewerIDs in the Amazon Reviews, and to find similar reviews in the database to any new input review.

Project Title

Information Retrieval (Locality-Sensitive Hashing)

Goal and Process

The goal is to find pairs of similar reviews in the AmazonReviews dataset (108 MB). In addition, given any new review, the code finds the most similar review in the database.

The process involves data preprocessing (removing stop words and punctuation), k-shingling of the reviews (k = 5), locality-sensitive hashing (min-hashing with m permutations, with the signatures divided into b bands), computing the Jaccard distance (1 - Jaccard similarity), and finding similar pairs of reviews.
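The min-hashing and banding steps described above can be sketched as follows. This is a minimal illustration, not the code in 'lsh.py': it assumes character 5-shingles, replaces explicit permutations with random hash functions of the form (a*x + c) mod p (a common substitute), and uses small hypothetical values for m and b. The names `minhash_signature` and `candidate_pairs` are made up for this example.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus (assumption)

def shingles(text, k=5):
    """Return the set of character k-shingles of text (assumes len(text) >= k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, coeffs):
    """One minimum per hash function; each hash stands in for a permutation."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + c) % PRIME for x in ids) for a, c in coeffs]

def candidate_pairs(signatures, bands):
    """Pairs of document indices whose signatures agree on at least one band."""
    rows = len(signatures[0]) // bands  # assumes bands divides the signature length
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(ids, 2))
    return candidates

# Hypothetical parameters and toy documents, for illustration only.
m, b = 20, 10
rng = random.Random(0)
coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(m)]
docs = ["catsseemloveimportant", "catsseemloveimportant", "batterydiedaftertwoweeks"]
signatures = [minhash_signature(shingles(d), coeffs) for d in docs]
pairs = candidate_pairs(signatures, b)
print(pairs)
```

Only the candidate pairs need an exact Jaccard computation afterwards, which is what makes the approach cheaper than comparing every pair of reviews.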

Introduction

'lsh.py' takes 16 minutes to run in total on Google Colab.
The script prints a reminder after each of its 11 steps completes.
In step 11, you can change the default input review to any string longer than 5 characters to search for a similar review.

Prerequisites

Put 'amazonReviews.json' in the same directory as 'lsh.py'.
The dataset can be found here: https://drive.google.com/file/d/1UMAL2OULAEpdhlSUSgtUMy7ErxVuYNdR/view?usp=sharing

Data Visualization

For Step 4, I randomly picked 10,000 pairs of reviews and plotted the distribution of their Jaccard distances, which gives a glimpse of how different the reviews in AmazonReviews are from one another.

For Step 5, the probability of a hit is plotted against similarity for different parameter settings, in order to choose appropriate values for m (permutations) and b (bands).
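The curve in question is usually derived from the standard banding analysis: with m hash values split into b bands of r = m / b rows each, two reviews with Jaccard similarity s collide in at least one band with probability 1 - (1 - s^r)^b. The sketch below tabulates that formula for a few hypothetical (m, b) settings; the function names and parameter values are assumptions for illustration and are not taken from 'lsh.py'.

```python
def hit_probability(s, m, b):
    """Chance that two items with Jaccard similarity s share at least one band."""
    r = m // b  # rows per band; assumes b divides m
    return 1.0 - (1.0 - s ** r) ** b

def approx_threshold(m, b):
    """Rough similarity at which the S-curve rises most steeply: (1/b)^(1/r)."""
    r = m // b
    return (1.0 / b) ** (1.0 / r)

# Hypothetical parameter settings to compare.
for m, b in [(20, 5), (20, 10), (100, 20)]:
    curve = [round(hit_probability(s / 10, m, b), 3) for s in range(0, 11, 2)]
    print(f"m={m:3d} b={b:2d} threshold~{approx_threshold(m, b):.2f} curve={curve}")
```

More bands (smaller r) shift the curve to the left, catching more pairs at the cost of more false candidates; fewer bands do the opposite.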

For Step 9, I plotted the distribution of Jaccard similarity among the nearest duplicates. Their Jaccard similarities are very high, as expected for similar pairs.

For Step 11, if the input review is 'Love is the most important thing', the most similar review in the database is returned together with its ReviewerID and Jaccard similarity, as listed below.

          ReviewerID: A3NTOYUJYVKOV7 
          ReviewText: "cats seem love important thing good sits dish" 
          Jaccard Similarity: 0.4117647058823529
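The lookup can be illustrated with a small sketch. It assumes character 5-shingles taken after stop words, punctuation, and spaces are removed; under that assumption the similarity between the example query and the review above works out to 14/34, which matches the reported value, though this does not confirm how 'lsh.py' is actually written. The stop-word list, the second database entry, and the brute-force scan (used here in place of an LSH bucket lookup) are all hypothetical.

```python
import string

# Hypothetical, tiny stop-word list; the real script may use a larger one.
STOPWORDS = {"is", "the", "most", "a", "an", "and", "of", "to"}

def preprocess(text):
    """Lowercase, strip punctuation and stop words, then join without spaces."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return "".join(w for w in words if w not in STOPWORDS)

def shingles(text, k=5):
    """Return the set of character k-shingles of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(query, database):
    """Brute-force scan standing in for the LSH lookup."""
    q = shingles(preprocess(query))
    best_id, best_sim = None, -1.0
    for reviewer_id, text in database.items():
        sim = jaccard_similarity(q, shingles(preprocess(text)))
        if sim > best_sim:
            best_id, best_sim = reviewer_id, sim
    return best_id, best_sim

database = {
    "A3NTOYUJYVKOV7": "cats seem love important thing good sits dish",
    "HYPOTHETICAL_ID": "battery died after two weeks of light use",  # made-up entry
}
best_id, best_sim = most_similar("Love is the most important thing", database)
print(best_id, best_sim)
```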

Acknowledgments

Special thanks to the professor of the ESE545 course for providing the dataset.
