Machine Learning Approaches for Bookstore Inventory Management: Book Rating Predictive System and Segmented Book Recommendation System
This project applies machine learning to bookstore management and customer interaction. Using data analysis and preprocessing, decision tree analysis, and K-Means clustering, it aims to support decision-making and improve business management practices within the bookstore industry.
The project begins with a data preprocessing phase that handles missing values and anomalies and standardizes textual data, so that the dataset is clean, consistent, and ready for analysis. A decision tree model is then trained to predict book ratings, which can inform inventory management decisions.
Additionally, K-Means clustering is used for book segmentation, which supports tailored book recommendations based on customer preferences. Through feature selection, model training, and evaluation, the project demonstrates how machine learning can help optimize bookstore operations and improve the customer experience.
The project report, titled "Machine Learning Techniques for Bookstore Inventory Management: Decision Trees for Book Rating Prediction and K-Means Clustering for Book Segmentation," provides a detailed analysis and documentation of the entire project. The report includes:
-
Abstract:
- Overview of the significance of machine learning in business decision-making, specifically in the context of a bookstore.
- Explanation of the methodologies used, including decision tree analysis and K-Means clustering, and their practical applications in predicting book ratings and segmenting books for customer recommendations.
-
Introduction:
- Detailed introduction to the project objectives, highlighting the importance of data preprocessing in ensuring the quality and reliability of the dataset.
- Description of the decision tree and K-Means clustering techniques employed in the project.
-
Methodology:
- Comprehensive explanation of the data preprocessing steps, including handling missing data, text processing, and data discretization.
- Detailed description of the decision tree prediction system, covering feature selection, data preparation, model training, cross-validation, and evaluation.
- Explanation of the K-Means clustering approach for book segmentation, including encoding of book titles, preprocessing, and determining the optimal number of clusters.
-
Data Exploration and Analysis:
- Analysis of feature importance and distribution of ratings and rating categories.
- Examination of cluster composition and the impact of highly rated book filtering on recommendations.
-
Results:
- Presentation of the decision tree model's accuracy and implications.
- Discussion of the clustering analysis results, highlighting the composition of clusters and the effectiveness of the recommendation strategy.
-
Limitations and Improvements:
- Identification of limitations in data preprocessing and the decision tree model, along with potential improvements such as data augmentation, resampling techniques, and alternative modeling approaches.
- Suggestions for enhancing the clustering method by incorporating book content analysis and considering different encoding methods and clustering techniques.
-
Conclusion:
- Summary of the study's findings, emphasizing the practical applications of machine learning in bookstore management.
- Recommendations for future advancements to address data imbalances, refine outlier management strategies, and explore innovative methodologies.
The preprocess.py file is an essential component of the project, responsible for preprocessing the raw data before feeding it into the machine learning models. This script employs various preprocessing techniques to ensure the quality and integrity of the dataset, making it suitable for analysis and model training.
-
weighted_age_dect
- This function calculates the weights for different age groups based on the distribution of user ages in the raw data.
- Returns a dictionary containing the weights for each age group.
-
ages_imputation
- Utilizes the weights generated by the weighted_age_dect function to impute missing age data.
- Ensures that the imputed age distribution closely matches that of the raw data.
- Handles erroneous age data exceeding 100 in a consistent manner.
-
country_imputation
- Imputes missing country data by randomly selecting countries from the known range of countries in the dataset.
- Ensures that only valid country names are selected to maintain data integrity.
-
city_imputation
- Imputes missing city data by randomly selecting cities from the known range of cities in the dataset.
- Ensures that only valid city names are selected to maintain data integrity.
-
state_imputation
- Imputes missing state data by randomly selecting states from the known range of states in the dataset.
- Ensures that only valid state names are selected to maintain data integrity.
-
author_imputation
- Fills missing author data for books with "NO AUTHOR" to maintain consistency in the dataset.
-
discretising
- Discretizes user ages into 10-year bins, ranging from 0 to 100.
- Discretizes ratings into three categories: "low," "medium," and "high."
- Merges the DataFrames of book ratings, book information, and user information.
-
text_process
- Normalizes text data by transforming all text into lowercase.
- Handles mismatches in country formats to ensure uniformity in the dataset.
-
compute_probability
- Computes the probability of a feature in the dataset.
-
compute_entropy
- Computes the entropy of a feature in the dataset.
-
compute_conditional_entropy
- Computes the conditional entropy of two features in the dataset.
-
compute_information_gain
- Computes the information gain of a feature with respect to a target variable in the dataset.
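The four information-theoretic helpers above can be sketched as follows. This is a minimal illustration rather than the code in preprocess.py: the function names are taken from the list above, but the signatures (each taking pandas Series) and the use of base-2 logarithms are assumptions.

```python
import numpy as np
import pandas as pd


def compute_probability(series: pd.Series) -> pd.Series:
    """Relative frequency of each distinct value in the series."""
    return series.value_counts(normalize=True)


def compute_entropy(series: pd.Series) -> float:
    """Shannon entropy H(X) in bits."""
    p = compute_probability(series)
    return float(-(p * np.log2(p)).sum())


def compute_conditional_entropy(feature: pd.Series, target: pd.Series) -> float:
    """H(target | feature): entropy of the target inside each feature value,
    weighted by how often that feature value occurs."""
    weights = compute_probability(feature)
    return float(sum(w * compute_entropy(target[feature == value])
                     for value, w in weights.items()))


def compute_information_gain(feature: pd.Series, target: pd.Series) -> float:
    """IG(target; feature) = H(target) - H(target | feature)."""
    return compute_entropy(target) - compute_conditional_entropy(feature, target)


# Toy data (made up for illustration): age_group fully determines the rating
# category, while coin carries no information about it.
toy = pd.DataFrame({
    "age_group": ["20-29", "20-29", "30-39", "30-39"],
    "coin": ["h", "t", "h", "t"],
    "rating_category": ["low", "low", "high", "high"],
})
gain_informative = compute_information_gain(toy["age_group"], toy["rating_category"])
gain_uninformative = compute_information_gain(toy["coin"], toy["rating_category"])
```

A feature with a higher information gain reduces more of the uncertainty about the rating category, which is why it is a natural ranking criterion for feature selection.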
The Final.ipynb notebook encapsulates the entire project workflow, from data preprocessing to the implementation and evaluation of machine learning models for bookstore management. Below is a breakdown of the contents and functionality of this notebook:
The initial section of the notebook focuses on preparing the raw dataset for analysis and modeling. Key preprocessing steps include:
-
Missing Data Handling:
- Abnormal data points, such as extreme ages, are removed.
- Missing age values are imputed based on the distribution of age groups in the dataset.
- Missing location data (country, city, state) is imputed with valid and realistic values.
-
Text Processing:
- Standardization of country names and author information.
- Removal of special characters and normalization of text data.
-
Data Integration and Discretization:
- Discretization of ages and ratings into meaningful categories.
- Merging of datasets and final validation for missing values.
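The missing-data steps above could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's code: the helper names `age_group_weights`, `impute_ages`, and `impute_from_known` are hypothetical, ages above 100 are treated as missing and re-imputed (one possible reading of "handled consistently"), and the sample users are made up.

```python
import numpy as np
import pandas as pd


def age_group_weights(ages: pd.Series) -> dict:
    """Share of each 10-year age group among the valid (0-100) ages."""
    valid = ages[ages.between(0, 100)]
    # Key each group by its lower bound; age 100 is folded into the 90s group.
    starts = (valid // 10 * 10).clip(upper=90)
    return starts.value_counts(normalize=True).to_dict()


def impute_ages(ages: pd.Series, rng: np.random.Generator) -> pd.Series:
    """Fill missing ages by sampling an age group according to its weight."""
    ages = ages.astype("float64").copy()
    ages[ages > 100] = np.nan  # assumption: ages above 100 count as missing
    weights = age_group_weights(ages)
    missing = ages.isna()
    n_missing = int(missing.sum())
    starts = rng.choice(list(weights.keys()), size=n_missing,
                        p=list(weights.values()))
    # Pick a uniformly random age inside the sampled 10-year group.
    ages[missing] = starts + rng.integers(0, 10, size=n_missing)
    return ages


def impute_from_known(values: pd.Series, rng: np.random.Generator) -> pd.Series:
    """Fill missing entries with values drawn from those already present."""
    values = values.copy()
    missing = values.isna()
    known = values.dropna().unique()
    values[missing] = rng.choice(known, size=int(missing.sum()))
    return values


rng = np.random.default_rng(0)
users = pd.DataFrame({
    "User-Age": [25, np.nan, 34, 250, 41, np.nan],
    "User-Country": ["usa", None, "canada", "usa", None, "germany"],
})
users["User-Age"] = impute_ages(users["User-Age"], rng)
users["User-Country"] = impute_from_known(users["User-Country"], rng)
```

Sampling by group weight keeps the imputed age distribution close to the observed one, whereas filling every gap with a single mean or median would create an artificial spike.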
This section details the implementation of a decision tree model for predicting book ratings. Key steps include:
-
Feature Selection:
- Calculation of Information Gain to select the most discriminative features.
-
Data Preparation:
- Encoding of categorical features using OrdinalEncoder.
-
Model Training:
- Utilization of an entropy-based decision tree model for training.
-
Cross Validation:
- Ten-fold cross-validation for model validation and evaluation.
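The modeling steps above can be sketched with scikit-learn as follows. The data here is randomly generated placeholder data, and the column names are assumptions, so the resulting scores say nothing about the project's actual accuracy; the sketch only shows how the pieces fit together.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: random categorical features and a random target.
rng = np.random.default_rng(0)
n_rows = 300
data = pd.DataFrame({
    "age_group": rng.choice(["10-19", "20-29", "30-39", "40-49"], size=n_rows),
    "user_country": rng.choice(["usa", "canada", "germany"], size=n_rows),
    "book_author": rng.choice(["author a", "author b", "no author"], size=n_rows),
    "rating_category": rng.choice(["low", "medium", "high"], size=n_rows),
})
X = data.drop(columns="rating_category")
y = data["rating_category"]

# Encode categories as integers, then fit an entropy-based decision tree.
# Categories unseen during fitting are mapped to -1 instead of raising.
model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    DecisionTreeClassifier(criterion="entropy", random_state=0),
)

# Ten-fold cross-validation; each fold refits the encoder and the tree.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()
```

Wrapping the encoder and the tree in one pipeline means the encoder is fitted only on each training fold, which avoids leaking information from the validation fold.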
The following part of the notebook focuses on book segmentation using K-means clustering. Key steps include:
-
Selection of K-means Clustering:
- Choice of K-means due to its efficiency and clear cluster boundaries.
-
Encoding of Book Titles:
- Utilization of Bag-of-Words (BoW) technique for numerical representation.
-
Preprocessing of Book Titles:
- Removal of punctuation and stop-words, and conversion to lowercase.
-
Execution of Elbow Method and K-means Clustering:
- Determination of optimal cluster number using the elbow method.
- Execution of K-means clustering to group similar books.
-
Recommendation Strategy:
- Selection of high-rated books within clusters for tailored recommendations.
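A minimal sketch of the clustering and recommendation steps above, using scikit-learn. The titles and ratings are made up for illustration, and the `mean_rating` column, the rating threshold, and the chosen number of clusters are all hypothetical; in practice the number of clusters would be read off the elbow plot.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles and ratings, for illustration only.
books = pd.DataFrame({
    "Book-Title": [
        "Cooking for Beginners",
        "The Joy of Cooking Pasta",
        "A Short History of Rome",
        "Rome: History of an Empire",
        "Garden Design Basics",
        "The Small Garden Handbook",
    ],
    "mean_rating": [8.5, 6.0, 9.0, 7.5, 5.0, 8.0],
})

# Bag-of-Words encoding: lowercases text, drops English stop-words, and the
# default tokenizer ignores punctuation.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(books["Book-Title"])

# Elbow method: record the within-cluster sum of squares for each k.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 5)
}

chosen_k = 3  # hypothetical choice; normally taken from the elbow plot
books["cluster"] = KMeans(
    n_clusters=chosen_k, n_init=10, random_state=0
).fit_predict(X)

# Recommend only highly rated books, grouped by cluster.
rating_threshold = 7  # hypothetical cutoff for "high-rated"
recommendations = (
    books[books["mean_rating"] >= rating_threshold]
    .groupby("cluster")["Book-Title"]
    .apply(list)
    .to_dict()
)
```

A customer who liked a book in one cluster would then be shown the highly rated titles from that same cluster.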
-
ConfusionMatrix.png:
- This figure shows the confusion matrix of the decision tree model.
-
DistributionOfAgeGroups.png:
- This figure shows the distribution of users' age groups after imputation and discretising.
-
DistributionOfAgesRawData:
- This figure shows the distribution of users' ages in the raw data.
-
DistributionOfRatingCate.png:
- This figure shows the distribution of book rating categories (after discretising) in the training set and the test set.
-
DistributionOfRatings.png:
- This figure shows the distribution of book ratings.
-
K-ElbowForBookTitles.png:
- This figure shows the elbow plot used to choose the number of clusters when applying K-Means to book titles.
BX-Books.csv
- ISBN: International Standard Book Number, a unique identifier for books.
- Book-Title: Title of the book.
- Book-Author: Author(s) of the book.
- Year-Of-Publication: Year when the book was published.
- Book-Publisher: Publisher of the book.
- Total Rows: 18,185
BX-Ratings.csv
- User-ID: Unique identifier for users.
- ISBN: International Standard Book Number, a unique identifier for books.
- Book-Rating: Rating given by users to books.
- Total Rows: 204,146
BX-Users.csv
- User-ID: Unique identifier for users.
- User-City: City where the user is located.
- User-State: State where the user is located.
- User-Country: Country where the user is located.
- User-Age: Age of the user.
- Total Rows: 48,299
BX-NewBooks.csv
- ISBN: International Standard Book Number, a unique identifier for books.
- Book-Title: Title of the book.
- Book-Author: Author(s) of the book.
- Year-Of-Publication: Year when the book was published.
- Book-Publisher: Publisher of the book.
- Total Rows: 8,924
BX-NewBooks-Ratings.csv
- User-ID: Unique identifier for users.
- ISBN: International Standard Book Number, a unique identifier for books.
- Book-Rating: Rating given by users to books.
- Total Rows: 26,772
BX-NewBooks-Users.csv
- User-ID: Unique identifier for users.
- User-City: City where the user is located.
- User-State: State where the user is located.
- User-Country: Country where the user is located.
- User-Age: Age of the user.
- Total Rows: 8,520
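The ratings, books, and users tables described above share the ISBN and User-ID keys, so they can be merged into one table before discretisation. The sketch below uses tiny in-memory stand-ins for the CSV files; the rating cutoffs and the `Age-Group` and `Rating-Category` column names are assumptions, not values taken from the project.

```python
import pandas as pd

# Tiny stand-ins for BX-Ratings.csv, BX-Books.csv and BX-Users.csv.
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["isbn-a", "isbn-b", "isbn-a"],
    "Book-Rating": [3, 8, 6],
})
books = pd.DataFrame({
    "ISBN": ["isbn-a", "isbn-b"],
    "Book-Title": ["title a", "title b"],
    "Book-Author": ["author a", "no author"],
})
users = pd.DataFrame({
    "User-ID": [1, 2],
    "User-Age": [25, 47],
})

# One row per rating, enriched with book and user information.
merged = (
    ratings
    .merge(books, on="ISBN", how="inner")
    .merge(users, on="User-ID", how="inner")
)

# Ages into 10-year bins covering 0 to 100.
merged["Age-Group"] = pd.cut(
    merged["User-Age"], bins=list(range(0, 101, 10)), include_lowest=True
)

# Ratings into three categories; the cut points 4 and 7 are hypothetical.
merged["Rating-Category"] = pd.cut(
    merged["Book-Rating"],
    bins=[0, 4, 7, 10],
    labels=["low", "medium", "high"],
    include_lowest=True,
)
```

With real data, the same calls would follow `pd.read_csv` on each file, and a final `merged.isna().sum()` check would confirm that no missing values remain after the merge.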