
Web Scraping for Data Analysis & Predictive Model on Customer's Data by Tech Titans

Web Scraping for Data Analysis

Web Scraping

Web scraping was employed to gather customer reviews of Air India from the Airline Quality website (airlinequality.com). Customer comments, ratings, and other relevant information were extracted and compiled into the "Reviews Dataset" for further analysis, such as predicting customer buying behavior or understanding customer sentiment towards Air India's services.
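A minimal sketch of such a scraper with Requests and BeautifulSoup. The paginated URL pattern and the `text_content` CSS class for review bodies are assumptions about airlinequality.com, not confirmed from the repository:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

BASE_URL = "https://www.airlinequality.com/airline-reviews/air-india"

reviews = []
for page in range(1, 11):  # first 10 pages of reviews
    # Paginated listing; a large page size keeps the number of requests small.
    url = f"{BASE_URL}/page/{page}/?sortby=post_date%3ADesc&pagesize=100"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    # Each review body is assumed to sit in a div with class "text_content".
    for item in soup.find_all("div", class_="text_content"):
        reviews.append(item.get_text(strip=True))

df = pd.DataFrame({"reviews": reviews})
df.to_csv("reviews_dataset.csv", index=False)
```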

Data Preprocessing

Data preprocessing is a crucial step in the data mining process, involving cleaning, transforming, and integrating data for analysis. Its goal is to improve the quality of the data and its suitability for the task at hand.

Data Cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

  • Removal of the text before the '|' character in each review
  • Removal of all special characters from the review text
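A short pandas/regex sketch of these two cleaning steps; the CSV filename and the `reviews` column name carry over from the scraping sketch above and are assumptions:

```python
import re
import pandas as pd

df = pd.read_csv("reviews_dataset.csv")

# Keep only the text after the first '|' (drops any prefix before the review body).
df["reviews"] = df["reviews"].str.split("|", n=1).str[-1]

# Strip special characters, keeping letters, digits, and whitespace.
df["reviews"] = df["reviews"].apply(lambda s: re.sub(r"[^A-Za-z0-9\s]", " ", str(s)))

# Collapse the repeated whitespace left behind by the substitutions.
df["reviews"] = df["reviews"].str.replace(r"\s+", " ", regex=True).str.strip()
```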

Tokenization

  • Tokenization is the process of dividing text into meaningful pieces (tokens).
  • Tokens are converted to (word, POS-tag) tuples via POS tagging and then reduced to their base forms through lemmatization.
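A sketch of this pipeline with NLTK; the `treebank_to_wordnet` helper is a hypothetical name introduced here to map Penn Treebank tags onto WordNet's POS constants:

```python
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

def treebank_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS constant."""
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()

def lemmatize_review(text):
    tokens = nltk.word_tokenize(text)   # split the text into tokens
    tagged = nltk.pos_tag(tokens)       # (word, POS-tag) tuples
    return [lemmatizer.lemmatize(word.lower(), treebank_to_wordnet(tag))
            for word, tag in tagged]

print(lemmatize_review("The flights were delayed and the seats felt cramped"))
# ['the', 'flight', 'be', 'delay', ...]
```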

Sentiment Analysis

Sentiment analysis is the process of analyzing digital text to determine if the emotional tone of the message is positive, negative, or neutral.

VADER

  • VADER (Valence Aware Dictionary and sEntiment Reasoner) is an NLTK module that provides sentiment scores based on the words used.
  • It is a rule-based sentiment analyzer in which terms are labeled according to their semantic orientation as positive, negative, or neutral.
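A minimal usage sketch with NLTK's SentimentIntensityAnalyzer. The ±0.05 thresholds on the compound score are a common convention, not necessarily the one used in this project:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # lexicon used by VADER

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The crew was friendly but the flight was badly delayed.")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# Common convention: compound >= 0.05 -> positive, <= -0.05 -> negative, else neutral.
label = ("positive" if scores["compound"] >= 0.05
         else "negative" if scores["compound"] <= -0.05
         else "neutral")
print(label)
```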

Data Visualization

Data visualization uses graphics like charts, plots, infographics, and animations to represent complex data relationships and provide easy-to-understand insights.

via Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
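For illustration, a small Matplotlib bar chart of sentiment counts; the numbers here are made-up placeholders (in the project they would come from the labeled reviews):

```python
import matplotlib.pyplot as plt

# Placeholder sentiment counts for illustration only.
labels = ["positive", "neutral", "negative"]
counts = [120, 45, 85]

plt.bar(labels, counts, color=["green", "grey", "red"])
plt.title("Sentiment of Air India reviews")
plt.xlabel("Sentiment")
plt.ylabel("Number of reviews")
plt.tight_layout()
plt.show()
```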

via WordCloud

A word cloud is a visualization technique for representing word frequency in a text: the larger a word appears, the more often it occurs.
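A minimal WordCloud sketch over the cleaned reviews; the filename and column name again carry over from the earlier sketches and are assumptions:

```python
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

df = pd.read_csv("reviews_dataset.csv")
text = " ".join(df["reviews"].astype(str))  # all reviews as one string

wc = WordCloud(width=800, height=400, background_color="white",
               max_words=200).generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")  # the rendered image is the whole visualization
plt.show()
```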

Predictive Modelling on Customer's Data

Predictive models are machine learning algorithms trained on customer data; the data must be carefully prepared and manipulated so that the models can accurately predict the target outcomes.

Exploratory Data Analysis

  • Exploratory Data Analysis is a crucial step in the data analysis process, where the primary goal is to understand the data, gain insights, and identify patterns or relationships between variables.
  • The Chardet library (Universal Character Encoding Detector) was used to detect the CSV file's character encoding before loading it, after which the data was checked for null values.
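A sketch of this step; the filename `customer_data.csv` is a placeholder for the project's actual customer dataset:

```python
import chardet
import pandas as pd

# Detect the file's character encoding from a sample of raw bytes.
with open("customer_data.csv", "rb") as f:
    result = chardet.detect(f.read(100_000))
print(result)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}

# Load the CSV with the detected encoding and check for missing values.
df = pd.read_csv("customer_data.csv", encoding=result["encoding"])
print(df.isnull().sum())
```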

Mutual Information graphs

  • MI score graphs visualize feature relevance to the target variable, measuring dependency and aiding feature selection.
  • The scikit-learn (sklearn) library computes the mutual information score between each feature and the target, and the scores are plotted as a graph for visualization.
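A sketch using scikit-learn's `mutual_info_classif`; the `target` column name is a placeholder for the project's actual label:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("customer_data.csv")  # placeholder filename

# Features must be numeric here; "target" stands in for the actual label column.
X = df.drop(columns=["target"]).select_dtypes("number")
y = df["target"]

mi_scores = pd.Series(
    mutual_info_classif(X, y, random_state=0), index=X.columns
).sort_values()

mi_scores.plot.barh(figsize=(8, 6), title="Mutual information scores")
plt.xlabel("MI score")
plt.tight_layout()
plt.show()
```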

Test and Train Model

  • Test and train split is a crucial step in building and evaluating machine learning models, dividing datasets into training and test sets.
  • Training sets typically contain 70-80% of the data, while test sets hold the remaining 20-30%.
  • The code splits the data into training, validation, and test sets so that the model is trained, tuned, and evaluated on different subsets, which guards against overfitting and gives a more reliable estimate of performance.
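A sketch of a three-way split with `train_test_split`, continuing from the `X` and `y` defined above; the 60/20/20 proportions are illustrative, as the repository's exact ratios are not stated:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% as the final test set, then split the rest 75/25,
# giving 60% train / 20% validation / 20% test overall.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)
```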

MinMaxScaler

Min-Max Scaling is a preprocessing technique for scaling numerical features to a fixed range, ensuring consistent scaling across all features.
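A sketch, continuing from the split above. The scaler is fitted on the training set only, so no information from the validation or test sets leaks into training:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # default feature range is [0, 1]

# Fit on the training data only, then apply the same scaling everywhere.
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
```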

via Random Forest Classifier

Random Forest is an ensemble learning method combining multiple decision trees, capturing complex relationships and interactions for more accurate and robust models.

  • For top-6 features (Accuracy = 74.5762%)
  • For all features (Accuracy = 71.1864%)
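A minimal training-and-evaluation sketch, continuing from the scaled splits above; the hyperparameters are illustrative, not necessarily the repository's:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)

# Accuracy on the validation set.
val_pred = rf.predict(X_val_scaled)
print(f"Validation accuracy: {accuracy_score(y_val, val_pred):.2%}")
```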

via XGBoost (Extreme Gradient Boosting) Classifier

XGBoost is a popular machine learning algorithm utilizing gradient boosting to optimize model performance and computational efficiency.

  • For top-6 features (Accuracy = 71.1864%)
  • For all features (Accuracy = 71.1864%)
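An equivalent sketch with the `xgboost` package, again with illustrative hyperparameters:

```python
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

xgb = XGBClassifier(n_estimators=100, learning_rate=0.1,
                    eval_metric="logloss", random_state=42)
xgb.fit(X_train_scaled, y_train)

val_pred = xgb.predict(X_val_scaled)
print(f"Validation accuracy: {accuracy_score(y_val, val_pred):.2%}")
```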

Validate Model

Validating the model on the test dataset is an essential step in the machine learning workflow to assess how well the model performs on unseen data.

  • Accuracy = 71.1864%

Conclusion

The Random Forest classifier trained on the top 6 features showed slightly higher accuracy than XGBoost and can be used to predict customer satisfaction or other target variables in similar datasets. Performance may vary depending on the quality and representativeness of the data.

Libraries Utilized

  • BeautifulSoup (bs4)
  • Chardet
  • Matplotlib
  • Natural Language Toolkit (nltk)
  • NumPy (np)
  • Pandas (pd)
  • Requests
  • Seaborn (sns)
  • Scikit-learn (sklearn)
  • VaderSentiment (SentimentIntensityAnalyzer)
  • Warnings
  • WordCloud

About

The objective of the Data Analytics internship at CSRBOX is to provide interns with hands-on experience in applying data analytics techniques to real-world projects in the field of corporate social responsibility (CSR). Interns gain practical skills in data collection, cleaning, analysis, visualization, and reporting while working on these projects.
