I’m a data analyst, and the Chief Marketing Officer has told me that previous marketing campaigns have not been as effective as they were expected to be. I need to analyze the data set to understand this problem and propose data-driven solutions.
- Performed
Data Exploration
andSegmentation
to study the key characteristic features ofPotential customers
using Python. - Executed
Hypothesis Testing
,Statistical Analysis
,Anomaly Detection and Trend Analysis
, andFeature Engineering
to identify sequential demand patterns and Customer Behavior. - Calculated the top 10 Informative features using
Random Forest Feature Importance
. - Detected
Successful Marketing Channels
andUnderperforming Marketing Campaigns
. - Designed Interactive dashboards using
Flourish
to analyze the trends, give recommendations and publish the findings.
The dataset for this project is provided by Dr. Omar Romero-Hernandez. It is licensed as CC0: Public Domain.
The dataset is collected from kaggle - https://www.kaggle.com/datasets/jackdaoud/marketing-data?taskId=2986
- ID: Customer’s unique identifier
- Year_Birth: Customer’s birth year
- Education: Customer’s education level
- Marital_Status: Customer’s marital status
- Income: Customer’s yearly household income
- Kidhome: Number of children in customer’s household
- Tennhome: Number of teenagers in customer’s household
- Dt_Customer: Date of customer’s enrollment with the company
- Recency: Number of days since customer’s last purchase
- MntWines: Amount spent on wine in the last 2 years
- MntFruits: Amount spent on fruits in the last 2 years
- MntMeatProducts: Amount spent on meat in the last 2 years
- MntFishProducts: Amount spent on fish in the last 2 years
- MntSweetProducts: Amount spent on sweets in the last 2 years
- MntGoldProds: Amount spent on gold in the last 2 years
- NumDealsPurchase: Number of purchases made with a discount
- NumWebPurchase: Number of purchases made through the company’s website
- NumCatalogPurchase: Number of purchases made using a catalog
- NumStorePurchase: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 2 if customer accepted the offer in the 1st campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
- Complain: 1 if a customer complained in the last 2 years, 0 otherwise
- Country: Customer’s location
-
Renamning Columns
-
String Manipulations
-
Missing values (Imputing with Median)
-
DataType Conversion
-
Outlier Analysis (Replacing with median)
-
Feature Engineering
-
Anomaly Detection and Trend Analysis
-
Correlation Analysis
-
Calculating
r
andP-values
-
Feature Importance (Using Random Forest)
We can now see that the top 7 factors are:
- Average order volume
- Total amount spent in the last two years
- Total number of purchases in the last two years
- Amount spent on wine in the last 2 years
- Number of purchases made using a catalog
- Number of visits to company's web site in the last month
- Total number of purchases through website in the last two years
However, we can’t tell whether each factor is positively or negatively correlated to the number of store purchases. We can use SHAP to explain it.
Finding:
- The number of store purchases increases with the higher total amount spent(Total_Mnt), higher total purchase amount(Total_num_purchase), higher AOV, and higher amount of wines purchases(MntWines).
- The number of store purchases decreases with more website visits(NumWebVisitsMonth), a higher number of purchases through the catalog(NumCatalogPurchases), and a higher number of purchases through websites(NumWebPurchases).
Summary:
People who mostly shop at stores tend to buy more wines, have a higher average order volume, and shop less through the internet or catalog.
As we can see, there is a very vague trend that says as MntGoldProds increases, NumStorePurchases also increases. Now, let’s look at the correlation test.
- Pearson correlation (r): 0.38326418634704296
- Pearson p-value: 3.4668974417790955e-79
We got a Pearson correlation of 0.38 and a p-value of almost zero, which states that they are statistically significant and have a positive correlation. (If the p-value is > 0.05, we will fail to reject the null hypothesis, where they do not correlate.)
This plot shows that the rest of the customers spent more on fish products as its 50th percentile is higher than the married Ph.D. group.
Response means the last marketing campaign, which is the most successful one. It performed nearly twice as well as the previous campaigns, except campaign 2.
An average customer...
- has an annual income of 52200 dollars
- had purchased 49 days ago
- has an AOV of 26.8 dollars
- has spent 605 dollars
- has purchased 20 times
- became a customer in mid-June
- became a customer on Thursday
- spent most on wines(300 dollars) and then * meat products(165 dollars)
- spent least on fruit(26 dollars) and sweet products(27 dollars)
- Logistic Regression -
Fitting 10 folds for each of 20 candidates, totalling 200 fits
{'C': 1.1, 'penalty': 'l1', 'solver': 'liblinear'}
Logistic Regression Train score: 87.5
Logistic Regression Cross validation score: 89.05527915975678
Testing Accuracy: 85.71428571428571
- Support Vector Machine -
Fitting 10 folds for each of 60 candidates, totalling 600 fits
{'C': 0.8, 'degree': 3, 'kernel': 'linear'}
Support Vecktor Classification Cross validation score: 88.98231066887783
Testing Accuracy: 86.49553571428571
- XGBoost -
Fitting 10 folds for each of 24 candidates, totalling 240 fits
{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}
XGB Classifier Cros validation score: 89.2072968490879
Testing Accuracy: 86.49553571428571
- Random Forest Regressor -
Fitting 5 folds for each of 36 candidates, totalling 180 fits
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=5),
param_grid={'criterion': ['gini'], 'max_depth': [None, 3, 5, 8],
'max_features': ['auto'],
'min_samples_split': [2, 3, 4],
'n_estimators': [50, 100, 200]},
verbose=1)
Testing Accuracy: 0.8616071428571429
We have computated feature importance using Recurssive Feature Elimination method (along with the best params selected for random forest regressor).
Patterns:
-
High-Income People — tend to spend more and purchase more. — tend to visit the company’s website less frequently than other people. — tend to has few numbers of purchases made with a discount
-
People having kids at home — tend to spend less and purchase less. — tend to has a high number of purchases made with a discount
-
People who purchased with high average order volume — tend to buy more wines and meat products — tend to make a high number of purchases made using a catalog — tend not to visit the company’s website.
Anomalies:
- Intuitively, I’d think the more complaints a customer has, the less they may spend on our store, but the number of complaints in the last two years has almost no correlation with the total amount spent in the last two years.
After further investigating the data, I found that it is because we only have 20 customers who complained in the last two years, but we have 2200 customers in total. So, because of the imbalanced ratio, they don’t correlate. The customer service department in the company has done a wonderful job in the last two years.