🛩️ Airline Satisfaction Predictors 🚁

The airline industry is undeniably massive with an annual revenue exceeding $800B and 6M travelers per day. A team of four computer engineers from Cairo university have taken it upon themselves to find out a recipe for the perfect airline company by answering the paramount question “What makes airline customers satisfied?”. The question is posed as both a data analysis problem and a machine learning problem that together answer it via exploratory analytics, association rule mining and predictive models.

🚀 Pipeline

Our approach to said problem utilized the following pipeline

📂 Folder Structure

The following is the implied folder structure:

.
├── DataFiles
│ ├── airline-train.csv
│ └── airline-val.csv
├── DataPreparation
│ ├── DataPreparation.py
│ └── Visualization.ipynb
└── ModelPipelines
├── Apriori
│ ├── Apriori.ipynb
│ └── rules.txt
├── NaiveBayes
│ ├── NaiveBayes.ipynb
│ └── NaiveBayes.py
├── RandomForest
│ ├── RandomForest.ipynb
│ └── RandomForest.py
└── SVC
├── SVC.ipynb
└── SVC.py

🚡 Running the Project

pip install requirements.txt
# To run any stage of the pipeline, consider the stage's folder. There will always be a demonstration notebook.

We have also run the project on the cloud via Azure ML studio. You may contact the developers if you find any issue in that.

🛫 Data Preparation

We used Kaggle's Airline satisfaction dataset as shown below

	Age	Flight Distance	Departure Delay in Minutes	Arrival Delay in Minutes	Gender	Customer Type	Type of Travel	Class	Inflight wifi service	Departure/Arrival time convenient	Ease of Online booking	Gate location	Food and drink	Online boarding	Seat comfort	Inflight entertainment	On-board service	Leg room service	Baggage handling	Checkin service	Inflight service	Cleanliness
0	13	460	25	18.0	Male	Loyal Customer	Personal Travel	Eco Plus	3	4	3	1	5	3	5	5	4	3	4	4	5	5
1	25	235	1	6.0	Male	disloyal Customer	Business travel	Business	3	2	3	3	1	3	1	1	1	5	3	1	4	1
2	26	1142	0	0.0	Female	Loyal Customer	Business travel	Business	2	2	2	2	5	5	5	5	4	3	4	4	4	5

Our data preparation module was implemented using Pandas and PySpark and supports the following:

Reading a specific split of the data (training or validation)
Reading specific column types from the data (numerical, ordinal or categorical)
Frequency Encoding for categorical features
Dropping missing values
Imputing numerical outliers Alternatives for the function were implemented as well in case any model required further special preprocessing.

🎨 Exploratory Data Analytics

Instead of querying the dataset for specific facts, we priotized that it should tell us all the facts. In other words, we have let the data speak for itself and for that we designed the following analysis workflow

Analysis Stage	Components
Univariate Analysis	Prior Class Distribution
	Basics Features Involved
	Missing Data Analysis
	Central Tendency & Spread
	Feature Distirbutions
	Feature Distributions per Class
Correlations & Associations	Dependence between Categorical Features
	Correlations between Categorical & Numerical Features
	Monotonic Association between Ordinal Variables
	Correlations between Numerical Features
	Naive Bayes Assumption
Multivariate Analysis	Separability & Distribution of Numerical Features
	Separability & Distribution of Numerical Feature Pairs
	Separability & Distribution of Numerical Feature Trios
	Seprability and Distribution of Numerical and Categorical Pairs
	Seprability and Distribution of Categorical Pairs

In the following we will take you through a cursory glance of the workflow, the set of insights that each visual corresponds to (over 50 insights in total) can be found in the demonstration notebook or the report along with full versions of the visuals and tables.

🪂 Univariate Analysis

◉ Prior Class Distribution

In this we study if there is any imbalance among the satisfaction levels of people involved in the dataset.

◉ Basics of Each Feature

This provides a description, type and possible values for each feature.

Variable Name	Variable Description	Variable Type	Values
Gender	Gender of the passengers	Nominal	Female, Male
Customer Type	The customer type	Nominal	Loyal customer, Disloyal customer
Age	The actual age of the passengers	Numerical	-
Type of Travel	Purpose of the flight of the passengers	Nominal	Personal Travel, Business Travel
Class	Travel class in the plane of the passengers	Nominal	Business, Eco, Eco Plus
Flight Distance	The flight distance of this journey	Numerical	-
Inflight wifi service	Satisfaction level of the inflight wifi service	Ordinal	1, 2, 3, 4, 5
Departure/Arrival time convenient	Satisfaction level of Departure/Arrival time convenient	Ordinal	1, 2, 3, 4, 5
Ease of Online booking	Satisfaction level of online booking	Ordinal	1, 2, 3, 4, 5
Gate location	Satisfaction level of Gate location	Ordinal	1, 2, 3, 4, 5
Food and drink	Satisfaction level of Food and drink	Ordinal	1, 2, 3, 4, 5
Online boarding	Satisfaction level of online boarding	Ordinal	1, 2, 3, 4, 5
Seat comfort	Satisfaction level of Seat comfort	Ordinal	1, 2, 3, 4, 5
Inflight entertainment	Satisfaction level of inflight entertainment	Ordinal	1, 2, 3, 4, 5
On-board service	Satisfaction level of On-board service	Ordinal	1, 2, 3, 4, 5
Leg room service	Satisfaction level of Leg room service	Ordinal	1, 2, 3, 4, 5
Baggage handling	Satisfaction level of baggage handling	Ordinal	1, 2, 3, 4, 5
Check-in service	Satisfaction level of Check-in service	Ordinal	1, 2, 3, 4, 5
Inflight service	Satisfaction level of inflight service	Ordinal	1, 2, 3, 4, 5
Cleanliness	Satisfaction level of Cleanliness	Ordinal	1, 2, 3, 4, 5
Departure Delay in Minutes	Minutes delayed when departure	Numerical	-
Arrival Delay in Minutes	Minutes delayed when Arrival	Numerical	-
Satisfaction	Airline satisfaction level	Nominal	Satisfaction, Neutral, Dissatisfaction

◉ Missing Data Analysis

In this, we analyze each feature for missing data.

	Age	Flight Distance	Departure Delay in Minutes	Arrival Delay in Minutes	Gender	Customer Type	Type of Travel	Class	Inflight wifi service	Departure/Arrival time convenient	Ease of Online booking	Gate location	Food and drink	Online boarding	Seat comfort	Inflight entertainment	On-board service	Leg room service	Baggage handling	Checkin service	Inflight service	Cleanliness
Missing Count	0	0	0	310	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

◉ Central Tendency and Spread

For each type of feature, we consider measures of central tendency and spread. This is an example for numerical features.

	Age	Flight Distance	Departure Delay in Minutes	Arrival Delay in Minutes
count	103904.000000	103904.000000	103904.000000	103594.000000
mean	39.379706	1189.448375	14.815618	15.178678
std	15.114964	997.147281	38.230901	38.698682
min	7.000000	31.000000	0.000000	0.000000
25%	27.000000	414.000000	0.000000	0.000000
50%	40.000000	843.000000	0.000000	0.000000
75%	51.000000	1743.000000	12.000000	13.000000
max	85.000000	4983.000000	1592.000000	1584.000000

◉ Feature Distributions

Here we analyze the distributions of each feature to look for any special patterns.

◉ Feature Distributions per Class

This provides the same analysis as above, but per class.

🪂🪂 Correlations

◉ Dependence between Categorical Features

We have used the Chi-square test of independence which tests if there is a relationship between two categorical variables. In particular, we have that

$H_0$: The two categorical variables are independent.

$H_1$: The two categorical variables are dependent.

Here, we set $\alpha = 0.05$ and hence, if the p-value for the test done on two variables is less than $0.05$, we reject the null hypothesis and conclude that the two variables are dependent.

	Gender	Type of Travel	Class
Gender	0.0	0.026398	0.000119
Customer Type	0.0	0.0	0.0
Type of Travel	0.026398	0.0	0.0
Class	0.000119	0.0	0.0

◉ Associationss between Categorical & Numerical Features

We have used Pearson's correlation ratio to find associations between all possible numerical and categorical features

◉ Monotonic Association between Ordinal Variables

We have use Spearman's to find monotonic associations between ordinal variables

◉ Correlations between Numerical Features

We have use Pearson's to study linear correlation between numerical variables

◉ Naive Bayes Assumption

Related to dependence is also testing the Naive bayes assumption. We hae provided an automated way for that with the sample output:

As expected, the Naive Bayes assumption does not hold. In particular, we have that $$P(x_1, x_2, ...|C_1=0)=0.16$$ as computed numerically using the definition of the probability. Meanwhile, applying the Naive Bayes assumption we have that $$P(x_1, x_2, ...|C_1=0)=P(x_1|C_1=0)P(x_2|C_1=0)...=0.08$$ which is different from the correct probability.

Likewise, for the class $C_1=1$ we have that $$P(x_1, x_2, ...|C_1=1)=0.07$$ but $$P(x_1, x_2, ...|C_1=1)=P(x_1|C_1=1)P(x_2|C_1=1)...=0.04$$ which is different from the correct probability.

This does not stop up from using the model as research has demonstrated that it can be robust to the assumption.

🪂🪂🪂 Multivariate Analysis

◉ Seperability and Distribution of Numerical Features

We prepared box plots to study the distribution of the numerical features for each class in the target

◉ Seperability and Distribution of Numerical Pairs

Then we started analyzing all pairs to see if separability gets better

◉ Seperability and Distribution of Numerical Trios

Then we considered all trios

◉ Seprability and Distribution of Numerical and Categorical Pairs

We then sought interaction with all possible numerical features. The plot is much longer in the notebook.

◉ Seprability and Distribution of Categorical Pairs

And interaction among all possible categorical features. You must have enough RAM to see the full version in the notebook.

🗿 Model Building & Evaluation

The abundance of categorical features and their significance as shown above has inspired considering Naive Bayes and Random Forest as predictive models. We later follow up with an SVM model as ordinal features can be assumed as numerical as well. But before employing such models we considered topping off our exploratory data analytics with association rule learning.

🔮 Apriori Model

For this, numerical features (excluding extremely skewed ones) were converted to categorical ones by binning and then all the categorical features were one-hot encoded.

The following shows a sample of the strongest rules found:

	antecedents	consequents	support	confidence	lift
1	Class = Eco	satisfaction = neutral or dissatisfied	0.366146	0.813862	1.436226
2	Type of Travel = Business travel, Customer Type = Loyal Customer	satisfaction = satisfied	0.358793	0.705553	1.628201
3	Age = adult, Type of Travel = Business travel	satisfaction = satisfied	0.388281	0.598531	1.381228

Where the corresponding graphic over all strong rules is

This is color plot of support against the confidence where the color represents lift and the number refers to a rule in the table.

🎲 Naive Bayes

For Naive Bayes, we started with preprocessing the data by converting the string categories to integers and bucketizing the four numerical variables based on the 10 percentiles (10%, 20%, 30%,...) so that they can be treated similar to categorical variables. We followed by implementing NaiveBayes on PySpark from scratch:

Since it holds by Bayes rule: $$P(C \mid A=(a_1,a_2,\ldots,a_M)) = \frac{P(A=(a_1,a_2,\ldots,a_M) \mid C)P(C)}{P(A=(a_1,a_2,\ldots,a_M))}$$

Since it holds by the naive assumption: $$P(A=(a_1,a_2,\ldots,a_M) \mid C) = \prod_{m=1}^{M} P(a_m \mid C)$$

By the constant denominator in Bayes: $$P(C_i \mid A)=(a_1,a_2,\ldots,a_M) \propto P(A=(a_1,a_2,\ldots,a_M) \mid C)P(C)$$

Hence, the most likely class is given by: $$C = \max_{1 \leq i \leq K} { P(C_i) \prod_{m=1}^{M} P(a_m \mid C_i)}$$

where $P(C_i)$ and $P(a_m \mid C_i)$ are easilt computed by counting.

This has yields an accuracy of $89.1%$ in terms of predicting customer satisfaction.

🌲 Random Forest

We also considered initiating a Random Forest model which did not further require any special processing (beyond NB). Luckily, PySpark’s RandomForest inherently supports both categorical and numerical features after applying the model the perceived accuracy on the validation set was $96%$.

We analyzed the average feature importance set by trees in the forest to yield

📏 SVM Model

We topped off with a linear SVM model and hyperparameter search but results were not as significant as the random forest.

🛬 Result Interpreation

The totality of the analyses above have led us conclude the following:

There is a lack of satisfaction in airline travel experience (imbalance)
Such lack is focused on economy travelers
Wifi Service, Entertainment and OnlineBoarding are key determinants of satisfaction
Comfort and ease of booking also matter
Distance and delays seem to have a less adverse effect

👥 Collaborators

📈 Progress Tracking

We have utilized Notion for progress tracking and task assignment among the team.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.github/workflows		.github/workflows
Assets		Assets
DataFiles		DataFiles
DataPreparation		DataPreparation
ModelBuilding		ModelBuilding
ModelPipelines		ModelPipelines
.gitignore		.gitignore
Presentation.pptx		Presentation.pptx
README.md		README.md
Report.pdf		Report.pdf
requirements.txt		requirements.txt

NouranHany/Airline-Satisfaction-Predictors

Folders and files

Latest commit

History

Repository files navigation