Dataset: This project aims to predict behaviour patterns in young adults using a supervised learning approach. The Young People Survey dataset comes from a survey conducted in Slovakia (see About the Dataset). The dataset is split into 2 CSV files.
Objective:
The following questions need to be answered:
- Given the music preferences, do people make up any clusters of similar behavior?
- Do women fear certain phenomena significantly more than men? Do left-handed people have different interests than right-handed people?
- Can we predict a person's spending habits from their interests and movie or music preferences?
- Can we describe a large number of human interests by a smaller number of latent concepts?
- Are there any connections between music and movie preferences?
- How to effectively visualize a lot of variables in order to gain some meaningful insights from the data?
- A small number of participants often cheat and answer the questions randomly. Can you identify them? Hint: Local outlier factor may help.
- Are there any patterns in missing responses? What is the optimal way of imputing the values in surveys?
- If some of a user's interests are known, can we predict the others? Or, if we know what a person listens to, can we predict which kinds of movies they might like?
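As a rough illustration of the Local outlier factor hint above, here is a minimal sketch using scikit-learn. It runs on synthetic answers, not on the survey data, and the number of neighbours and the contamination rate are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# Synthetic "consistent" respondents: 10 answers on a 1-5 scale that stay
# close to a personal baseline (made-up data, not the survey).
baseline = rng.integers(2, 5, size=(200, 1))
consistent = np.clip(baseline + rng.integers(-1, 2, size=(200, 10)), 1, 5)

# Synthetic "random" respondents: every answer drawn uniformly from 1-5.
random_resp = rng.integers(1, 6, size=(10, 10))

answers = np.vstack([consistent, random_resp])

# contamination is an assumed share of careless respondents.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(answers)  # -1 = flagged as outlier, 1 = inlier

suspects = np.where(labels == -1)[0]
print("Rows flagged for manual review:", suspects)
```

Flagged rows are only candidates for review; an unusual but honest respondent can also receive a high outlier score.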
About the Dataset:
- In 2013, students of the Statistics class at FSEV UK were asked to invite their friends to participate in this survey.
- The data file (responses.csv) consists of 1010 rows and 150 columns (139 integer and 11 categorical).
- For convenience, the original variable names were shortened in the data file.
- The data contain missing values.
- The survey was presented to participants in both electronic and written form.
- The original questionnaire was in the Slovak language and was later translated into English.
- All participants were of Slovakian nationality, aged 15 to 30.
- The variables can be split into the following groups:
- Music preferences (19 items)
- Movie preferences (12 items)
- Hobbies & interests (32 items)
- Phobias (10 items)
- Health habits (3 items)
- Personality traits, views on life, & opinions (57 items)
- Spending habits (7 items)
- Demographics (10 items)
Dataset Cleaning:
Common data cleaning steps:
1) dropna() - Removes entries with missing values. By default it drops every row that contains a NaN; with axis=1 it drops columns instead.
2) get_dummies() - Converts categorical variables into dummy/indicator variables.
Sample Code:
import pandas as pd
responses4 = pd.get_dummies(data=responses, columns=['Smoking', 'Punctuality', 'Lying', 'Alcohol', 'Internet usage', 'Gender', 'Left - right handed', 'Education', 'Only child', 'Village - town', 'House - block of flats'])
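The two cleaning steps above can be tried end to end on a small example. The sketch below uses a made-up four-row table in place of responses.csv; the values and the choice of columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for responses.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "Music": [5, 4, np.nan, 3],
    "Gender": ["female", "male", "male", "female"],
    "Smoking": ["never smoked", "tried smoking", "never smoked", None],
})

# Step 1: drop every row that contains at least one missing value.
toy_complete = toy.dropna(axis=0, how="any")

# Step 2: one-hot encode the categorical columns; numeric columns are kept as-is.
toy_encoded = pd.get_dummies(toy_complete, columns=["Gender", "Smoking"])

print(toy_encoded.columns.tolist())
```

Dropping rows is the simplest option, but it discards every respondent who skipped even a single question, so the number of remaining rows is worth checking after this step.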
Following are the insights I plan to explore in this dataset:
- Loneliness - Which factors contribute to a person's feeling of loneliness?
- Life Struggles - Who is struggling more in life, and how does this relate to height and weight?
- Drinkers vs Non-drinkers - Predicting drinking behaviour from a person's choices and character traits.
The following machine learning techniques are used to analyse the areas above:
- Loneliness - Logistic Regression
- Life Struggles (relationship with Height and Weight) - Logistic Regression
- Drinkers vs Non-drinkers - Decision Tree
Logistic Regression - Loneliness:
- In this analysis, I use logistic regression to check whether I can get an interesting insight related to Loneliness. I would also like to see which features are associated with Loneliness.
- For this analysis the following steps were performed:
- Null values were removed. This was done as a common step at the top.
- Dummy variables were created from the categorical variables.
- Two data frames, males and females, were created for visualisation.
Result:
How to predict a young person’s loneliness:
- First, I transformed the original Loneliness variable into a binary (dummy) variable.
- Loneliness > 3 is considered lonely.
- Loneliness <= 3 is considered not lonely.
- Using a logistic regression model with 71 variables, I can predict whether a person is lonely with about 71% accuracy on K-fold test sets.
- The variables with the largest coefficients can be found in the figure above.
- People who like writing, fear public speaking, spend a lot of time on the internet, enjoy using a PC and spend a lot on gadgets tend to report more loneliness, whereas spending time with friends, active sports, geography, pets and entertainment are negatively related to loneliness.
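The procedure described above (binarise Loneliness at 3, fit a logistic regression, score it with K-fold cross-validation, then inspect the coefficients) might look roughly like the sketch below. It uses synthetic data with three hypothetical feature columns instead of the 71 real variables, and the number of folds is an assumption, so it will not reproduce the 71% figure.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic 1-5 answers; column names are a hypothetical subset of the survey.
X = pd.DataFrame({
    "Fun with friends": rng.integers(1, 6, n),
    "Writing": rng.integers(1, 6, n),
    "Public speaking": rng.integers(1, 6, n),
})

# Synthetic loneliness score on a 1-5 scale (made-up relationship).
raw = (3 + 0.5 * (X["Writing"] - 3) - 0.5 * (X["Fun with friends"] - 3)
       + rng.normal(0, 1, n))
loneliness = raw.round().clip(1, 5).astype(int)

# Binarise: Loneliness > 3 is "lonely" (1), otherwise "not lonely" (0).
y = (loneliness > 3).astype(int)

# K-fold cross-validated accuracy (5 folds is an assumption).
model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()
print(f"Mean CV accuracy: {mean_accuracy:.2f}")

# Fit on all rows and rank features by absolute coefficient size.
model.fit(X, y)
coefs = pd.Series(model.coef_[0], index=X.columns)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

Because all features here share the same 1-5 scale, the coefficients are directly comparable; with mixed scales, standardising the features first would make the ranking more meaningful.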
Logistic Regression - Life Struggles:
- In this analysis, I use logistic regression to check whether I can get an interesting insight related to Life Struggles.
- I would also like to see which features are associated with Life Struggles.
- For this analysis the following steps were performed:
- Null values were removed. This was done as a common step at the top.
- Dummy variables were created from the categorical variables.
- Finding the highest correlations.
- Trying to find the reasons behind those correlations.
Result:
- The most correlated features are Height and Weight, which is self-explanatory.
- People who are interested in Biology are also interested in Medicine and Chemistry.
- The same holds for Fantasy/Fairy tales and Animated movies.
- One might ask why there is a negative correlation between Life struggles and both Weight and Height. Let's explore further to understand the reasons behind it.
- Looking at the correlations of Life struggles, Height and Weight, I see that all three variables are highly correlated with gender.
- If you have more life struggles, you are probably a woman :) Likewise, if you like shopping, you are more likely to be a woman.
- But if you are tall, weigh a lot or like "PC Software, Hardware", then you are probably a man.
- The negative correlation between Life struggles and Height/Weight is therefore explained by the difference between women and men: gender is related to all three variables.
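The reasoning above, where gender sits behind the Height and Life struggles correlation, can be checked by ranking the pairwise correlations and then recomputing the correlation within each gender. The sketch below does this on synthetic data; the effect sizes are invented purely to illustrate the pattern and are not estimates from the survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000

# Synthetic data: gender shifts both Height and Life struggles, but within
# each gender the two are generated independently (made-up effect sizes).
is_female = rng.integers(0, 2, n)
toy = pd.DataFrame({
    "Gender_female": is_female,
    "Height": 180 - 13 * is_female + rng.normal(0, 7, n),
    "Life struggles": 2.5 + 1.0 * is_female + rng.normal(0, 1, n),
})

# Rank all variable pairs by absolute correlation (upper triangle only).
corr = toy.corr()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)

# Correlation over everyone versus within each gender separately.
overall = toy["Height"].corr(toy["Life struggles"])
within = {
    g: grp["Height"].corr(grp["Life struggles"])
    for g, grp in toy.groupby("Gender_female")
}
print("overall:", round(overall, 2), "within gender:", within)
```

If the within-gender correlations are close to zero while the overall one is clearly negative, the overall correlation is an artefact of mixing the two groups.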
Decision Tree - Drinkers vs Non-drinkers:
- In this analysis, I tried a decision tree to check whether I can get an interesting insight related to drinkers and non-drinkers.
- For this analysis the following steps were performed:
- Null values were removed. This was done as a common step at the top.
- Dummy variables were created from the categorical variables.
- A function was written to replace the strings in the Alcohol field with numbers and store them in a new column, Alcohol2.
- A decision tree with depth 3 was created.
- It was surprising to find that teenagers aged 15-16 tend to drink more even though they are not legally allowed to do so.
- Underage drinking is a serious public health problem in society.
- According to a study, Britain has the fourth highest level of underage drinking among 15-year-olds.
- Parents and public welfare organizations can work together to fight this issue by focusing on the factors in the above analysis.
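The Alcohol2 encoding and the depth-3 tree described above could be sketched as follows. The answer labels in the mapping, the two feature columns and all ten rows are assumptions made up for this example, and a dictionary with map() stands in for the custom replacement function.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed answer labels for the Alcohol question and their numeric codes.
alcohol_codes = {"never": 0, "social drinker": 1, "drink a lot": 2}

# Made-up rows; feature columns are hypothetical picks for illustration.
toy = pd.DataFrame({
    "Age": [15, 16, 17, 19, 21, 22, 24, 26, 28, 30],
    "Entertainment spending": [4, 5, 3, 2, 4, 1, 3, 5, 2, 1],
    "Alcohol": ["drink a lot", "social drinker", "social drinker", "never",
                "social drinker", "never", "drink a lot", "social drinker",
                "never", "social drinker"],
})

# Replace the strings with numbers and store them in a new column.
toy["Alcohol2"] = toy["Alcohol"].map(alcohol_codes)

# Fit a shallow tree so the splits stay readable.
features = ["Age", "Entertainment spending"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(toy[features], toy["Alcohol2"])

# Print the learned rules as indented text.
print(export_text(tree, feature_names=features))
```

Limiting the depth to 3 keeps the rules short enough to read, at the cost of a coarser fit.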
Loneliness: The Co-op Foundation and other charitable organizations working to prevent and tackle youth loneliness can take these factors into consideration to help young people form stronger connections within their community.
Life Struggles: Please take care of women! It is everyone's moral responsibility to treat women with respect and dignity, as they already face a lot of struggles.
Drinkers vs Non-drinkers: It was surprising to find that teenagers aged 15-16 tend to drink more even though they are not legally allowed to do so. Underage drinking is a serious public health problem in society. According to a study, Britain has the fourth highest level of underage drinking among 15-year-olds. Parents and public welfare organizations can work together to fight this issue by focusing on the factors in the above analysis.