Young People Survey

I. Introduction:

Dataset: This project aims to predict behaviour patterns in young adults using a supervised learning approach. The Young People Survey dataset comes from a survey conducted among young people in Slovakia. The dataset is divided into two CSV files.

Objective:

The following questions need to be answered:

Given the music preferences, do people make up any clusters of similar behavior?
Do women fear certain phenomena significantly more than men? Do left-handed people have different interests than right-handed people?
Can we predict spending habits of a person from his/her interests and movie or music preferences?
Can we describe a large number of human interests by a smaller number of latent concepts?
Are there any connections between music and movie preferences?
How to effectively visualize a lot of variables in order to gain some meaningful insights from the data?
A small number of participants often cheat and answer the questions randomly. Can you identify them? Hint: Local outlier factor may help.
Are there any patterns in missing responses? What is the optimal way of imputing the values in surveys?
If some of a user's interests are known, can we predict the others? Or, if we know what a person listens to, can we predict which kinds of movies he/she might like?

II. Data Preparation:

About the Dataset:

In 2013, students of the Statistics class at FSEV UK were asked to invite their friends to participate in this survey.
The data file (responses.csv) consists of 1010 rows and 150 columns (139 integer and 11 categorical).
For convenience, the original variable names were shortened in the data file.
The data contain missing values.
The survey was presented to participants in both electronic and written form.
The original questionnaire was in the Slovak language and was later translated into English.
All participants were of Slovakian nationality, aged 15 to 30.
The variables can be split into the following groups:
Music preferences (19 items)
Movie preferences (12 items)
Hobbies & interests (32 items)
Phobias (10 items)
Health habits (3 items)
Personality traits, views on life, & opinions (57 items)
Spending habits (7 items)
Demographics (10 items)

Dataset Cleaning:

Common data cleaning steps:

1) dropna() - Removes entries that contain NaN values. By default pandas drops every row with at least one NaN; passing axis=1 drops columns instead.
2) get_dummies() - Converts categorical variables into dummy/indicator variables.
Sample Code:

responses4 = pd.get_dummies(columns=['Smoking', 'Punctuality', 'Lying','Alcohol', 'Internet usage', 'Gender', 'Left - right handed', 'Education', 'Only child', 'Village - town', 'House - block of flats'],data=responses)
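
As a minimal sketch of the two cleaning steps above, the snippet below runs them on a tiny hand-made data frame instead of responses.csv. The values in toy and the helper name clean_responses are illustrative assumptions, not the project's actual code.

```python
import numpy as np
import pandas as pd


def clean_responses(df, categorical_columns):
    """Drop incomplete rows, then one-hot encode the categorical columns."""
    # dropna() defaults to axis=0: every row with at least one NaN is removed.
    complete = df.dropna()
    # get_dummies() replaces each listed column with one indicator column per category.
    return pd.get_dummies(complete, columns=categorical_columns)


# Toy stand-in for responses.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Loneliness": [4.0, 2.0, np.nan, 5.0],
    "Gender": ["female", "male", "male", None],
    "Alcohol": ["never", "social drinker", "drink a lot", "never"],
})

cleaned = clean_responses(toy, ["Gender", "Alcohol"])
print(cleaned.columns.tolist())
```

Dropping every incomplete row is the simplest option, but it can discard a large share of a survey in which most respondents skipped at least one question, so it is worth checking the row count before and after.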

III. Exploratory Analysis:

IV. Training Machine Learning Algorithms:

The following are the insights I plan to explore in this dataset:

Loneliness - Which factors contribute to a person's feeling of loneliness?
Life Struggles - How do life struggles relate to height and weight, and who is struggling more in life?
Drinkers vs Non-drinkers - Predicting drinking habits from a person's choices and character traits.

The following machine learning techniques are used to analyse these areas:

Loneliness - Logistic Regression
Life Struggles (relationship with height and weight) - Logistic Regression
Drinkers vs Non-drinkers - Decision Tree

Logistic Regression - Loneliness:

In this analysis, I use logistic regression to check whether I can gain an interesting insight related to loneliness. I would also like to see which features can be associated with loneliness.
For this analysis the following steps were performed:

Null values were removed. This was done as a common step at the top.
Dummy variables were created from the categorical variables.
Two data frames, males and females, were created for the purpose of visualisation.

Result:

How can a young person’s loneliness be predicted?

First, I transformed the original Loneliness variable into a dummy variable.
Respondents with Loneliness > 3 are considered lonely.
Respondents with Loneliness <= 3 are considered not lonely.
Using a logistic regression model with 71 variables, I can predict whether a person is lonely with an accuracy of 71% on K-fold test sets.
The variables with the largest coefficients can be found in the figure above.
People who like writing, fear public speaking, spend a lot of time on the internet, enjoy using a PC and spend a lot on gadgets tend to report a higher degree of loneliness, whereas liking time with friends, active sports, geography, pets and entertainment is negatively related to loneliness.
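
The binarisation and K-fold evaluation described above can be sketched as follows. This is only an illustration on random synthetic answers, so the accuracy and coefficients it prints say nothing about the real survey. The three feature columns, the choice of 5 folds and max_iter=1000 are assumptions, since the source does not state them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic 1-5 answers; column names mirror the survey but the values are random.
features = pd.DataFrame({
    "Writing": rng.integers(1, 6, n),
    "Fear of public speaking": rng.integers(1, 6, n),
    "Fun with friends": rng.integers(1, 6, n),
})
loneliness = pd.Series(rng.integers(1, 6, n), name="Loneliness")

# Binarise the target: scores above 3 count as lonely (1), the rest as not lonely (0).
lonely = (loneliness > 3).astype(int)

model = LogisticRegression(max_iter=1000)
folds = KFold(n_splits=5, shuffle=True, random_state=0)  # 5 folds is an assumption
scores = cross_val_score(model, features, lonely, cv=folds, scoring="accuracy")
print("Mean K-fold accuracy:", scores.mean())

# Fit once on all rows to inspect the sign and size of each coefficient.
model.fit(features, lonely)
coefficients = pd.Series(model.coef_[0], index=features.columns).sort_values()
print(coefficients)
```

Positive coefficients push the prediction towards "lonely" and negative ones away from it, which is how a ranking of the largest coefficients can be read.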

Logistic Regression - Life Struggles:

In this analysis, I use logistic regression to check whether I can gain an interesting insight related to life struggles.
I would also like to see which features can be associated with life struggles.
For this analysis the following steps were performed:

Null values were removed. This was done as a common step at the top.
Dummy variables were created from the categorical variables.
The highest correlations were identified.
The reasons behind those correlations were investigated.
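
One way to find the highest correlations is to compute the correlation matrix and rank its off-diagonal entries by absolute value. The sketch below does this on a small invented table; the helper name top_correlations and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def top_correlations(df, k=3):
    """Return the k column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the trivial self-correlations on the diagonal are dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(k)


# Invented values, not survey data.
toy = pd.DataFrame({
    "Height": [160, 165, 170, 175, 180, 185],
    "Weight": [52, 58, 66, 70, 80, 85],
    "Life struggles": [5, 4, 4, 2, 2, 1],
    "Pets": [3, 1, 4, 1, 5, 2],
})

top = top_correlations(toy, k=3)
print(top)
```

Ranking by absolute value keeps strong negative correlations in view alongside strong positive ones.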

Result:

The most correlated features are Height and Weight, which is self-explanatory.
People who are interested in Biology are also interested in Medicine and Chemistry.
The same holds for Fantasy/Fairy tales and Animated movies.
One might ask why there is a negative correlation between Life struggles and both Weight and Height. Let's explore further to understand the reasons behind this correlation.
Looking at how Life struggles, Height and Weight correlate with the other variables shows that all three are highly correlated with Gender.
If you report more life struggles, you are probably a woman :) Likewise, if you like shopping, you are more likely to be a woman.
But if you are tall, weigh a lot or like "PC Software, Hardware", then you are probably a man.
The negative correlation between Life struggles and Height/Weight is therefore due to the split between females and males.
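
The sketch below illustrates how a grouping variable can produce this kind of correlation. It uses purely synthetic data in which height and life struggles are independent within each gender and only the group means differ. All numbers are hypothetical, so it shows the mechanism only and does not reproduce the survey results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000  # respondents per gender (hypothetical)

frames = []
for gender, mean_height, mean_struggles in [("female", 165, 3.5), ("male", 180, 2.0)]:
    frames.append(pd.DataFrame({
        "Gender": gender,
        # Within each gender, height and life struggles are drawn independently.
        "Height": rng.normal(mean_height, 7, n),
        "Life struggles": rng.normal(mean_struggles, 1, n),
    }))
synthetic = pd.concat(frames, ignore_index=True)

# Correlation over everyone versus correlation inside each gender group.
pooled = synthetic["Height"].corr(synthetic["Life struggles"])
within = {
    gender: group["Height"].corr(group["Life struggles"])
    for gender, group in synthetic.groupby("Gender")
}
print("pooled:", round(pooled, 2))
print("within gender:", {g: round(c, 2) for g, c in within.items()})
```

Running the same pooled versus within-group comparison on the real data would be a direct check of the explanation given above.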

Decision Tree:

In this analysis, I tried a decision tree to check whether I can gain an interesting insight related to drinkers and non-drinkers.
For this analysis the following steps were performed:

Null values were removed. This was done as a common step at the top.
Dummy variables were created from the categorical variables.
A function was written to replace the strings in the Alcohol field with numbers and store the result in a new column, Alcohol2.
A decision tree with depth 3 was created.
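
The last two steps can be sketched as follows. The numeric codes in ALCOHOL_CODES, the helper name encode_alcohol, the two feature columns and the toy rows are all assumptions made for illustration; the source does not show the original function or say which features were fed to the tree.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical mapping from the Alcohol answers to numbers.
ALCOHOL_CODES = {"never": 0, "social drinker": 1, "drink a lot": 2}


def encode_alcohol(df):
    """Return a copy of df with the numeric column Alcohol2 added."""
    out = df.copy()
    out["Alcohol2"] = out["Alcohol"].map(ALCOHOL_CODES)
    return out


# Invented rows, not survey data.
toy = pd.DataFrame({
    "Age": [15, 16, 19, 22, 25, 28, 17, 30],
    "Fun with friends": [5, 5, 3, 4, 2, 1, 4, 2],
    "Alcohol": ["drink a lot", "social drinker", "social drinker", "never",
                "never", "never", "drink a lot", "social drinker"],
})
encoded = encode_alcohol(toy)

features = encoded[["Age", "Fun with friends"]]
# max_depth=3 caps the tree at three levels of splits.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(features, encoded["Alcohol2"])

# Print the learned splits as indented text rules.
print(export_text(tree, feature_names=list(features.columns)))
```

A shallow tree keeps the printed rules short enough to read, at the cost of a coarser fit.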

Result:

It was surprising to find that teenagers aged 15-16 tend to drink more even though they are not legally allowed to do so. Underage drinking is a serious public health problem in society.
According to a study, Britain has the fourth highest level of underage drinking among 15-year-olds.
Parents and public welfare organizations can work together to fight this issue by focusing on the factors in the above analysis.

V. Conclusion:

Loneliness: The Co-op Foundations and charitable organizations working to prevent and tackle youth loneliness can take these factors into consideration to help young people form stronger connections within their community.

Life Struggles: Please take care of females! It is the moral responsibility of everyone to treat women with respect and dignity as they already have a lot of struggles.

Drinkers vs Non-drinkers: It was surprising to find that teenagers aged 15-16 tend to drink more even though they are not legally allowed to do so. Underage drinking is a serious public health problem in society. According to a study, Britain has the fourth highest level of underage drinking among 15-year-olds. Parents and public welfare organizations can work together to fight this issue by focusing on the factors in the above analysis.

VI. Future Enhancements:

VII. References:

https://www.kaggle.com/miroslavsabo/young-people-survey
https://rstudio-pubs-static.s3.amazonaws.com/263733_b879f33aa4ac4499a68268845dc774d8.html
