Predicting user behaviour is one of the most widespread uses of machine learning; it powers the recommendation engines behind many of today's online services.
The basic idea was to take user data, feed it to an unsupervised algorithm to find patterns, and categorise users based on those patterns. The result would be a user database in which each user is placed in a specific category alongside other users who fit the same criteria. Finally, once all existing users belong to a category (that is, they are labelled), new users can be classified into one of these categories based on their sign-up data.
The project was built with a few Python libraries: somoclu, sklearn and seaborn.
This project is the practical part of my bachelor’s thesis. The goal was to show how machine learning models can be used to predict consumer behaviour. To achieve this, I used Emergent Self-Organizing Maps and K-Means clustering for data labelling, and then a K-Nearest Neighbours classifier to classify new data based on the labelled data. The result was a proof-of-concept solution for recommending products to new customers who have no previous purchases.
The data used in this project comes from the Black Friday dataset on Kaggle. It contains user data such as age, gender, occupation and marital status, as well as purchase data such as the ID of the product bought and the amount spent.
The machine learning pipeline built in this project is shown in the diagram below. The core of the system is the data labelling process, which uses the ESOM and K-Means algorithms to find patterns and cluster the data. This technique is described in a paper by Ultsch et al., 2005.
K in K-Means was chosen using the Elbow method, as described in this article. After a few trial runs, the optimal number of categories was determined to be 6.
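To illustrate the Elbow method, here is a minimal sketch using sklearn. It does not use the Black Friday data: the synthetic array `X` (four well-separated blobs), the range of K values and the random seeds are all assumptions made for the example. The idea is to fit K-Means for several values of K and look for the point where the inertia (within-cluster sum of squared distances) stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in data: four well-separated 2-D blobs, 60 points each.
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.4, size=(60, 2)) for c in centers])

# Fit K-Means for a range of K and record the inertia of each fit.
ks = list(range(1, 10))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# The "elbow" is the K after which the inertia only decreases slowly.
for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

On real data the bend is usually less obvious than on synthetic blobs, which is why a few trial runs and a plot of inertia against K help when picking the final value.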
Models used:
- Emergent Self-Organizing Map (ESOM) - an unsupervised neural network based on Self-Organising Maps, used primarily for visualising high-dimensional data. Here it is used to reveal underlying structures in the data in the form of a heat map, so that clustering algorithms can be run on the network.
- K-Means Clustering - one of the most popular unsupervised ML algorithms. In this project it is used for clustering the nodes of the ESOM neural network.
- K-Nearest Neighbours (KNN) - a very simple classifier. In this project it is used for classifying new customers based on the existing, newly labelled customers.
Note that the KNN classifier can be swapped for any other classification algorithm. I used KNN simply because of its simplicity and the small size of the dataset.
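The way the three models fit together can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the synthetic `users` array, the `train_som` and `best_matching_units` helpers and every parameter value are made up for the example, and the small hand-written 8x8 self-organising map only stands in for the much larger ESOM that the project trains with somoclu.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-in for encoded user features: three groups in 2-D.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
users = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])


def train_som(data, rng, rows=8, cols=8, epochs=20, lr0=0.5, sigma0=3.0):
    """Train a tiny rectangular SOM; returns a (rows*cols, n_features) codebook."""
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise each node with a randomly chosen data point.
    codebook = data[rng.integers(0, len(data), size=rows * cols)].astype(float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1 - frac), 0.5)  # shrinking neighbourhood
            bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))
            grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_dist2 / (2 * sigma ** 2))
            # Pull the best-matching node and its grid neighbours towards x.
            codebook += lr * h[:, None] * (x - codebook)
            step += 1
    return codebook


def best_matching_units(data, codebook):
    """Index of the closest codebook node for every row of data."""
    d = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


# 1) Train the map, 2) cluster its nodes with K-Means,
# 3) label each user with the cluster of its best-matching node.
codebook = train_som(users, rng)
node_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(codebook)
user_labels = node_labels[best_matching_units(users, codebook)]

# 4) Train KNN on the newly labelled users and classify a new user.
knn = KNeighborsClassifier(n_neighbors=5).fit(users, user_labels)
new_user = np.array([[5.1, 4.9]])
predicted = knn.predict(new_user)[0]
print("Predicted category for the new user:", predicted)
```

Because the classifier only sees `users` and `user_labels`, `KNeighborsClassifier` could be replaced by any other sklearn classifier that offers `fit` and `predict` without touching the labelling steps.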
The first image below shows the heat map of data density generated by the ESOM network. Image 2 shows each cluster that was found in a different colour. Finally, Image 3 shows which cluster each of the first 100 users belongs to.
The result of the data labelling was 6 unique user categories, each containing users with similar demographics. For example, in the images below we can see that ESOM and K-Means recognised one category as consisting entirely of adult women (Image 1). The other two images show the “Young adult men” category (Image 2) and the “Men over 45” category (Image 3).