Activity Context Recognition is a fundamental aspect of context-aware computing, focused on automatically inferring contextual information from sensor-generated observations. It involves the use of various techniques, including machine learning and data analysis, to analyse sensor data and classify or recognize different activities being performed by individuals. The goal of Activity Context Recognition is to enable systems to understand and respond to users' activities in real-time, allowing for personalised assistance, enhanced user experience, efficient resource management, safety and security, behavioural analysis, adaptive systems, and intelligent applications across domains such as healthcare, smart environments, and lifestyle monitoring.
A fitness company has commissioned an intelligent model for their mobile fitness application to enable automated recognition of users' activities. The company has provided labelled historical activity context data collected from individuals who participated in the data collection process. The data encompasses readings from several smartphone built-in sensors, including the magnetic, orientation, accelerometer, rotation, gyroscope, light and sound sensors. The objective is to analyse, design, implement, and evaluate an activity recognition model for the fitness system using this activity context tracking dataset.
To build the prediction/recognition model, three different Machine Learning Classifier Models are used: Multi-Layer Perceptron Neural Network, Decision Tree Classifier and Random Forest Classifier.
The steps performed for classification and prediction are listed below:
- EDA (Exploratory Data Analysis)
- Feature Extraction/Selection
- Activity Classification
A labelled historical activity context tracking dataset is used for the Exploratory Data Analysis, as it incorporates all the attributes that are critical for recognizing human activity. It consists of a total of 23496 observations for 19 different attributes (a loading sketch follows the attribute list below). The attributes include:
- _id: Id of the observation
- orX: Orientation value along the x-axis
- orY: Orientation value along the y-axis
- orZ: Orientation value along the z-axis
- rX: Rotation around the x-axis
- rY: Rotation around the y-axis
- rZ: Rotation around the z-axis
- accX: Acceleration value on the x-axis
- accY: Acceleration value on the y-axis
- accZ: Acceleration value on the z-axis
- gX: Rate of rotation around the x-axis
- gY: Rate of rotation around the y-axis
- gZ: Rate of rotation around the z-axis
- mX: Magnetic field around the x-axis
- mY: Magnetic field around the y-axis
- mZ: Magnetic field around the z-axis
- lux: Ambient light level (illuminance) detected by the light sensor
- soundLevel: Measure of the sound level
- activity: Activity performed by the user (target label)
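As a minimal sketch of how the dataset could be loaded and inspected with pandas (the CSV file name below is an assumption, not the name provided by the company):

```python
import pandas as pd

# File name is assumed; replace with the actual activity context tracking export.
df = pd.read_csv("activity_context_tracking_data.csv")

print(df.shape)                 # expected: (23496, 19)
print(df["activity"].unique())  # the distinct activity labels
print(df.head())
```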
To ensure the accuracy of machine learning models, it is crucial to address missing or null values, as they can introduce bias into the results. Strategies for handling missing values must therefore be applied before the dataset is fed into a machine learning or deep learning framework. In this case the dataset does not contain any null values, as confirmed by the missing-value heat map, which shows no missing entries.
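A minimal sketch of the missing-value check, assuming seaborn is used for the heat map:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Count missing values per column; all counts should be zero for this dataset.
print(df.isnull().sum())

# Visualise missing entries as a heat map; an empty plot confirms there are none.
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing values per attribute")
plt.show()
```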
To perform a descriptive statistical analysis of the important features, multiple functions have been created. The descriptive statistics obtained (count, mean, standard deviation, minimum, quartiles and maximum) are summarised in the table below.
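A sketch of how these statistics could be produced with pandas (dropping the identifier column is an assumption):

```python
# Count, mean, standard deviation, min, quartiles and max for each sensor feature.
stats = df.drop(columns=["_id"]).describe()
print(stats.T)
```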
- Data Standardisation
Imbalanced training data poses challenges for machine learning classification problems. Class imbalance significantly biases models towards the majority class, leading to reduced classification performance and an increased number of false negatives. To mitigate these issues and enhance classification performance, working with a balanced dataset is crucial: a balanced dataset ensures equal consideration of the information used to predict each class and gives a more accurate picture of how the model will respond to test data. It is evident from the plots below that the activity classes of the target variable are not evenly distributed.
Thus, it is necessary to carry out data standardisation and apply relevant techniques to extract meaningful information before feeding the data to machine learning algorithms for context recognition, ensuring optimal performance and accurate predictions. Data standardisation is the process of transforming the data in a dataset to a common scale; it is a crucial preprocessing step that improves the performance and stability of many machine learning algorithms. The most commonly used rescaling methods are z-score standardisation and min-max scaling. In this classification problem, scikit-learn's StandardScaler is used to standardise the data values: standardising a dataset rescales the distribution of values so that the observed values have a mean of 0 and a standard deviation of 1.
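A minimal sketch of the standardisation step with scikit-learn's StandardScaler (the choice of columns to drop is an assumption):

```python
from sklearn.preprocessing import StandardScaler

# Separate the sensor readings from the identifier and the target label.
X = df.drop(columns=["_id", "activity"])
y = df["activity"]

# Rescale every feature to mean 0 and standard deviation 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```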
- SMOTE - Synthetic Minority Oversampling Technique
This approach generates new data by interpolating between existing observations. Instead of deleting or simply copying rows, SMOTE synthesises distinct new minority-class rows whose labels are derived from the original data.
After applying SMOTE to the standardised dataset, the number of rows increases to 984529. This new dataset is saved as a CSV file called "standardizedHACR.csv" and then imported for the classification problem, where it is divided into training and testing sets so that overfitting and underfitting of the training model can be detected and avoided.
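A sketch of the oversampling and export step, assuming the SMOTE implementation from the imbalanced-learn library (the source does not name the library, and the random_state value is an assumption):

```python
from imblearn.over_sampling import SMOTE
import pandas as pd

# Generate synthetic samples for the minority activity classes.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_scaled, y)

# Persist the balanced, standardised dataset for the classification stage.
balanced = pd.DataFrame(X_resampled, columns=X.columns)
balanced["activity"] = y_resampled
balanced.to_csv("standardizedHACR.csv", index=False)
```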
During model development, the training set is used for model training, whereas the testing set comprises data points used to assess the model's generalisation to new/unseen data. The scikit-learn library's train_test_split() function creates the training and testing sets, with the test_size parameter determining the proportion of the dataset allocated for testing and the random_state parameter controlling the data shuffling before the split.
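A minimal sketch of the split, assuming an 80/20 ratio (the actual test_size and random_state values are assumptions):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the balanced data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42
)
```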
Feature Extraction/Selection is performed to identify a subset of relevant features that have a strong relationship with the target variable, leading to improved model performance. It involves assessing the impact of each feature on model performance for selection purposes. In this dataset, the correlation heatmap shows that the attribute "orZ" is highly correlated with "accX".
Highly correlated features exhibit strong linear dependence and contribute almost equally to the dependent variable. Therefore, when two features display a high correlation, one of the features can be removed to avoid redundancy and multicollinearity. Multicollinearity poses challenges in determining the independent impact of each variable on the target variable.
Thus, removing highly correlated variables is a crucial step in data preprocessing and improves the performance, efficiency and accuracy of the machine-learning training model. In Python, the correlation matrix is used to identify the pairs of features that are highly correlated and then one of them is dropped from each highly correlated pair.
- corr() - function from the pandas library to calculate the correlation matrix
- drop() - function to remove the features with high correlation from the dataframe
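A sketch of the correlation-based feature removal, assuming a correlation threshold of 0.9 (the actual threshold is not stated in the source):

```python
import numpy as np

# Absolute correlation matrix of the standardised features.
corr = balanced.drop(columns=["activity"]).corr().abs()

# Keep only the upper triangle so every feature pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from each pair whose correlation exceeds the threshold (0.9 is an assumption).
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
balanced_reduced = balanced.drop(columns=to_drop)
```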
The Feature Selection stage identifies the most relevant features, which are then used together with the latest balanced dataset for the classification models. To classify the data, machine learning algorithms such as the Decision Tree Classifier, Random Forest Classifier, and Multi-Layer Perceptron Neural Network are employed on the newly standardised training and test sets. The effectiveness of these models is assessed by evaluating their performance on the test dataset, yielding a confusion matrix for each model.
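A sketch of how the three classifiers could be trained with scikit-learn; the hyperparameters shown are illustrative defaults rather than the values used in this report:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Illustrative hyperparameters; the report's actual settings are not given.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42),
}

# Fit each model on the training set and predict the held-out test set.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```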
A confusion matrix is used to display and summarise the results of a classification algorithm. In the case of a strong imbalance in the test data, evaluation metrics such as Precision, Recall and F1 Score provide insights beyond the accuracy metric.
- Accuracy: the proportion of correct predictions over the total number of predictions. It is not a reliable metric on its own when dealing with unbalanced datasets.
- Precision: the proportion of predicted positive instances that are actually positive. It should ideally be close to 1 (high) for a good classifier.
- Recall: the proportion of actual positive instances that are correctly identified. It should ideally be close to 1 (high) for a good classifier.
- F1 Score: the harmonic mean of precision and recall, combining the ability to accurately identify positive instances with the capacity to capture all positive instances.

The performances of the models are compared in terms of accuracy, precision, recall and F1-Score.
- Accuracy of Decision Tree Classifier: 99%
- Accuracy of Random Forest Classifier: 100%
- Accuracy of Multi-Layer Perceptron Neural Network: 86%
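The confusion matrices and the metrics above could be produced with scikit-learn roughly as follows (variable names continue from the earlier sketches):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

for name, y_pred in predictions.items():
    print(name)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    # Per-class precision, recall and F1-score.
    print(classification_report(y_test, y_pred))
```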
Based on the above model performances, it can be inferred that the Random Forest Classifier performed best after feature selection and the application of the SMOTE technique. As a result, the target variable in this classification problem can be predicted with an exceptional accuracy of 100%.
The project aimed to improve the understanding of the various data analytics and visualisation libraries used during Exploratory Data Analysis (EDA) for classification purposes. It also focused on dealing with unbalanced datasets through null-value checks, data standardisation, and oversampling techniques to balance the classes. This led to selecting and combining different attributes/features and applying various machine learning algorithms to predict the target output. Context recognition based on sensor data remains challenging, even with the wide variety of machine learning techniques available. Human behaviour is natural and spontaneous, and people may perform several activities at the same time or carry out unrelated activities in between. Therefore, it is advised that ACR systems be designed to identify concurrent activities, predict the speed of movement, and deal with uncertainty, in order to achieve high accuracy and improve functionality, quality, and safety across their various applications and industries.
- Numpy - https://numpy.org/
- Pandas - https://pandas.pydata.org/
- Matplotlib - https://matplotlib.org/
- Seaborn - https://seaborn.pydata.org/
- Plotly - https://plotly.com/
- Scikit Learn - https://scikit-learn.org/
- Keras - https://keras.io/