This is the code for our papers:
- TinyHAR: A Lightweight Deep Learning Model Designed for Human Activity Recognition (ISWC 2022 Best Paper Award)
- Improving HAR Models by Learnable Sparse Wavelet Layer
Sensor streams can only be represented in abstract ways, and the recorded data typically cannot be interpreted easily by humans. This makes post-hoc annotation difficult, which limits the availability and size of annotated Human Activity Recognition (HAR) datasets. Given the complexity of sensor-based HAR tasks, such large datasets are typically necessary to apply state-of-the-art (SOTA) machine learning. Although Deep Learning (DL) models have shown extraordinary performance on HAR tasks, most DL models for HAR are large, i.e., they have numerous trainable network parameters. When the available data is limited, an overly large number of parameters makes the model prone to overfitting, limiting or even jeopardizing its generalization performance. The second challenge arises from the fact that the wearable devices on which HAR models are intended to run typically have limited resources. As a result, an excessive number of network parameters complicates the deployment of such models on end devices.
To address these challenges, it is desirable to design an efficient and lightweight deep learning model. Reviewing related work, we found only a few works that consider designing a lightweight HAR model. To this end, we propose an efficient and lightweight DL model with a small model size and low inference latency.
TinyHAR: A Lightweight Deep Learning Model Designed for Human Activity Recognition
Zhou, Y.; Zhao, H.; Huang, Y.; Hefenbrock, M.; Riedel, T.; Beigl, M.
2022. International Symposium on Wearable Computers (ISWC’22), Atlanta, GA and Cambridge, UK, September 11-15, 2022, Association for Computing Machinery (ACM). doi:10.1145/3544794.3558467
TBD
Network Design (see paper for details)
Designing an optimal, lightweight DL model requires careful consideration of the characteristics of the target tasks and of the factors that reduce inference time and the number of operations. Based on these two considerations, we developed the following guidelines for designing lightweight HAR models:
- G1: The extraction of local temporal context should be enhanced.
- G2: Different sensor modalities should be treated unequally.
- G3: Multi-modal features should be fused.
- G4: Global temporal information should be extracted.
- G5: The temporal dimension should be reduced appropriately.
- G6: The number of channels should be managed from shallow to deep layers.
To enhance the local context (G1), we apply a convolutional subnet to extract and fuse initial local features from the raw data. Considering the varying contribution of different modalities, each sensor channel is processed separately through four individual convolutional layers (G2). Each convolutional layer is followed by a ReLU nonlinearity and batch normalization~\cite{batchnorm}. Individual convolution means that the kernels have a 1D structure along the temporal axis only, so no information is mixed across sensor channels at this stage (see the sketch below).
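The following is a minimal PyTorch sketch of such a per-channel convolutional subnet. The filter count, kernel size, and padding are illustrative assumptions rather than the exact values from the paper; a 2D convolution with a $(k, 1)$ kernel is one way to restrict the filtering to the temporal axis while keeping sensor channels separate.

```python
import torch
import torch.nn as nn

class IndividualConvSubnet(nn.Module):
    """Sketch of the convolutional subnet (G1, G2): four convolutional layers
    whose kernels span only the temporal axis, so every sensor channel is
    filtered independently. Filter count and kernel size are illustrative
    placeholders, not the values from the paper."""
    def __init__(self, num_filters=20, kernel_size=5, num_layers=4):
        super().__init__()
        layers = []
        in_ch = 1  # raw data enters with a single "filter" dimension
        for _ in range(num_layers):
            layers += [
                # kernel (kernel_size, 1): 1D structure along the temporal axis only
                nn.Conv2d(in_ch, num_filters,
                          kernel_size=(kernel_size, 1),
                          padding=(kernel_size // 2, 0)),
                nn.ReLU(),
                nn.BatchNorm2d(num_filters),
            ]
            in_ch = num_filters
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, 1, T, C) -- T time steps, C sensor channels
        return self.net(x)  # (batch, num_filters, T, C)

if __name__ == "__main__":
    x = torch.randn(8, 1, 128, 9)      # e.g. 128 time steps, 9 sensor channels
    feats = IndividualConvSubnet()(x)
    print(feats.shape)                  # torch.Size([8, 20, 128, 9])
```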
Previous work [1] successfully adopted a self-attention mechanism to learn the collaboration between sensor channels. Inspired by this, we utilize one transformer encoder block~\cite{model:transformer} to learn this interaction, which is performed across the sensor channel dimension (G2) at each time step. The transformer encoder block consists of a scaled dot-product self-attention layer and a two-layer Fully Connected (FC) feed-forward network. The scaled dot-product self-attention determines the relative importance of each sensor channel by considering its similarity to all other sensor channels. Each sensor channel then uses these relative weights to aggregate the features from all other sensor channels.
The feed-forward layer is then applied to each sensor channel, further fusing the aggregated features of each channel. At this point, the features of each channel are contextualized with the underlying cross-channel interactions, and the shape of the data remains unchanged.
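The sketch below shows one way to run a standard transformer encoder block across the sensor-channel dimension at each time step, assuming PyTorch and the tensor layout from the convolutional sketch above. The feature dimension, head count, and feed-forward width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossChannelInteraction(nn.Module):
    """Sketch of the cross-channel interaction stage (G2): one transformer
    encoder block (scaled dot-product self-attention + two-layer FC
    feed-forward network) applied across the sensor-channel dimension at
    every time step. Dimensions and head count are assumptions."""
    def __init__(self, feat_dim=20, num_heads=1, ff_dim=64):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads,
            dim_feedforward=ff_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, F, T, C) from the convolutional subnet
        b, f, t, c = x.shape
        # fold the time axis into the batch so attention runs over the C channels
        x = x.permute(0, 2, 3, 1).reshape(b * t, c, f)   # (B*T, C, F)
        x = self.encoder(x)                               # channels attend to each other
        return x.reshape(b, t, c, f)                      # (B, T, C, F)

if __name__ == "__main__":
    feats = torch.randn(8, 20, 128, 9)
    out = CrossChannelInteraction()(feats)
    print(out.shape)  # torch.Size([8, 128, 9, 20])
```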
In order to fuse the learned features from all sensor channels (G3), we first vectorize these representations at each time step, i.e., the sensor and filter dimensions are flattened into a single feature vector per time step. After the features are fused across the sensor and filter dimensions, we obtain a sequence of refined feature vectors, one per time step.
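A minimal sketch of this fusion step, assuming the flattened per-step features are fused by a single fully connected layer (the fusion operator and its output dimension are illustrative assumptions):

```python
import torch
import torch.nn as nn

class CrossChannelFusion(nn.Module):
    """Sketch of the cross-channel fusion stage (G3): the per-channel features
    at every time step are vectorized (flattened over the sensor and filter
    dimensions) and fused. The fully connected fusion layer and the output
    dimension are illustrative choices, not necessarily the paper's."""
    def __init__(self, num_channels=9, feat_dim=20, out_dim=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(num_channels * feat_dim, out_dim),
            nn.ReLU(),
        )

    def forward(self, x):
        # x: (B, T, C, F) from the cross-channel interaction stage
        b, t, c, f = x.shape
        x = x.reshape(b, t, c * f)   # vectorize each time step
        return self.fuse(x)          # (B, T, out_dim): one refined vector per step

if __name__ == "__main__":
    out = CrossChannelFusion()(torch.randn(8, 128, 9, 20))
    print(out.shape)  # torch.Size([8, 128, 64])
```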
Given that not all time steps contribute equally to the recognition of the ongoing activities, it is crucial to learn the relevance of the features at each time step in the sequence. Following the work in~\cite{model:attnsense}, we generate a global contextual representation $\mathbf{c} \in \mathbb{R}^{F'}$, where $F'$ is the feature dimension after fusion, by taking a weighted sum of the hidden states (features) at each time step. The weights are calculated through a temporal self-attention layer. Because the feature at the last time step $T'$, $\mathbf{x}_{T'} \in \mathbb{R}^{F'}$, already carries a representation of the whole sequence, the generated global representation $\mathbf{c}$ is added to $\mathbf{x}_{T'}$. Here, we introduce a trainable multiplier parameter to weight this addition.
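The sketch below illustrates this temporal attention stage, assuming a single linear scoring layer as the temporal self-attention and a scalar multiplier applied to the global context; whether the multiplier scales $\mathbf{c}$ or $\mathbf{x}_{T'}$ is an assumption here, see the paper for the exact formulation.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Sketch of the temporal attention stage: a temporal self-attention layer
    scores every time step, the scores form a weighted sum of the per-step
    features (the global context c), and c is added to the feature of the last
    time step, scaled by a trainable multiplier. The scoring network and the
    placement of the multiplier are assumptions for illustration."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)            # per-time-step relevance score
        self.gamma = nn.Parameter(torch.ones(1))       # trainable multiplier

    def forward(self, x):
        # x: (B, T, F') refined feature vectors
        weights = torch.softmax(self.score(x), dim=1)  # (B, T, 1), sums to 1 over time
        c = (weights * x).sum(dim=1)                   # global context (B, F')
        return x[:, -1, :] + self.gamma * c            # fuse with the last time step

if __name__ == "__main__":
    out = TemporalAttentionFusion()(torch.randn(8, 128, 64))
    print(out.shape)  # torch.Size([8, 64])
```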