Natural disasters cause an average of around 60,000 deaths per year worldwide. When a natural disaster strikes, eyewitnesses often report it on social media in real time, for example on Twitter or Facebook, and many people now turn to social media for news because it is much faster than traditional media. Since disasters are often reported on social media first, rescue operators need to be able to respond quickly. However, there is currently no system in place to alert rescue operators about a disaster that is posted on social media.
The goal of this project is to identify whether a tweet is a disaster tweet through the use of machine learning.
In order to achieve the goals set out, we will need to:
- Find a suitable dataset
- Clean the dataset
- Find a suitable model for training
- Implement the idea (through a website)
| Details | Link |
| --- | --- |
| Presentation Video | Click Here |
| Full Code in Google Colab | Click Here |
| Website | Click Here |
For a detailed walkthrough, please view the source code in the following order:
- Data Extraction and Data Cleaning
- Data Visualization and Data Pre-processing
- Dense Network, LSTM and Bi-LSTM
- Comparison and Other Methods
- Validation of the Model
- Website
We used this dataset provided by Kaggle for our project, and trained the following models:
- Dense Network
- Long Short-Term Memory (LSTM) Network
- Bi-directional LSTM Network
- Among the Dense Network, LSTM, and Bi-directional LSTM models, the Dense Network achieved the highest accuracy.
- Overfitting slightly reduced the accuracy of the models.
- With close to 80% accuracy, our model classified tweets correctly most of the time.
- However, when tested on a new dataset, its accuracy dropped to approximately 50%.
- While this shows that the model learned some features that transfer across datasets, the drop in accuracy was larger than we expected.
- The drop may be caused by differences between datasets, since each dataset has its own unique features.
- It may also suggest that our training dataset is unrepresentative of the wider pool of disaster tweets, highlighting a limitation of our dataset: it does not cover a broad enough domain.
- On a larger scale, we would like to try the Bidirectional Encoder Representations from Transformers (BERT) model.
- Further tune our hyperparameters.
- Generalise our dataset to prevent bias and unrepresentative data.
- Data Cleaning (see the sketch below)
  - Using `regex` to remove unwanted characters
  - Fixed the imbalanced dataset to prevent inaccuracy of data
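The regex clean-up and class rebalancing could look roughly like the sketch below. The column names (`text`, `target`), the file name `train.csv`, and the down-sampling approach are assumptions based on the Kaggle disaster-tweets dataset, not necessarily what the notebooks do.

```python
import re
import pandas as pd

def clean_tweet(text: str) -> str:
    """Strip URLs, mentions, HTML entities and non-letter characters, then lowercase."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove @mentions
    text = re.sub(r"&\w+;", " ", text)                  # remove HTML entities such as &amp;
    text = re.sub(r"[^a-zA-Z\s]", " ", text)            # keep letters and spaces only
    return re.sub(r"\s+", " ", text).strip().lower()

df = pd.read_csv("train.csv")  # assumed file name for the Kaggle training split
df["clean_text"] = df["text"].apply(clean_tweet)

# One possible way to fix the class imbalance: down-sample the majority class
minority = df[df["target"] == 1]
majority = df[df["target"] == 0].sample(len(minority), random_state=42)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=42)
```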
- Data Visualisation (see the sketch below)
  - Prepare and visualise our data using `wordcloud`
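Building on the cleaning sketch above, the word cloud could be produced along these lines (`balanced["clean_text"]` is the assumed output of that step):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join all cleaned tweets into a single string and render a word cloud
all_words = " ".join(balanced["clean_text"])
wc = WordCloud(width=800, height=400, background_color="white").generate(all_words)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```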
- Data Pre-processing (Text Processing) (see the sketch below)
  - Use of `tokenization`
  - Use of `sequencing`
  - Use of `padding`
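A minimal sketch of the tokenization, sequencing and padding steps using Keras' text utilities; the vocabulary size and maximum length below are illustrative placeholders, not the values tuned in the notebooks.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 10000  # illustrative placeholder
MAX_LEN = 40        # illustrative placeholder

# Tokenization: build a word-to-index vocabulary from the cleaned tweets
tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(balanced["clean_text"])

# Sequencing: convert each tweet into a list of integer word indices
sequences = tokenizer.texts_to_sequences(balanced["clean_text"])

# Padding: pad or truncate every sequence to the same length for the network
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding="post", truncating="post")
labels = balanced["target"].values
```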
- Machine Learning (see the sketch below)
  - `Dense Network` using Keras
  - `Long Short-Term Memory (LSTM)` Network
  - `Bi-directional LSTM` Network
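The three architectures could be defined in Keras roughly as below; the layer sizes, epochs and other hyperparameters are placeholders rather than the tuned values from the notebooks.

```python
from tensorflow.keras import Sequential, layers

def build_dense():
    # Dense network: embedding -> global pooling -> fully connected classifier
    return Sequential([
        layers.Embedding(VOCAB_SIZE, 64),
        layers.GlobalAveragePooling1D(),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

def build_lstm():
    # LSTM network: embedding -> LSTM -> classifier
    return Sequential([
        layers.Embedding(VOCAB_SIZE, 64),
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),
    ])

def build_bilstm():
    # Bi-directional LSTM: reads each sequence in both directions
    return Sequential([
        layers.Embedding(VOCAB_SIZE, 64),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(1, activation="sigmoid"),
    ])

model = build_bilstm()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(padded, labels, validation_split=0.2, epochs=5, batch_size=32)
```

The same `compile` and `fit` calls apply to the other two builders, which makes it straightforward to compare all three models on the same training split.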
- Website (see the sketch below)
  - Use of `Streamlit`
  - Deployed on both localhost and a home server for test runs
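A minimal sketch of what the Streamlit front end might look like; the saved model and tokenizer file names (`disaster_model.h5`, `tokenizer.pkl`) are assumptions.

```python
import pickle

import streamlit as st
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the artefacts saved after training (file names are assumptions)
model = load_model("disaster_model.h5")
with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

st.title("Disaster Tweet Classifier")
tweet = st.text_area("Paste a tweet to classify")

if st.button("Classify") and tweet:
    seq = tokenizer.texts_to_sequences([tweet])
    padded = pad_sequences(seq, maxlen=40, padding="post", truncating="post")
    prob = float(model.predict(padded)[0][0])
    st.write(f"Predicted probability of a disaster tweet: {prob:.2f}")
    if prob >= 0.5:
        st.error("Likely a disaster tweet")
    else:
        st.success("Likely not a disaster tweet")
```

Saved as, for example, `app.py`, this runs locally with `streamlit run app.py`, matching the localhost and home-server test deployments described above.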
- https://www.kaggle.com/competitions/nlp-getting-started/data
- https://www.kaggle.com/datasets/phanttan/disastertweets-prepared
- https://towardsdatascience.com/beginners-guide-for-data-cleaning-and-feature-extraction-in-nlp-756f311d8083
- https://stackoverflow.com/questions/25447700/annotate-bars-with-values-on-pandas-bar-plots
- https://www.datacamp.com/community/tutorials/wordcloud-python
- https://towardsdatascience.com/nlp-preparing-text-for-deep-learning-model-using-tensorflow2-461428138657
- https://www.analyticsvidhya.com/blog/2021/06/nlp-sentiment-analysis/
- https://analyticsindiamag.com/complete-guide-to-bidirectional-lstm-with-python-codes/
- https://towardsdatascience.com/nlp-spam-detection-in-sms-text-data-using-deep-learning-b8632db85cc8
- https://ourworldindata.org/natural-disasters
- woonyee28 - Website, Implementation and Setup of Idea
- Baby-McBabyFace - Data Cleaning, Data Visualization, Data Pre-processing
- keenlim - Machine Learning Models, Comparison of Data