Classification and Detection of Disaster Tweets

About

Natural disasters cause an average of 60,000 deaths per year worldwide. When a natural disaster strikes, many witnesses report it in real time on social media platforms such as Twitter or Facebook, and many people turn to social media for news because it is much faster than traditional media. Rescue operators therefore need to respond quickly to disasters reported there. However, there is currently no system in place to alert rescue operators to a disaster that is posted on social media.

The goal of this project is to use machine learning to identify tweets that report a disaster (Disaster Tweets).

To achieve this goal, we need to:

  • Find a suitable dataset
  • Clean the dataset
  • Find a suitable model for training
  • Implement the idea (through a website)

Full Presentation

Details Link
Presentation Video Click Here
Full Code in Google Colab Click Here
Website Click Here

For a detailed walkthrough, please view the source code in the following order:

  1. Data Extraction and Data Cleaning
  2. Data Visualization and Data Pre-processing
  3. Dense Network, LSTM and Bi-LSTM
  4. Comparison and Other Methods
  5. Validation of the Model
  6. Website

Dataset used

We used this dataset provided by Kaggle for our project.

Models used

  • Dense Network
  • Long Short-Term Memory (LSTM) Network
  • Bi-directional LSTM Network
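
The three architectures differ mainly in how they summarise the embedded token sequence before the final classification layer. The following is a minimal sketch, not the project's actual code: it assumes TensorFlow/Keras is installed, and `VOCAB_SIZE`, `MAX_LEN`, `EMBED_DIM`, the layer widths, and the helper `build_model` are hypothetical values and names chosen for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical hyperparameters, chosen only for illustration.
VOCAB_SIZE = 10_000  # number of distinct token ids
MAX_LEN = 50         # padded sequence length
EMBED_DIM = 16       # embedding vector size


def build_model(kind):
    """Build a binary tweet classifier of the given kind:
    'dense', 'lstm', or 'bilstm'."""
    if kind == "dense":
        # Average the token embeddings, then apply a small dense layer.
        encoder = [
            layers.GlobalAveragePooling1D(),
            layers.Dense(24, activation="relu"),
        ]
    elif kind == "lstm":
        # Read the sequence left to right.
        encoder = [layers.LSTM(32)]
    elif kind == "bilstm":
        # Read the sequence in both directions and concatenate the results.
        encoder = [layers.Bidirectional(layers.LSTM(32))]
    else:
        raise ValueError(f"unknown model kind: {kind!r}")

    model = keras.Sequential(
        [
            keras.Input(shape=(MAX_LEN,)),
            layers.Embedding(VOCAB_SIZE, EMBED_DIM),
            *encoder,
            # Single sigmoid unit: probability that the tweet is a disaster tweet.
            layers.Dense(1, activation="sigmoid"),
        ]
    )
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_model("dense").summary()` prints the layer stack. Keeping the embedding and output layers identical across the three variants means that any difference in accuracy comes from the sequence encoder alone.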

Conclusion

  • Among the Dense Network, LSTM, and Bi-directional LSTM models, the Dense Network has the highest accuracy.
  • Overfitting slightly reduces the accuracy of the model.
  • With close to 80% accuracy, our model classified tweets correctly most of the time.
  • However, when we tested it on a new set of data, the accuracy of our model dropped to approximately 50%.
  • While this shows that our model has learned some universal features, the drop in accuracy was not what we expected.
  • The drop in accuracy may be caused by differences between the datasets, as each dataset has its own unique features.
  • This may also suggest that our training dataset is unrepresentative of the larger pool of datasets, which exposes a limitation of our dataset: it does not cover a broad domain.
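
To put an accuracy figure in context, it helps to compare it with the accuracy of always predicting the most common label. On a balanced two-class dataset that baseline is 50%. The sketch below is only an illustration: the helper names `accuracy` and `majority_baseline` and the label lists are made up and are not taken from the project.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    if not y_true:
        raise ValueError("cannot compute a baseline for an empty set")
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Made-up labels for a small, balanced external test set.
external_true = [1, 0, 1, 0, 1, 0, 1, 0]
external_pred = [1, 1, 1, 0, 0, 0, 1, 1]

print("model accuracy:   ", accuracy(external_true, external_pred))
print("majority baseline:", majority_baseline(external_true))
```

Reporting both numbers for every dataset the model is tested on makes it easier to see how much of the accuracy comes from learned features and how much a constant guess would already achieve.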

Future improvement

  • On a larger scale, we would like to try the Bidirectional Encoder Representations from Transformers (BERT) model.
  • Further tune our hyperparameters.
  • Generalise our dataset to prevent bias and unrepresentative data.

Takeaways

  • Data Cleaning
    • Using regex to remove unwanted characters
    • Balancing the imbalanced dataset to prevent inaccurate results
  • Data Visualisation
    • Preparing and visualising our data using wordcloud
  • Data Pre-processing (Text Processing)
    • Use of tokenization
    • Use of sequencing
    • Use of padding
  • Machine learning
    • Dense Network using Keras
    • Long Short-Term Memory (LSTM) Network
    • Bi-directional LSTM Network
  • Website
    • Use of Streamlit
    • Deploying on both localhost and a home server for test runs
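
The cleaning and pre-processing steps listed above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the project's code: the regex patterns, the undersampling strategy, the reserved ids (0 for padding, 1 for unknown words), right-padding, and all function names are hypothetical. Keras provides its own tokenization and padding utilities, which behave differently in some defaults.

```python
import random
import re
from collections import Counter

# Hypothetical patterns; the project's actual regexes are not shown here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, and non-letter characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def undersample(rows, seed=0):
    """Balance (text, label) rows by randomly downsampling the larger classes."""
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append((text, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


def build_vocab(texts, num_words=None):
    """Tokenization: map each word to an integer id, most frequent first.
    Ids 0 and 1 are reserved for padding and unknown words."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(num_words))}


def texts_to_sequences(texts, vocab):
    """Sequencing: replace each word with its id (1 if the word is unknown)."""
    return [[vocab.get(word, 1) for word in text.split()] for text in texts]


def pad_to_length(sequences, max_len):
    """Padding: truncate or right-pad every sequence with 0 to max_len."""
    padded = []
    for seq in sequences:
        seq = seq[:max_len]
        padded.append(seq + [0] * (max_len - len(seq)))
    return padded


# Made-up example tweets, not taken from the dataset.
rows = [
    ("Huge FLOOD in @SomeCity!! http://example.com/x", 1),
    ("I love this sunny weather", 0),
    ("Best pizza in town", 0),
]
cleaned = [(clean_tweet(text), label) for text, label in rows]
balanced = undersample(cleaned)
texts = [text for text, _ in balanced]
vocab = build_vocab(texts)
print(pad_to_length(texts_to_sequences(texts, vocab), max_len=6))
```

Undersampling is only one way to balance classes; oversampling the minority class or using class weights are alternatives when discarding data is too costly.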

Contributors

  • woonyee28 - Website, Implementation and Setup of Idea
  • Baby-McBabyFace - Data Cleaning, Data Visualization, Data Pre-processing
  • keenlim - Machine Learning Models, Comparison of data