Udacity DataScientist Nanodegree
Andrzej Wodecki, 08.2019
The goal of this project is to:
- analyze disaster messages data provided by FigureEight
- create a model for classification of new incoming messages into a set of pre-defined categories
- create a web app that displays key characteristics of the dataset and enables an emergency worker to classify a new message.
There are 3 main components of the project:
- an ETL (Extract, Transform, Load) pipeline stored in a 'data' subfolder
- a modelling component, where the preprocessed data is used to fit and evaluate a final model ('model' subfolder)
- a web app, which displays both the data and the classification engine online ('webapp' subfolder).
The data/process_data.py file is used to:
- load and merge the 'messages' and 'categories' datasets
- perform necessary cleaning and transformations
- store the resulting dataframe in a SQLite database file
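The three ETL steps above can be sketched as follows. This is a minimal illustration, not the actual contents of process_data.py: it assumes both datasets share an `id` column and that each row of `categories` holds semicolon-separated `name-value` pairs such as `water-1;transport-0`. The function names, the `messages` table name and the sample rows are hypothetical.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def clean_data(messages, categories):
    """Merge both datasets and expand the categories into 0/1 columns."""
    # Assumption: both frames share an 'id' column.
    df = messages.merge(categories, on="id")
    # Assumption: categories look like "water-1;transport-0".
    pairs = df["categories"].str.split(";", expand=True)
    pairs.columns = [pair.rsplit("-", 1)[0] for pair in pairs.iloc[0]]
    for name in pairs.columns:
        # Keep only the trailing number of each "name-value" pair.
        pairs[name] = pairs[name].str.rsplit("-", n=1).str[-1].astype(int)
    df = pd.concat([df.drop(columns="categories"), pairs], axis=1)
    return df.drop_duplicates().reset_index(drop=True)


def save_data(df, database_path, table_name="messages"):
    """Store the dataframe in a SQLite file (table name is hypothetical)."""
    with closing(sqlite3.connect(database_path)) as connection:
        df.to_sql(table_name, connection, index=False, if_exists="replace")
        connection.commit()


# Tiny made-up sample standing in for the two csv files.
messages = pd.DataFrame({
    "id": [1, 2, 2],
    "message": ["We need water", "Roads are blocked", "Roads are blocked"],
})
categories = pd.DataFrame({
    "id": [1, 2],
    "categories": ["water-1;transport-0", "water-0;transport-1"],
})

df = clean_data(messages, categories)
print(df)
```

Calling `save_data(df, "disaster.db")` would then write the cleaned table to a SQLite file that the modelling step can read.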
The model/train_classifier.py file is the heart of the solution. The machine learning pipeline implemented there:
- Loads data from a database
- Splits the data into training and test datasets
- Fits the model (applying GridSearchCV)
- Evaluates the final model
- Exports it as a pickle file.
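A rough sketch of such a pipeline is shown below. Only the use of GridSearchCV and the pickle export come from the description above; the TF-IDF features, the random forest wrapped in `MultiOutputClassifier`, the parameter grid and the toy messages are assumptions made for illustration, and the database loading step is replaced by in-memory data.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up data standing in for the table loaded from the database.
category_names = ["water", "medical"]
texts = [
    "we need drinking water", "people are injured and need a doctor",
    "no clean water in the village", "send medical supplies and doctors",
    "water supply is contaminated", "the hospital needs medicine",
    "children need water and medicine", "the bridge has collapsed",
    "we are thirsty, please bring water", "injured people need medical help",
    "water tanks are empty", "doctors and medicine needed urgently",
]
labels = np.array([
    [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1],
    [1, 1], [0, 0], [1, 0], [0, 1], [1, 0], [0, 1],
])


def build_model():
    """Text pipeline wrapped in a small grid search (grid is illustrative)."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
    ])
    param_grid = {"clf__estimator__n_estimators": [10, 20]}
    return GridSearchCV(pipeline, param_grid, cv=2)


def export_model(model, path):
    """Write the fitted model to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


# Split, fit with grid search, then evaluate on the held-out messages.
X_train, X_test, Y_train, Y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)
model = build_model()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

for i, name in enumerate(category_names):
    accuracy = (Y_pred[:, i] == Y_test[:, i]).mean()
    print(f"{name}: accuracy {accuracy:.2f}")
```

Calling `export_model(model.best_estimator_, "model.pkl")` would save the refitted best pipeline. With so few messages the printed accuracies say nothing about real performance; the point is only the order of the steps.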
The final component, the web app, uses Flask to generate a website that enables an emergency worker to classify a new message. It is stored in the webapp subfolder and consists of:
- the run.py app, which performs the necessary data operations, generates figures and renders the final website
- two templates stored in the templates subfolder: master.html with the main page and its extension (go.html), which displays the classification results for a new message.
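The interplay between the main page and the results page could look roughly like the sketch below. It is not the actual run.py: the inline template strings stand in for master.html and go.html, the `/go` route and `query` parameter are assumed names, and `classify` is a hypothetical keyword lookup used in place of the pickled model so that the example runs on its own.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Hypothetical stand-in for the trained model: simple keyword lookup.
CATEGORY_KEYWORDS = {
    "water": {"water", "thirsty"},
    "medical": {"doctor", "medicine", "injured"},
}

# Inline stand-ins for the master.html and go.html templates.
MASTER_PAGE = """
<form action="/go" method="get">
  <input type="text" name="query" placeholder="Enter a message">
  <button type="submit">Classify Message</button>
</form>
"""
GO_PAGE = """
<p>Message: {{ query }}</p>
<ul>
{% for category, flag in result.items() %}<li>{{ category }}: {{ flag }}</li>
{% endfor %}</ul>
"""


def classify(message):
    """Return a 0/1 flag per category (the real app would call the model)."""
    words = set(message.lower().split())
    return {
        category: int(bool(words & keywords))
        for category, keywords in CATEGORY_KEYWORDS.items()
    }


@app.route("/")
def index():
    # Main page with the message input form.
    return render_template_string(MASTER_PAGE)


@app.route("/go")
def go():
    # Results page: classify the submitted message and list the flags.
    query = request.args.get("query", "")
    return render_template_string(GO_PAGE, query=query, result=classify(query))


# To serve the pages locally one could call, for example:
# app.run(host="0.0.0.0", port=3001)
```

In the real app the `classify` helper would be replaced by a call to the model loaded from the pickle file, and the main page would also show the figures generated from the data.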
To run the app:
- Run the ETL pipeline:
  - go to the data folder
  - type
    python process_data.py disaster_messages.csv disaster_categories.csv disaster.db
    to run process_data.py, which reads in the csv files and stores the result in the disaster.db SQLite file.
- Run the ML (Machine Learning) pipeline:
  - go to the model folder
  - type
    python train_classifier.py ../data/disaster.db model.pkl
    to execute the ML pipeline, which takes disaster.db as input and stores the final model in the model.pkl file (pickle).
- Finally, run the web app:
  - go to the webapp folder
  - run
    python run.py
    and follow the on-screen instructions (just open http://0.0.0.0:3001 in your browser).
You will need:
- Flask==1.0.2
- nltk==3.4
- numpy==1.15.4
- pandas==0.22.0
- plotly==3.4.2
- scikit-learn==0.20.1
- SQLAlchemy==1.2.14
- Udacity.com: for a great idea for the project, and a 'starter' pack (useful scripts)
- FigureEight.com for very good datasets.