This code creates a machine learning pipeline that can be used to classify tweets sent during an emergency so that help can be sent from an appropriate agency. The project also includes a website where individuals can input new messages and get classification results in several categories.
- Editor: VSCode
- Python Version: 3.12.0
- General Purpose: numpy, pandas
- Data Manipulation: SQLAlchemy
- Data Visualization: matplotlib, plotly
- Natural Language Processing: nltk
- NLTK Resources: punkt, averaged_perceptron_tagger, maxent_ne_chunker, wordnet
- Machine Learning: scikit-learn, joblib
- Web App: Flask, Bootstrap
Note: If you're using a virtual environment, please make sure it's activated before you run these commands.
To set up the database and machine learning model, run the following commands:

- To run the ETL pipeline that cleans the data and stores it in the database:

  `python data/process_data.py data/01_raw/disaster_messages.csv data/01_raw/disaster_categories.csv data/02_stg/stg_disaster_response.db`

- To train the classifier on the base parameters and save the resulting model:

  `python models/train_classifier.py data/02_stg/stg_disaster_response.db models/classifier.pkl`
The script will then issue the following prompts; respond "yes", "no", or "exit":
1. Whether to retrain the base model. If yes, the script loads the base parameters, builds a model with them, trains it, evaluates it, and saves it to a pickle file.
2. Whether to estimate the grid search runtime. If yes, the script loads the grid search parameters and runs a grid search on a small subset of the data to estimate the runtime.
3. Whether to run a full grid search. If yes, the script runs the grid search, saves the results, and saves the best parameters it finds.
4. Whether to retrain the model using the optimized parameters found by the grid search. If yes, the script loads the optimized parameters, builds a model with them, trains it, evaluates it, and saves it to a pickle file.
WARNING: If you're running the pipeline locally, this might take a few minutes. The script will use n-1 cores.
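The prompt flow described above boils down to a small yes/no/exit helper; a sketch (hypothetical, the actual script's wording may differ):

```python
def ask(prompt: str) -> bool:
    """Ask a yes/no question on stdin; 'exit' aborts the script."""
    while True:
        answer = input(f"{prompt} (yes/no/exit): ").strip().lower()
        if answer == "exit":
            raise SystemExit(0)
        if answer in ("yes", "no"):
            return answer == "yes"
        print("Please answer 'yes', 'no' or 'exit'.")
```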
To run the Flask app:

- Go to the `app` directory: `cd app`
- Run the web app: `python run.py`
- Open http://127.0.0.1:3000 (or the equivalent address shown in the console) in your browser to view the app

Note: The first address is localhost and is restricted to your local machine. The second address is the network address of the server, which can be accessed from any machine on your local network.
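For orientation, `run.py` presumably boils down to something like this sketch (the route body here is a placeholder; the real app renders templates and loads the trained model):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # Placeholder: the real app renders an HTML template with plotly charts
    return "Disaster Response Classifier"

if __name__ == "__main__":
    # host="0.0.0.0" binds the network address mentioned above
    # as well as localhost; port 3000 matches the URL in the steps
    app.run(host="0.0.0.0", port=3000, debug=True)
```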
The model was built on a combination of the following two data sets:
- disaster_messages.csv
  - Contains messages sent during the disaster. Each message is labeled with one or more disaster-related categories, such as "water", "food", and "medical help".
  - Messages can be in a variety of languages. The 'original' column holds the source text, predominately in Haitian Creole; the corresponding English translation is in the 'message' column.
  - Messages are classified into one of three genres: direct, news, and social.
- disaster_categories.csv
  - Contains the corresponding categories for each message in the disaster_messages dataset. Each category is represented by a binary value (0 or 1), indicating whether the message belongs to that category or not.
  - The 'related' column indicates if the message is related to the disaster or not. In the raw data, there are three possible values: 1 (related), 0 (not related), and 2 (ambiguous). The ambiguous messages have been dropped from the training set.
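The cleaning described above (merge the two files on their shared id, expand the category string into binary columns, and drop the ambiguous `related == 2` rows) can be sketched as a small pandas function. The column names follow the raw files, but the function itself is illustrative, not the project's exact ETL code:

```python
import pandas as pd

def clean_data(messages: pd.DataFrame, categories: pd.DataFrame) -> pd.DataFrame:
    """Merge the raw files and expand 'categories' into binary columns."""
    df = messages.merge(categories, on="id")
    # 'categories' holds strings like "related-1;water-0;...;direct_report-0"
    cats = df["categories"].str.split(";", expand=True)
    cats.columns = [value.split("-")[0] for value in cats.iloc[0]]
    for col in cats.columns:
        cats[col] = cats[col].str[-1].astype(int)
    df = pd.concat([df.drop(columns="categories"), cats], axis=1)
    # Drop ambiguous rows (related == 2) and exact duplicates
    return df[df["related"] != 2].drop_duplicates()
```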
The model is designed as a machine learning pipeline that processes text and classifies it into any of the 36 categories in the dataset. The pipeline consists of three main steps:
1. Text Processing: The text data is first processed with a custom `tokenize` function built on nltk. It normalizes case, tokenizes, and lemmatizes the text, and also handles URL detection and replacement, punctuation removal, and stop-word removal.
2. Vectorization and TF-IDF Transformation: The processed text is vectorized with `CountVectorizer` using the custom tokenizer. After vectorization, a TF-IDF transformation is applied to the vectorized data.
3. Multi-output Classification: The transformed data is classified with a multi-output `RandomForestClassifier`.
The trained model is saved to a pickle file for future use.
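The three steps above map naturally onto a scikit-learn `Pipeline`; a sketch (the tokenizer argument is left as scikit-learn's default here, whereas the project wires in its custom `tokenize` function):

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

def build_model(tokenizer=None) -> Pipeline:
    return Pipeline([
        ("vect", CountVectorizer(tokenizer=tokenizer)),
        ("tfidf", TfidfTransformer()),
        # MultiOutputClassifier fits one forest per category column;
        # n_jobs=-2 leaves one core free, matching the "n-1 cores" note
        ("clf", MultiOutputClassifier(
            RandomForestClassifier(n_estimators=100, n_jobs=-2))),
    ])

# joblib handles large numpy-backed models better than raw pickle:
# joblib.dump(build_model(), "models/classifier.pkl")
```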
Here are the median values for the original model:
| output_class | precision | recall | f1-score |
|---|---|---|---|
| 0 | 96 | 100 | 98 |
| 1 | 75 | 8 | 14 |
| macro avg | 85 | 54 | 57 |
| weighted avg | 96 | 96 | 95 |
I used GridSearchCV to tune the model for accuracy, and tested the following parameters:
| Parameter | Values |
|---|---|
| `vect__ngram_range` | ((1, 1), (1, 2)) |
| `clf__estimator__n_estimators` | [50, 100, 200] |
| `clf__estimator__min_samples_split` | [2, 3, 4] |
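These parameters plug into `GridSearchCV` as a `param_grid`; a sketch (the `cv` and `n_jobs` settings here are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

# Step names prefix the parameter keys: vect__*, clf__estimator__*
param_grid = {
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__estimator__n_estimators": [50, 100, 200],
    "clf__estimator__min_samples_split": [2, 3, 4],
}

# 2 x 3 x 3 = 18 candidates, each fit once per CV fold
cv = GridSearchCV(pipeline, param_grid=param_grid, cv=2, n_jobs=-2, verbose=1)
```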
This process resulted in the following 'optimized' values:
| Parameter | Original Value | Optimized Value |
|---|---|---|
| `vect__ngram_range` | (1, 1) | (1, 2) |
| `clf__estimator__n_estimators` | 100 | 200 |
| `clf__estimator__min_samples_split` | 2 | 2 |
Here are the median values for the optimized model:
| output_class | precision | recall | f1-score |
|---|---|---|---|
| 0 | 96 | 100 | 98 |
| 1 | 78 | 4 | 7 |
| macro avg | 85 | 52 | 53 |
| weighted avg | 95 | 96 | 94 |
Here are the percent changes between the two models:

| output_class | precision | recall | f1-score |
|---|---|---|---|
| 0 | 0.00 | 0.0 | 0.00 |
| 1 | 4.00 | -50.0 | -50.00 |
| macro avg | 0.00 | -3.7 | -7.02 |
| weighted avg | -1.04 | 0.0 | -1.05 |
The data shows that the optimized model increased precision by 4% for relevant tweets but decreased recall and the F1-score by 50%. The optimized model detects positive cases more accurately, but at the expense of finding fewer of them. This is not a trade-off we want to make, because we want to capture as many true requests for help as possible. Furthermore, the recall rates for both models are extremely low, highlighting a major drawback in prioritizing precision at a considerable cost to overall performance.
In addition, changing `vect__ngram_range` from (1, 1) to (1, 2) and `clf__estimator__n_estimators` from 100 to 200 increased training time from approximately one minute to seven minutes (a sevenfold increase in computational time). The optimized model is also 561.85 MB larger than the original model (when neither file is compressed).
The machine learning pipeline developed in this project demonstrates a promising approach to classifying disaster-related messages into 36 categories. However, the model's performance varies across different classes, with some classes achieving high precision at the cost of reduced recall. This trade-off is not ideal for our use case, as we aim to capture as many true positive requests for help as possible.
The model's performance was optimized using GridSearchCV, which significantly increased the computational time. While this improved precision for some classes, it decreased the overall F1-score, which balances precision and recall. This suggests that the model's performance could be further improved.
- Optimize Grid Search for Weighted F1-Score instead of Accuracy: Accuracy is not always the best metric for evaluating a model's performance, especially for imbalanced datasets. Optimizing for the weighted F1-score, which considers both precision and recall, could lead to a more balanced model.
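A sketch of how that could be wired up with `make_scorer` (an illustration, not the project's current code):

```python
from sklearn.metrics import f1_score, make_scorer

# Score grid search candidates by weighted F1 instead of the default accuracy;
# zero_division=0 silences warnings for categories with no predicted positives
weighted_f1 = make_scorer(f1_score, average="weighted", zero_division=0)

# Then pass it to the search, e.g.:
# cv = GridSearchCV(pipeline, param_grid, scoring=weighted_f1)
```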
- Use a Translation API for Consistent Tweet Translations: The dataset contains messages in various languages, and the quality of translations can significantly impact the model's performance. Using a reliable translation API could ensure consistent and accurate translations.
- Consider Class Imbalance: Some classes in the dataset have significantly fewer samples than others, which can bias the model towards the majority classes. Techniques such as oversampling the minority classes or undersampling the majority classes could help address this issue.
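One lightweight alternative to resampling, shown here as a sketch (an illustration, not something the project currently does), is to let each forest reweight classes inversely to their frequency:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# class_weight="balanced" upweights rare positive labels inside each tree,
# without changing the training data itself
clf = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=50, class_weight="balanced")
)
```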
- Feature Engineering: Additional features could be engineered from the text data to potentially improve the model's performance. For example, the length of the message, the number of words, or the presence of certain keywords could be useful features.
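Such features can be bolted onto the existing text pipeline with a `FeatureUnion` and a small custom transformer; the `TextLengthExtractor` below is hypothetical:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: message length in words as one numeric column."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text.split())] for text in X])

# Run the TF-IDF features and the length feature side by side,
# then concatenate them column-wise for the classifier
features = FeatureUnion([
    ("text_pipeline", Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
    ])),
    ("length", TextLengthExtractor()),
])
```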