
Signal Storm: Leveraging Machine Learning to Identify Requests for Help During Natural Disasters

Project Overview

This project builds a machine learning pipeline that classifies tweets sent during an emergency so that requests for help can be routed to the appropriate agency. It also includes a web app where individuals can input a new message and receive classification results across several categories.

Installation and Setup

Codes and Resources Used

  • Editor: VSCode
  • Python Version: 3.12.0

Python Packages Used

  • General Purpose: numpy, pandas
  • Data Manipulation: SQLAlchemy
  • Data Visualization: matplotlib, plotly
  • Natural Language Processing: nltk
  • NLTK Resources: punkt, averaged_perceptron_tagger, maxent_ne_chunker, wordnet
  • Machine Learning: scikit-learn, joblib
  • Web App: Flask, Bootstrap

Instructions

Note: If you're using a virtual environment, please make sure it's activated before you run these commands.

  1. To set up the database and machine learning model, run the following commands:

    • To run the ETL pipeline that cleans the data and stores it in a database: python data/process_data.py data/01_raw/disaster_messages.csv data/01_raw/disaster_categories.csv data/02_stg/stg_disaster_response.db

    • To run the ML pipeline that trains the classifier on the base parameters and saves the resulting model: python models/train_classifier.py data/02_stg/stg_disaster_response.db models/classifier.pkl

      The script will then issue the following prompts; respond "yes", "no" or "exit":

      1. Decide whether to retrain the base model. If the user chooses to retrain, the script loads the base parameters, builds a model using these parameters, trains the model, evaluates it, and saves it to a pickle file.

      2. Decide whether to estimate the grid search runtime. If the user chooses to estimate, the script loads the grid search parameters and runs a grid search on a small subset of the data to estimate the runtime.

      3. Decide whether to run a full grid search. If the user chooses to run the grid search, the script runs the grid search, saves the results, and saves the best parameters found by the grid search.

      4. Decide whether to retrain the model using the optimized parameters found by the grid search. If the user chooses to retrain, the script loads the optimized parameters, builds a model using these parameters, trains the model, evaluates it, and saves it to a pickle file.

      WARNING: If you're running the pipeline locally, this might take a few minutes. The script will use n-1 cores.

  2. To run the Flask app:

  • Go to app directory: cd app
  • Run the web app: python run.py
  • Copy http://127.0.0.1:3000 or the equivalent into your browser to view the app
    • Note: This is the localhost address and is restricted to your local machine. The second address printed by Flask is the network address of your server, which can be accessed from any machine on your local network.
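
For context on the two addresses, here is a minimal, hypothetical sketch of how run.py might start the app so that it is reachable at both the localhost and network addresses. The actual run.py in this repository also loads the database and the trained model and may differ.

```python
# Hypothetical sketch only; the real app/run.py also loads the SQLite database
# and the pickled model to serve classification results and Plotly charts.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Disaster Response app is running."

if __name__ == "__main__":
    # Binding to 0.0.0.0 makes the app available at http://127.0.0.1:3000
    # (localhost only) and at your machine's network address on port 3000.
    app.run(host="0.0.0.0", port=3000, debug=True)
```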

Data

The model was built on a combination of the following two data sets:

  • disaster_messages.csv
    • Contains messages sent during the disaster. Each message is labeled with one or more disaster-related categories, such as "water", "food", "medical help", etc.
    • Messages can be in a variety of languages. The 'original' column holds the source text, which is predominantly Haitian Creole; the corresponding note or English translation is in the 'message' column.
    • Each message is assigned one of three genres: direct, news, and social.
  • disaster_categories.csv
    • Contains the corresponding categories for each message in the disaster_messages dataset. Each category is represented by a binary value (0 or 1), indicating whether the message belongs to that category or not.
    • The 'related' column indicates if the message is related to the disaster or not. In the raw data, there are three possible values: 1 (related), 0 (not related) and 2 (ambiguous). The ambiguous messages have been dropped from the training set.
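
As an illustration of how these two files might be combined and cleaned before training, here is a minimal sketch. It assumes the common layout in which disaster_categories.csv stores all labels in a single semicolon-separated 'categories' column; the actual process_data.py may differ.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical sketch of the ETL cleaning step; paths, column layout and table
# name are assumptions.
messages = pd.read_csv("data/01_raw/disaster_messages.csv")
categories = pd.read_csv("data/01_raw/disaster_categories.csv")
df = messages.merge(categories, on="id")

# Split "related-1;request-0;..." into one binary column per category.
cats = df["categories"].str.split(";", expand=True)
cats.columns = [value.split("-")[0] for value in cats.iloc[0]]
cats = cats.apply(lambda col: col.str.split("-").str[1].astype(int))

# Rebuild the frame, drop ambiguous 'related' rows (value 2) and duplicates.
df = pd.concat([df.drop(columns="categories"), cats], axis=1)
df = df[df["related"] != 2].drop_duplicates()

# Store the cleaned data for the training script.
engine = create_engine("sqlite:///data/02_stg/stg_disaster_response.db")
df.to_sql("stg_disaster_response", engine, index=False, if_exists="replace")
```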

Model Design

The model is designed as a machine learning pipeline that processes text and classifies it across the 36 categories in the dataset. The pipeline consists of three main steps:

  1. Text Processing: The text data is first processed using a custom tokenize function built on the nltk library. This function normalizes the case, tokenizes and lemmatizes the text, and handles URL detection and replacement, punctuation removal, and stop word removal.

  2. Vectorization and TF-IDF Transformation: The processed text is then vectorized using CountVectorizer with the custom tokenizer. After vectorization, a TF-IDF transformation is applied to the vectorized data.

  3. Multi-output Classification: The transformed data is classified using a RandomForestClassifier.

The trained model is saved to a pickle file for future use.
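
A minimal sketch of how these three steps might be assembled with scikit-learn is shown below. The function and parameter names are illustrative assumptions and may not match the actual train_classifier.py.

```python
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

URL_REGEX = r"http[s]?://\S+"

def tokenize(text):
    """Normalize case, replace URLs, strip punctuation, remove stop words, lemmatize."""
    text = re.sub(URL_REGEX, "urlplaceholder", text.lower())
    text = re.sub(r"[^a-z0-9]", " ", text)           # punctuation removal
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))     # assumes the stopwords corpus is downloaded
    return [lemmatizer.lemmatize(tok) for tok in word_tokenize(text) if tok not in stop_words]

# Steps 2 and 3: vectorize -> TF-IDF -> one random forest per output category.
pipeline = Pipeline([
    ("vect", CountVectorizer(tokenizer=tokenize)),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_jobs=-2))),  # n_jobs=-2 ~ n-1 cores
])

# pipeline.fit(X_train, Y_train)
# joblib.dump(pipeline, "models/classifier.pkl")  # saved for the web app
```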

Tuning the Model for Accuracy

Here are the median values for the original model:

| output_class | precision | recall | f1-score |
| --- | --- | --- | --- |
| 0 | 96 | 100 | 98 |
| 1 | 75 | 8 | 14 |
| macro avg | 85 | 54 | 57 |
| weighted avg | 96 | 96 | 95 |
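
These values are medians taken across the per-category classification reports for all 36 outputs. A hedged sketch of how such a summary could be computed is shown here; the actual evaluation code may differ, and category_names, Y_test, and Y_pred are hypothetical names.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Y_test and Y_pred are arrays with one binary column per category.
frames = []
for i, name in enumerate(category_names):
    report = classification_report(Y_test[:, i], Y_pred[:, i], output_dict=True, zero_division=0)
    report.pop("accuracy", None)  # keep only the per-class and averaged rows
    frames.append(pd.DataFrame(report).T.assign(category=name))

summary = pd.concat(frames)
# Median precision/recall/f1 for each row type ("0", "1", "macro avg",
# "weighted avg"), reported above as percentages.
print(summary.groupby(level=0)[["precision", "recall", "f1-score"]].median() * 100)
```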

I used GridSearchCV to tune the model for accuracy, and tested the following parameters:

| Parameter | Values |
| --- | --- |
| vect__ngram_range | ((1, 1), (1, 2)) |
| clf__estimator__n_estimators | [50, 100, 200] |
| clf__estimator__min_samples_split | [2, 3, 4] |
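
For reference, here is a minimal sketch of how this search might be configured over the pipeline sketched earlier; the exact code in train_classifier.py may differ.

```python
from sklearn.model_selection import GridSearchCV

# Keys address pipeline steps by name: "vect" and the estimator inside "clf".
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__n_estimators": [50, 100, 200],
    "clf__estimator__min_samples_split": [2, 3, 4],
}

# n_jobs=-2 leaves one core free, in line with the "n-1 cores" note above.
cv = GridSearchCV(pipeline, param_grid=param_grid, cv=3, n_jobs=-2, verbose=2)
# cv.fit(X_train, Y_train)
# print(cv.best_params_)
```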

This process resulted in the following 'optimized' values:

| Parameter | Original Value | Optimized Value |
| --- | --- | --- |
| vect__ngram_range | (1, 1) | (1, 2) |
| clf__estimator__n_estimators | 100 | 200 |
| clf__estimator__min_samples_split | 2 | 2 |

Here are the median values for the optimized model:

| output_class | precision | recall | f1-score |
| --- | --- | --- | --- |
| 0 | 96 | 100 | 98 |
| 1 | 78 | 4 | 7 |
| macro avg | 85 | 52 | 53 |
| weighted avg | 95 | 96 | 94 |

Here are the percent changes between the two models:

| output_class | precision | recall | f1-score |
| --- | --- | --- | --- |
| 0 | 0.00 | 0.0 | 0.00 |
| 1 | 4.00 | -50.0 | -50.00 |
| macro avg | 0.00 | -3.7 | -7.02 |
| weighted avg | -1.04 | 0.0 | -1.05 |

The data shows that the optimized model increased precision by 4% for relevant tweets, but decreased recall and the f1-score by 50%. This means that the optimized model detects positive cases more accurately, but at the expense of being able to detect all positive cases. This is not a trade-off we want to make, because we want to capture as many true positive requests for help as possible. Furthermore, the recall rates for both models are extremely low, highlighting a major drawback in prioritizing precision at a considerable cost to overall performance.

In addition, changing vect__ngram_range from (1, 1) to (1, 2) and clf__estimator__n_estimators from 100 to 200 increased training time from approximately one minute to seven minutes, roughly a sevenfold increase in computational time. The optimized model is also 561.85 MB larger than the original model (when neither file is compressed).

Conclusion and Recommendations

Conclusion

The machine learning pipeline developed in this project demonstrates a promising approach to classifying disaster-related messages into 36 categories. However, the model's performance varies across different classes, with some classes achieving high precision at the cost of reduced recall. This trade-off is not ideal for our use case, as we aim to capture as many true positive requests for help as possible.

The model's performance was optimized using GridSearchCV, which significantly increased the computational time. While this improved precision for some classes, the F1-score, which balances precision and recall, decreased for others. This suggests that the model's performance could be further improved.

Recommendations

  1. Optimize Grid Search for Weighted F1-Score instead of Accuracy: Accuracy is not always the best metric for evaluating a model's performance, especially on imbalanced datasets. Optimizing for the weighted F1-score, which considers both precision and recall, could lead to a more balanced model (a minimal sketch follows this list).

  2. Use a Translation API for Consistent Tweet Translations: The dataset contains messages in various languages, and the quality of translations can significantly impact the model's performance. Using a reliable translation API could ensure consistent and accurate translations.

  3. Consider Class Imbalance: Some classes in the dataset have significantly fewer samples than others, which can bias the model towards the majority classes. Techniques such as oversampling the minority classes or undersampling the majority classes could help address this issue.

  4. Feature Engineering: Additional features could be engineered from the text data to potentially improve the model's performance. For example, the length of the message, the number of words, or the presence of certain keywords could be useful features.
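
As a minimal sketch of recommendation 1, GridSearchCV accepts a scoring argument, so the search can optimize the weighted F1-score instead of accuracy; pipeline and param_grid below refer to the hypothetical objects from the earlier sketches.

```python
from sklearn.model_selection import GridSearchCV

# "f1_weighted" averages per-class F1-scores weighted by support, which reflects
# performance on imbalanced categories better than plain accuracy.
cv = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    scoring="f1_weighted",
    cv=3,
    n_jobs=-2,
)
# cv.fit(X_train, Y_train)
```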

License

MIT License
