toxic-comments-classifier

This project develops and evaluates deep learning models for detecting toxic comments in online discussions.

Here's a brief summary of what the code does:

1- Data Loading and Exploration: The code starts by loading the training and testing datasets from CSV files. It explores the structure and content of the datasets, including checking the shape and displaying some examples.

2- Text Preprocessing: Text cleaning is performed on the comments, including lowercasing, removing patterns, and removing repeating characters. NLTK is used for tokenization, stopword removal, and lemmatization.

3- Exploratory Data Analysis (EDA): Word clouds are created for toxic and threat comments to visualize common words. Class distribution and positive label distribution in the datasets are analyzed and visualized.

4- Tokenization and Padding: Tokenization is applied to convert text into sequences of numbers. Padding is performed to ensure that all sequences have the same length.

5- Model Building: The code defines and trains several models, including Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, a Bidirectional GRU, and combinations of LSTM with CNN. It experiments with different architectures and configurations, such as varying the filter sizes in the CNN layers and using GloVe embeddings.

6- Model Evaluation: The F1 score is used as the evaluation metric. Training history is plotted to visualize each model's performance over epochs.

7- BERT Model Training: The code loads a BERT model using TensorFlow Hub and builds a classifier on top of it. The BERT-based model is trained and evaluated.

8- Model Comparison: The F1 scores of different models (baseline, GloVe, BERT) are compared and visualized.

9- Model Prediction and Evaluation: The trained models are used to predict labels on the test set, and classification reports are generated.

10- Visualization: The code includes commented-out sections for visualizing the training progress and comparing different models using a bar chart.
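As a rough illustration of steps 2 and 4 above, the sketch below cleans a comment and turns cleaned text into fixed-length integer sequences. It is not the project's code: the function names (`clean_comment`, `build_vocab`, `to_padded_sequences`), the exact regular expressions, and the choice to pad and truncate at the end of each sequence are all assumptions made for the example, and it uses only the standard library instead of NLTK or a deep learning framework's tokenizer.

```python
import re


def clean_comment(text: str) -> str:
    """Lowercase a comment, drop URLs and non-letters, shorten character runs."""
    text = text.lower()
    # Assumed "pattern" to remove: http/https URLs.
    text = re.sub(r"https?://\S+", " ", text)
    # Keep only ASCII letters and whitespace (assumption for this sketch).
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse any character repeated 3+ times down to 2 ("sooooo" -> "soo").
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Normalize whitespace.
    return " ".join(text.split())


def build_vocab(texts):
    """Assign each distinct word an integer id starting at 1 (0 is padding)."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def to_padded_sequences(texts, vocab, max_len):
    """Map words to ids, truncate to max_len, and right-pad with zeros."""
    sequences = []
    for text in texts:
        # Words missing from the vocabulary are skipped in this sketch.
        ids = [vocab[w] for w in text.split() if w in vocab][:max_len]
        sequences.append(ids + [0] * (max_len - len(ids)))
    return sequences


if __name__ == "__main__":
    raw = ["You are SOOOOO wrong!!! see http://x.com", "You see"]
    cleaned = [clean_comment(c) for c in raw]
    vocab = build_vocab(cleaned)
    print(cleaned)
    print(to_padded_sequences(cleaned, vocab, max_len=4))
```

Reserving id 0 for padding keeps padded positions distinguishable from real words, which is what lets an embedding layer or a mask ignore them later. Library tokenizers may pad or truncate at the start of the sequence instead, so the direction chosen here should be checked against whatever utility the project actually uses.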

Conclusion: The code provides a comprehensive approach to text classification, including preprocessing, model building, training, and evaluation.
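To make the evaluation metric concrete, here is a minimal sketch of how an F1 score can be computed from binary predictions. The summary does not say how the score is averaged across labels, so the macro average in `macro_f1` is an assumption, as are both function names; the project itself may rely on a library implementation.

```python
def f1_binary(y_true, y_pred):
    """F1 for one binary label: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(true_rows, pred_rows):
    """Average the per-label F1 over all label columns (assumed averaging)."""
    n_labels = len(true_rows[0])
    scores = [
        f1_binary([row[j] for row in true_rows], [row[j] for row in pred_rows])
        for j in range(n_labels)
    ]
    return sum(scores) / n_labels


if __name__ == "__main__":
    # Two hypothetical labels per comment, four comments.
    y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
    y_pred = [[1, 0], [0, 1], [1, 1], [0, 0]]
    print(macro_f1(y_true, y_pred))
```

F1 is a common choice for this kind of task because toxic labels are usually rare, and plain accuracy would reward a model that predicts "not toxic" for everything.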

Contributors: haashhish
