Jigsaw Toxic 2019 Solution

A solution to the Jigsaw Unintended Bias in Toxicity Classification Kaggle competition.

Fine-tunes BERT and GPT-2 models on the training data with custom weighting schemes and auxiliary target variables.
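A minimal sketch of what such a loss setup can look like (the function name, weight values, and tensor layout are illustrative assumptions, not the exact scheme used in this repo):

```python
import torch
import torch.nn.functional as F

def weighted_loss_with_aux(logits, targets, sample_weights, aux_weight=0.5):
    """Weighted BCE on the main toxicity target plus auxiliary targets.

    logits:  (batch, 1 + n_aux) raw model outputs
    targets: (batch, 1 + n_aux) soft labels in [0, 1]
    sample_weights: (batch,) per-example weights (e.g. up-weighting
        identity-mentioning comments, as in the post-competition schemes)
    aux_weight: hypothetical scaling factor for the auxiliary targets
    """
    # Per-example loss on the main target, scaled by the sample weights
    main_loss = F.binary_cross_entropy_with_logits(
        logits[:, 0], targets[:, 0], reduction="none")
    main_loss = (main_loss * sample_weights).mean()
    # Plain (unweighted) loss on the auxiliary toxicity subtype targets
    aux_loss = F.binary_cross_entropy_with_logits(logits[:, 1:], targets[:, 1:])
    return main_loss + aux_weight * aux_loss
```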

Unfortunately I used a bugged evaluation metric function during the competition, which severely undermined the effort I put into it. I have since fixed the function and incorporated some of the custom weighting schemes shared by top competitors post-competition.

TODO: Try the renamed huggingface/pytorch-transformers (from huggingface/pytorch-pretrained-BERT) package and the new XLNet models.

Requirements

Unfortunately this project does not pin the versions of all its dependencies as thoroughly as my last project, ceshine/imet-collection-2019. However, this time I included a Dockerfile that can replicate a working environment (at least at the time of writing, July 2019).

Some peculiarities specific to this project:

  • pytorch-pretrained-BERT-master.zip is included and should be installed via pip install pytorch-pretrained-BERT-master.zip. This is because the version I used lived on the project's master branch and never made it to PyPI; the latest PyPI version is not compatible with this project.
  • pytorch_helper_bot is included via git subtree to ease the cognitive load on users (it's not on PyPI yet, and I'm not planning to put it there).

Generally speaking, the essential dependencies of this project include (besides the above two):

  • PyTorch >= 1.0
  • NVIDIA/apex (for reducing GPU memory consumption and speeding up training on newer GPUs)
  • pandas

TODO: Write down the specific versions of major dependencies that are proven to work.

Kaggle Training and Predicting Workflow

I used almost exactly the same framework as ceshine/imet-collection-2019. Only this time we don't need a separate validation kernel: the validation scoring function/metric is integrated into the helperbot workflow.
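For reference, the competition's bias metric combines the overall AUC with generalized means (power -5) of three per-identity-subgroup AUCs. A rough, hedged reconstruction (the exact column handling in this repo may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

POWER = -5  # generalized-mean exponent used by the competition

def power_mean(values, p=POWER):
    values = np.asarray(values, dtype=np.float64)
    return np.power(np.mean(np.power(values, p)), 1.0 / p)

def final_metric(y_true, y_pred, subgroup_masks, weight=0.25):
    """Approximate competition metric.

    subgroup_masks: list of boolean arrays marking comments that
        mention each identity subgroup (illustrative input format).
    """
    overall = roc_auc_score(y_true, y_pred)
    sub_aucs, bpsn_aucs, bnsp_aucs = [], [], []
    for mask in subgroup_masks:
        # Subgroup AUC: restricted to identity-mentioning comments
        sub_aucs.append(roc_auc_score(y_true[mask], y_pred[mask]))
        # BPSN AUC: subgroup negatives + background positives
        bpsn = (mask & (y_true == 0)) | (~mask & (y_true == 1))
        bpsn_aucs.append(roc_auc_score(y_true[bpsn], y_pred[bpsn]))
        # BNSP AUC: subgroup positives + background negatives
        bnsp = (mask & (y_true == 1)) | (~mask & (y_true == 0))
        bnsp_aucs.append(roc_auc_score(y_true[bnsp], y_pred[bnsp]))
    return weight * overall + weight * (
        power_mean(sub_aucs) + power_mean(bpsn_aucs) + power_mean(bnsp_aucs))
```

A subtle point worth noting: a bug in any one of these sub-AUC computations silently shifts model selection, which is exactly why a broken metric function is so costly during a competition.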

I used a Kaggle Dataset, toxic-cache, to store the tokenized training data, so the kernel won't need to re-tokenize the whole training set on every single run.
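The caching pattern is simple; a minimal sketch (the path and pickle format are illustrative, not the actual layout of the toxic-cache dataset):

```python
from pathlib import Path
import pickle

def tokenize_with_cache(texts, tokenizer, cache_path="cache/tokens.pkl"):
    """Tokenize once and reuse the result across runs.

    `tokenizer` is any callable mapping a string to a list of tokens/ids.
    """
    cache_file = Path(cache_path)
    if cache_file.exists():
        # Cache hit: skip tokenization entirely
        with cache_file.open("rb") as f:
            return pickle.load(f)
    tokens = [tokenizer(text) for text in texts]
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(tokens, f)
    return tokens
```

On Kaggle, the cache file would live in an attached Dataset instead of a local directory, but the logic is the same.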

Google Colab Training

Example Colab Notebook: the code is cloned directly from this GitHub repo, but the dataset, caches, and model weights live on Google Drive (you need to set that up in your own account).
