kaggle_challange_nlp_disaster_tweets's Introduction

Notebook NLP

Introduction


This notebook shows my approach to classifying disaster tweets for the NLP Twitter Kaggle challenge.

My initial idea was to compare several BERT implementations with each other (BERT Uncased 1024, BERT Uncased 768, DistilBERT) to get a feeling for how well they perform on the task of the Kaggle challenge. To prepare, I studied the BERT paper to understand the basic foundations of the algorithm as well as the tokenizer. This Colab notebook, which visualizes some of these concepts, was a great inspiration and helped a lot in setting up my initial notebook.
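
As a quick illustration of the tokenizer, the snippet below encodes a single example tweet with an uncased WordPiece tokenizer. This is a minimal sketch assuming the HuggingFace transformers library; the notebook itself may use a different tokenizer implementation, and max_length=160 is just an example value.

```python
# Minimal tokenizer sketch (assumption: HuggingFace transformers; not necessarily
# the exact tokenizer implementation used in the notebook).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

example = "Forest fire near La Ronge Sask. Canada"  # illustrative tweet
encoded = tokenizer(example, truncation=True, padding="max_length", max_length=160)

print(encoded["input_ids"][:12])                                   # token ids (padded/truncated)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][:12]))  # corresponding WordPiece tokens
```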


This notebook is separated into different sections that follow the common data science workflow (except that the hypothesis and data collection were already done):

  1. Setup and Check Infrastructure
  2. Having a first look at the Data (EDA)
  3. Helper Functions for Data Cleaning (see the sketch after this list)
  4. Data Cleaning (Feature Engineering)
  5. Pre-trained BERT Uncased model 1024 (& 768) (Model building 1/2)
  6. Pre-trained DistilBERT model (Model building 2/2)
  7. Showing Confusion Matrices for BERT models (to compare / evaluate)
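
To make section 3 a bit more concrete, here is a minimal sketch of the kind of regex-based cleaning helper meant there; the function name and the exact rules are illustrative assumptions, not the notebook's actual implementation.

```python
# Illustrative cleaning helper (assumption: simple regex-based cleaning;
# the real helpers in the notebook may keep or drop different things).
import re

def clean_tweet(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove links
    text = re.sub(r"<.*?>", "", text)                  # remove HTML tags
    text = re.sub(r"[^a-z0-9\s#@']", " ", text)        # drop other special characters
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace

print(clean_tweet("Forest fire near La Ronge Sask. Canada http://t.co/example"))
```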

Approach:

I started with common values mentioned in different notebooks on Kaggle (see the inspiration in the first section) to get a feeling for what influences what and how it generally works. After gaining some experience I combined several approaches from those notebooks with general information from the internet. The most promising run was a relatively early one with a Kaggle score of 84.247% (userID: AlexS2020, rank: 116 as of 14.09.2020). It used a fairly simple notebook running on Kaggle with the parameters listed in section 5. I was not able to beat this score afterwards, although I tried many different variations and optimizations (expanded abbreviations, cleaned only partially, left links in the tweets, kept emojis, ...).
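
For context, the sketch below shows what such a fine-tuning setup typically looks like in Keras. It assumes the HuggingFace transformers TFBertModel, batch_size=16 and max_len=160 (as implied by the Outlook section), and a learning rate of 2e-5 as a common default; these are illustrative assumptions rather than the exact parameters listed in section 5.

```python
# Hedged sketch of a BERT fine-tuning setup for binary tweet classification.
import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 160  # assumption based on the Outlook section

def build_model():
    input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

    bert = TFBertModel.from_pretrained("bert-base-uncased")
    sequence_output = bert(input_ids, attention_mask=attention_mask)[0]

    cls_token = sequence_output[:, 0, :]  # [CLS] representation
    output = tf.keras.layers.Dense(1, activation="sigmoid")(cls_token)

    model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # common default, not tuned
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
# model.fit([train_ids, train_mask], train_labels, batch_size=16, epochs=3, validation_split=0.2)
```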

Evaluation:

As will be mentioned in the last section, I compared the val_acc and loss values of the different models and also uploaded the submission files to get a direct comparison via the Kaggle score. The last evaluation step is the confusion matrices, which show how many false positives/negatives and true positives/negatives each model produces.
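
A minimal sketch of that last evaluation step, assuming scikit-learn and matplotlib (the arrays are illustrative placeholders for the real validation labels and predictions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Placeholder values; in the notebook these come from model.predict on the validation set.
val_labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
val_probs  = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])
val_preds  = (val_probs > 0.5).astype(int)

cm = confusion_matrix(val_labels, val_preds)  # rows: true class, columns: predicted class
print(cm)                                     # [[TN, FP], [FN, TP]]
ConfusionMatrixDisplay(cm, display_labels=["no disaster", "disaster"]).plot()
plt.show()
```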

Conclusion:

Overall it was an interesting project / competition that enabled me to learn new things and try out several tools. It is interesting to see that the lightweight DistilBERT model is quite competitive with the much larger BERT 1024 model and that the scores do not differ that much. This gets even more interesting when comparing runtimes: in an equal setup the 1024 model needs 1458 seconds per epoch, while DistilBERT needs just 248 seconds (with almost the same results after one epoch, see the confusion matrices)!

Last learning for today: DistilBERT tends to overfit much faster than the BERT 1024 model when using higher learning rates and more epochs.

Outlook:

From my point of view there are several ways to improve these scores. Some of them can already be seen in other notebooks (especially the larger ones).

  1. Run the notebook on larger hardware to play around with larger batch sizes (> 16) and a larger len_max (> 160).
  2. Adding dropout to fight the overfitting (see the sketch after this list)
  3. Using different models behind each other, as mentioned in the BERT paper, to optimize the output
  4. Modifying the uncased BERT model with more/different layers and playing around with different activation functions, see for example: https://www.kaggle.com/sokolheavy/multi-dropout-aug-oof
  5. Writing a script on a local machine that runs different learning rates and other parameters so that different optima can be found.
  6. An adaptive learning rate could also help with optimizing the results
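
For points 2 and 6, a minimal Keras sketch (the dropout rate and scheduler settings are illustrative assumptions, not tuned values):

```python
import tensorflow as tf

# Point 2: dropout between the [CLS] representation and the classifier head.
classifier_head = tf.keras.Sequential([
    tf.keras.layers.Dropout(0.3),                    # illustrative rate, not tuned
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Point 6: reduce the learning rate once the validation loss stops improving.
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=1, min_lr=1e-6
)
# model.fit(..., callbacks=[lr_schedule])
```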

Limitations:

  • Some of the points mentioned in the Outlook were not feasible, since running code in the browser with only 36 h of GPU time per week is hardly enough
  • Automating the parameter search was not feasible
  • Getting out-of-memory (OOM) errors quite often due to limited GPU memory (it is quite generous for a free tier, but still...)
