
BERT for long sentence classification

BERT cannot process tokenized sequences of text longer than 512 word pieces; longer inputs have to be truncated.

For a corpus like 20 Newsgroups, which contains many long examples, this is a problem.
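To make the limit concrete, here is a minimal sketch of what standard truncation throws away (the pre-tokenized document and `MAX_LEN` are illustrative stand-ins, not code from this repository):

```python
# Assumption: a document already split into word pieces; MAX_LEN mirrors
# BERT's 512-piece input limit.
MAX_LEN = 512

def truncate(word_pieces):
    """What a plain BERT pipeline does: keep only the first MAX_LEN pieces."""
    return word_pieces[:MAX_LEN]

doc = [f"piece_{i}" for i in range(1500)]  # a long 20 Newsgroups-style post
kept = truncate(doc)
print(len(kept))              # 512 pieces survive
print(len(doc) - len(kept))   # 988 pieces are silently discarded
```

Everything past the cut is invisible to the classifier, which is exactly what RoBERT avoids.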

Recurrences Over BERT (RoBERT)

In this project, we implemented the approach proposed in the paper Hierarchical Transformers for Long Document Classification.

RoBERT can process tokenized text sequences of any length:

  1. Splits the text sequence into segments of N tokens.
  2. Tokenizes each segment.
  3. Processes each segment with BERT.
  4. Places the representation BERT produces for each segment sequentially in a tensor.
  5. Processes that tensor with an LSTM.
  6. Uses the representation from the last time step of the LSTM for classification.
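The steps above can be sketched as follows. This is a hedged illustration, not the repository's actual code: the per-segment BERT vectors are stand-ins (random tensors in place of real BERT output), and names such as `SEG_LEN`, `BERT_DIM`, and `RoBERTHead` are invented for the example.

```python
# Minimal RoBERT-style sketch (assumptions noted above).
import torch
import torch.nn as nn

SEG_LEN = 128      # N tokens per segment (step 1)
BERT_DIM = 768     # size of BERT's pooled representation (steps 3-4)
NUM_CLASSES = 20   # the 20 Newsgroups categories

def segment(token_ids, n=SEG_LEN):
    """Step 1: split a long token-id sequence into segments of n tokens."""
    return [token_ids[i:i + n] for i in range(0, len(token_ids), n)]

class RoBERTHead(nn.Module):
    """Steps 5-6: an LSTM over the per-segment vectors, classifying
    from the last time step."""
    def __init__(self, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(BERT_DIM, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, NUM_CLASSES)

    def forward(self, segment_reprs):  # (batch, num_segments, BERT_DIM)
        _, (h_n, _) = self.lstm(segment_reprs)
        return self.classifier(h_n[-1])  # logits from the last time step

# Toy run: a 1000-token document yields 8 segments of at most 128 tokens.
doc_ids = list(range(1000))
segments = segment(doc_ids)
print(len(segments))  # 8

# Stand-in for steps 3-4: pretend BERT already encoded each segment.
seg_reprs = torch.randn(1, len(segments), BERT_DIM)
logits = RoBERTHead()(seg_reprs)
print(logits.shape)  # torch.Size([1, 20])
```

Classifying from the last LSTM time step lets the recurrence summarize the whole document regardless of how many segments it produced.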

Dependencies

  • Python 3.7
  • The following packages (installable with pip):

pip install pandas cleantext scikit-learn torch transformers matplotlib mlxtend seaborn Unidecode nltk

Usage

The two commands below take the argument True to download the 20 Newsgroups corpus (this is only necessary on the first execution of each script).

The first script uses BERT for sequence classification (BERTSC), which therefore truncates the sentences.

./launch-experiments-20newsgroups.sh True

The second script uses RoBERT.

./launch-hierarchical-experiments-20newsgroups.sh True

Results

The accuracy reported in the reference paper for the 20 Newsgroups corpus using RoBERT on the full dataset is 84 %. In our case, hardware limitations prevented us from fitting the fully segmented corpus on the GPU, so we experimented with a reduced version where the maximum number of tokens per example was 512 and 1024.

The table below shows the results obtained. For the same maximum length, BERTSC performed better than RoBERT. Note, however, that we wrote our own implementation of RoBERT, which does not follow the same optimization approach as the paper; even so, the LSTM evidently degrades performance slightly.

| Max. length | Model  | Accuracy |
|-------------|--------|----------|
| 1024        | RoBERT | 77 %     |
| 512         | BERTSC | 79 %     |
| 512         | RoBERT | 75 %     |
