pierreoutin / eutopia_task

Text classification with BERT Tokenizer

eutopia_task's Introduction

Why BERT Tokenizer :

BERT : Bidirectional Encoder Representations from Transformers
BERT predicts a token by paying "attention" to every other token in the sequence.
We use a pre-trained model with transfer learning; it was trained on a large pool of documents.
The BERT tokenizer gives good results on text classification.

Dataset :

crunchbase_ID : Id of each website
home_text : home page text
aboutus_text : about us page text
overview_text : overview page text
whatwedo_text : what we do page text
company_text : company page text
whoweare_text : who we are page text
AI : 0 or 1, absence or presence of AI in website text

Data Cleaning :

Fill all NA values in every text column except home_text with an empty string
Merge all text columns into a single one
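
The two cleaning steps above can be sketched in plain Python. This is only an illustration, not the notebook's code: the helper `clean_row`, the merged field name `text`, the use of `None` for NA values and the single-space separator are all assumptions. The source does not say how a missing home_text is handled, so the sketch leaves that column untouched and simply skips it when merging if it is empty.

```python
# Column names come from the dataset description; everything else is hypothetical.
TEXT_COLUMNS = [
    "home_text", "aboutus_text", "overview_text",
    "whatwedo_text", "company_text", "whoweare_text",
]

def clean_row(row):
    """Return a copy of row with missing optional text filled and a merged 'text' field."""
    cleaned = dict(row)
    for col in TEXT_COLUMNS:
        # Fill NA (represented here as None) everywhere except home_text.
        if col != "home_text" and cleaned.get(col) is None:
            cleaned[col] = ""
    # Merge the non-empty pieces into one string, separated by a space (assumption).
    parts = [cleaned[col] for col in TEXT_COLUMNS if cleaned[col]]
    cleaned["text"] = " ".join(parts)
    return cleaned

example = {
    "crunchbase_ID": 1,
    "home_text": "We build AI tools",
    "aboutus_text": None,
    "overview_text": "Founded in 2015",
    "whatwedo_text": None,
    "company_text": None,
    "whoweare_text": "A small team",
    "AI": 1,
}
merged = clean_row(example)
print(merged["text"])  # We build AI tools Founded in 2015 A small team
```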

Data preparation :

Import stopwords and lemmatizer from nltk
Split each text into a list of words
Remove stopwords and punctuation from the list of words
Remove digits
Lemmatize each word, reducing the different forms of a word to a single form
Remove all words that appear only once
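
A minimal sketch of these preparation steps, under several assumptions: the stopword set below is a tiny stand-in for the nltk list, the lemmatizer is passed in as a function (the default does nothing; nltk's `WordNetLemmatizer().lemmatize` could be supplied instead), text is lowercased before matching, any word containing a digit is dropped, and "appears only once" is counted across the whole corpus. The function name `prepare` is hypothetical.

```python
import string
from collections import Counter

# Tiny stand-in stopword list (assumption); the notebook imports the real one from nltk.
STOPWORDS = {"the", "a", "an", "and", "of", "is", "are", "we", "to", "in"}

def prepare(texts, lemmatize=lambda word: word):
    """Turn raw texts into lists of cleaned words, then drop words seen only once."""
    docs = []
    for text in texts:
        words = []
        for raw in text.lower().split():
            word = raw.strip(string.punctuation)  # remove surrounding punctuation
            if not word or word in STOPWORDS:
                continue
            if any(ch.isdigit() for ch in word):  # drop words containing digits
                continue
            words.append(lemmatize(word))
        docs.append(words)
    # Count every word over the whole corpus and keep those seen at least twice.
    counts = Counter(word for doc in docs for word in doc)
    return [[w for w in doc if counts[w] > 1] for doc in docs]

texts = [
    "We build AI models, and we train models in 2020.",
    "The AI team is small.",
]
print(prepare(texts))  # [['ai', 'models', 'models'], ['ai']]
```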

Word Embedding :

Join the words of each list back into a single string
Tokenize each text with the BERT tokenizer
Convert the tokens to IDs
Create a list of lists holding the token IDs, label and length of each text
Sort the data by text length
Convert the sorted dataset into a TensorFlow input dataset
Pad the dataset batch by batch
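
Sorting by length before batching means each batch groups texts of similar length, so little padding is needed. The sketch below shows that idea in plain Python rather than with the TensorFlow dataset used in the notebook. The function name `make_padded_batches`, the padding ID 0, the batch size and the token IDs are all made up for illustration, and examples are stored as (token IDs, label) pairs with the length computed on the fly.

```python
def make_padded_batches(examples, batch_size, pad_id=0):
    """examples: list of (token_ids, label) pairs.

    Returns a list of (padded_ids, labels) batches, where every sequence in a
    batch is padded to the length of the longest sequence in that batch.
    """
    ordered = sorted(examples, key=lambda ex: len(ex[0]))  # shortest texts first
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        longest = max(len(ids) for ids, _ in chunk)
        padded = [ids + [pad_id] * (longest - len(ids)) for ids, _ in chunk]
        labels = [label for _, label in chunk]
        batches.append((padded, labels))
    return batches

# Hypothetical token IDs, not real BERT vocabulary entries.
examples = [
    ([12, 7, 8, 9, 3], 1),
    ([12, 5, 3], 0),
    ([12, 4, 6, 3], 1),
    ([12, 3], 0),
]
for padded, labels in make_padded_batches(examples, batch_size=2):
    print(padded, labels)
# [[12, 3, 0], [12, 5, 3]] [0, 0]
# [[12, 4, 6, 3, 0], [12, 7, 8, 9, 3]] [1, 1]
```

In TensorFlow, `tf.data.Dataset.padded_batch` offers the same per-batch padding on a dataset object.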

Splitting Data :

We hold out 10% of the data as the test set
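
One possible way to make such a 10% hold-out, shown as a plain Python sketch. The source does not say whether the split is done on individual texts or on batches, nor whether the data is shuffled first; the helper `split_train_test`, the shuffling and the seed are assumptions.

```python
import random

def split_train_test(items, test_fraction=0.1, seed=42):
    """Shuffle a copy of items and return (train, test) with test_fraction held out."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

batches = list(range(50))  # stand-ins for 50 padded batches
train, test = split_train_test(batches)
print(len(train), len(test))  # 45 5
```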

Creating Model :

We initialize the model attributes with default values
We initialize three convolutional layers with their filter values
In the call function, global max pooling is applied to the output of each convolutional layer.
The first densely connected layer takes the concatenation of the three pooled convolutional outputs as input.
The second densely connected layer predicts whether the text contains AI.
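
To make the pooling and concatenation step concrete, here is a toy version in plain Python, not the notebook's TensorFlow model. It uses one filter per convolutional layer and filter widths of 2, 3 and 4; these widths, the ReLU activation, the tiny embeddings and all function names are assumptions. The point it illustrates is that global max pooling turns a variable-length text into one number per filter, so the concatenated vector has a fixed size that the dense layers can consume.

```python
def conv1d(sequence, kernel):
    """Valid 1D convolution of a list of embedding vectors with a single filter.

    sequence: list of T vectors of size D; kernel: list of K vectors of size D.
    Returns T - K + 1 activations with ReLU applied.
    """
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        total = 0.0
        for offset in range(k):
            total += sum(x * w for x, w in zip(sequence[start + offset], kernel[offset]))
        outputs.append(max(0.0, total))  # ReLU
    return outputs

def global_max_pool(activations):
    """Keep only the strongest activation over all positions."""
    return max(activations)

def extract_features(sequence, kernels):
    """Run each filter, pool its output, and concatenate the pooled values."""
    return [global_max_pool(conv1d(sequence, kernel)) for kernel in kernels]

# Five tokens with 2-dimensional embeddings (made-up numbers).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],              # width 2
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # width 3
    [[0.5, 0.5]] * 4,                      # width 4
]
features = extract_features(sequence, kernels)
print(features)  # [2.0, 5.0, 3.0]
```

A real model would use many filters per layer, so the concatenated vector would be correspondingly longer before it reaches the first dense layer.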

Fitting model :

We pass the hyperparameter values defined in the previous step, such as an embedding dimension of 200, to the constructor of the TEXT_MODEL class.
We use the fit method to train the model for 10 epochs.

Results :

Accuracy of 99.98% on the training set
Accuracy of 86.04% on the test set

Conclusion :

We can use BERT Tokenizer to create word embeddings that can be used to perform text classification.
In our case, we performed AI analysis of website text and achieved an accuracy of 86.04% on the test set.
I think this approach can give better results on small or medium-sized texts; for website pages, the BERT tokenizer keeps only the first 512 words/tokens and trims the rest of the text.
