
Sentence Classification

The goal of this project is to classify sentences by type:

  • Statement (Declarative Sentence)
  • Question (Interrogative Sentence)
  • Exclamation (Exclamatory Sentence)
  • Command (Imperative Sentence)

Each of the above broad sentence categories can be subdivided into more fine-grained types. The networks and scripts are designed so that they can be extended to classify additional sentence types, provided suitable training data is available.

This was developed for applications at Metacortex and is accompanied by a guide on building practical/applied neural networks on austingwalters.com.

Please feel free to open PRs to update and improve the project, and use it freely!


To Install

  • Install CUDA and cuDNN if you have a GPU (on your system of choice)
  • Install the requirements (Python 3 only; Python 2.x will not work):
pip3 install -r requirements.txt --user

To execute:

Pretrained model:

python3 sentence_cnn_save.py models/cnn

To build your own model:

python3 sentence_cnn_save.py models/<model name>

The script loads any pretrained model with the given name from models/, or trains (and saves) a new one if none exists.
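
A minimal sketch of that load-or-retrain behavior (illustrative only; build_and_train_model is a hypothetical stand-in for the repo's training code, not its actual internals):

import os
from keras.models import model_from_json

def load_or_train(model_path):
    # Load a saved model if <model_path>.json/.h5 exist; otherwise train one.
    if os.path.exists(model_path + '.json') and os.path.exists(model_path + '.h5'):
        with open(model_path + '.json') as f:
            model = model_from_json(f.read())
        model.load_weights(model_path + '.h5')  # restore the trained weights
    else:
        model = build_and_train_model()  # hypothetical training routine
        with open(model_path + '.json', 'w') as f:
            f.write(model.to_json())
        model.save_weights(model_path + '.h5')
    return model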

See the Supplemental Material section below for the full guide.

Supplemental Material

This repository was created in conjunction with a guide titled Neural Networks to Production, From an Engineer.

Additional, more complex models are available in the advanced_modeling directory; posts covering them should follow eventually.


Dataset

The dataset is created by parsing the SQuAD dataset and combining it with the SPAADIA dataset (a rough sketch of the SQuAD side of this parsing appears after the note below).

The sample counts in the dataset:

  • Command: 1111
  • Statement: 80167
  • Question: 131001

Note: A question here is always a single sentence, while a statement may span one or more sentences. The labels are correct, but the sentences preceding a question are not included with it.
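
A rough sketch of the SQuAD side of that parsing (SQuAD v1.1 layout assumed; this is not the repository's actual parser, and SPAADIA handling is omitted):

import json

with open('train-v1.1.json') as f:
    squad = json.load(f)

questions, statements = [], []
for article in squad['data']:
    for paragraph in article['paragraphs']:
        # Context paragraphs supply declarative text; the repo would still
        # need to split each context into individual sentences.
        statements.append(paragraph['context'])
        for qa in paragraph['qas']:
            questions.append(qa['question'])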

Results

With the above setup, we are able to get the following accuracy:

Model             | Accuracy | Train Speed              | Classification Speed
------------------|----------|--------------------------|--------------------------
Dict              | 85%      | Fastest                  | Fastest
CNN               | 97.80%   | Fast (185 μs/step)       | Very Fast (35 μs/step)
CNN (2-layer)     | 99.33%   | Fast (210 μs/step)       | Very Fast (42 μs/step)
MLP               | 95.5%    | Very Fast (60 μs/step)   | Very Fast (42 μs/step)
FastText (1-gram) | 94.40%   | Fast (83 μs/step)        | Very Fast (26 μs/step)
FastText (2-gram) | 95.59%   | Fast (196 μs/step)       | Very Fast (26 μs/step)
RNN (LSTM)        | 98.49%   | Very Slow (7000 μs/step) | Very Slow (1000 μs/step)
RNN (GRU)         | 99.73%   | Very Slow (2000 μs/step) | Very Slow (1000 μs/step)
CNN + LSTM        | 99.55%   | Very Slow (3000 μs/step) | Very Slow (722 μs/step)
CNN + GRU         | 99.82%   | Very Slow (2000 μs/step) | Very Slow (591 μs/step)
CNN + MLP         | 99.75%   | Slow (1000 μs/step)      | Fast (97 μs/step)

With some hyperparameter tuning:

Model             | Accuracy | Train Speed              | Classification Speed
------------------|----------|--------------------------|--------------------------
Dict              | 85%      | Fastest                  | Fastest
CNN               | 99.40%   | Fast (200 μs/step)       | Very Fast (26 μs/step)
CNN (2-layer)     | 99.33%   | Fast (210 μs/step)       | Very Fast (42 μs/step)
MLP               | 95.5%    | Very Fast (60 μs/step)   | Very Fast (42 μs/step)
FastText (1-gram) | 94.40%   | Fast (117 μs/step)       | Very Fast (26 μs/step)
FastText (2-gram) | 95.59%   | Fast (196 μs/step)       | Very Fast (26 μs/step)
RNN (LSTM)        | 98.49%   | Very Slow (7000 μs/step) | Very Slow (1000 μs/step)
RNN (GRU)         | 99.73%   | Very Slow (2000 μs/step) | Very Slow (1000 μs/step)
CNN + LSTM        | 99.55%   | Very Slow (3000 μs/step) | Very Slow (722 μs/step)
CNN + GRU         | 99.82%   | Very Slow (2000 μs/step) | Very Slow (340 μs/step)
CNN + MLP         | 99.75%   | Slow (1000 μs/step)      | Fast (97 μs/step)

Computer Configuration:

  • GTX 1080
  • 32 GB RAM
  • 8 × 3.6 GHz cores (AMD)
  • Arch Linux, up to date on 12/16/2018

CNN Hyperparameter tuning

Accuracy | Speed      | Batch Size | Embedding Dims | Filters | Kernel | Hidden Dims | Epochs
---------|------------|------------|----------------|---------|--------|-------------|-------
99.40%   | 26 μs/step | 64         | 75             | 100     | 5      | 350         | 7
99.36%   | 40 μs/step | 64         | 50             | 250     | 10     | 150         | 5
99.33%   | 25 μs/step | 64         | 75             | 75      | 5      | 350         | 5
99.31%   | 59 μs/step | 64         | 100            | 350     | 5      | 300         | 3
99.29%   | 25 μs/step | 64         | 50             | 100     | 7      | 350         | 5
99.27%   | 62 μs/step | 32         | 75             | 350     | 5      | 250         | 3
99.25%   | 25 μs/step | 64         | 75             | 100     | 3      | 350         | 5
99.25%   | 25 μs/step | 64         | 50             | 100     | 7      | 250         | 3
99.24%   | 53 μs/step | 64         | 75             | 350     | 10     | 250         | 3
99.23%   | 56 μs/step | 64         | 75             | 350     | 10     | 200         | 3
99.18%   | 36 μs/step | 64         | 50             | 250     | 5      | 300         | 5
99.12%   | 52 μs/step | 64         | 75             | 350     | 5      | 250         | 3
99.11%   | 22 μs/step | 64         | 50             | 75      | 5      | 300         | 4
99.11%   | 26 μs/step | 64         | 50             | 100     | 10     | 250         | 3
99.04%   | 62 μs/step | 32         | 75             | 350     | 5      | 350         | 3
99.00%   | 24 μs/step | 64         | 100            | 50      | 5      | 350         | 3
99.00%   | 52 μs/step | 64         | 75             | 350     | 5      | 350         | 3
99.00%   | 40 μs/step | 64         | 75             | 250     | 5      | 350         | 3
98.86%   | 40 μs/step | 64         | 75             | 250     | 5      | 250         | 3
98.84%   | 50 μs/step | 64         | 50             | 350     | 10     | 150         | 3
98.79%   | 26 μs/step | 64         | 50             | 100     | 10     | 150         | 3
98.76%   | 30 μs/step | 128        | 50             | 200     | 3      | 150         | 3
98.66%   | 31 μs/step | 64         | 50             | 150     | 10     | 150         | 3
98.62%   | 45 μs/step | 128        | 100            | 350     | 3      | 250         | 3
98.17%   | 19 μs/step | 64         | 75             | 50      | 3      | 350         | 6
98.07%   | 34 μs/step | 128        | 75             | 250     | 5      | 250         | 3
98.06%   | 45 μs/step | 64         | 75             | 350     | 3      | 250         | 3
97.53%   | 35 μs/step | 128        | 75             | 250     | 5      | 350         | 3
96.10%   | 32 μs/step | 128        | 75             | 250     | 3      | 350         | 3
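
The tuning code itself is not in the repository, but the search space above suggests a simple grid search. A sketch under that assumption, where train_and_evaluate is a hypothetical stand-in for the repo's training loop:

import itertools

grid = {
    'batch_size':     [32, 64, 128],
    'embedding_dims': [50, 75, 100],
    'filters':        [50, 75, 100, 150, 200, 250, 350],
    'kernel_size':    [3, 5, 7, 10],
    'hidden_dims':    [150, 200, 250, 300, 350],
    'epochs':         [3, 4, 5, 6, 7],
}

# In practice only a subset of combinations was likely run; the full
# product of this grid would be several thousand configurations.
results = []
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    accuracy, step_time = train_and_evaluate(**params)  # hypothetical
    results.append((accuracy, step_time, params))

# Rank configurations by accuracy, highest first
results.sort(key=lambda r: r[0], reverse=True)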

Issues

Issue with word_embeddings generated from encode_phrases

While using the default word_encoding, if a word is not present in the dictionary it is given the encoding 0. I looked at your default_word_encoding.json and found that multiple words have the value 0. If this is intended, how does the model differentiate between new words and seen words that share the same encoded value?
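
A minimal sketch of the behavior being described, assuming a dict-based encoder that defaults unknown words to 0:

# Any word missing from the vocabulary collapses to index 0, so distinct
# unseen words become indistinguishable from one another (and from any
# seen word that was also assigned 0).
word_encoding = {'what': 1, 'is': 2, 'this': 3}
sentence = ['what', 'is', 'foobar']
encoded = [word_encoding.get(w, 0) for w in sentence]  # -> [1, 2, 0]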

data set with (train, val, test) splits

Hey there, is it possible to get the final dataset with (train, val, test) splits? I intend to train a transformer model for classifying questions vs. statements. I can later create a pull request so that model can be integrated into this wonderful repo of classifiers. Thanks!

Hyperparameter tuning code request

Hello,

Thank you for this awesome repo; it has given me a better understanding of different approaches to text classification.
Is it possible to have access to the code used for the hyperparameter tuning?

Thank you.

Error : Inference on pre-trained model

Hi,
I was trying to run inference with the pretrained model (cnn). I successfully loaded your model with the following code:

# load json and create model
from keras.models import model_from_json

json_file = open(model_name + '.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
model = model_from_json(loaded_model_json)

Now, when I tried to test it with the following code:


test_comments, test_comments_category = get_custom_test_comments()

x_test, _, y_test, _ = encode_data(test_comments, test_comments_category,
                                   data_split=1.0,
                                   embedding_name=embedding_name,
                                   add_pos_tags_flag=pos_tags_flag)

x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
y_test = keras.utils.to_categorical(y_test, num_classes)

score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)

This last line (model.evaluate) resulted in the error:


InvalidArgumentError: indices[13,490] = 22271 is not in [0, 15000)
	 [[Node: embedding_1_9/embedding_lookup = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@dropout_1_9/cond/Switch_1"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1_9/embeddings/read, embedding_1_9/Cast, embedding_1_9/embedding_lookup/axis)]]

I figured out that this is likely because of the word ids contained in x_test: max_words is 15000, but the maximum value in x_test is far greater than 15000, so the embedding lookup fails for words whose id exceeds 15000. As a workaround, I divided all the values of x_test by 100 and converted them to integers, after which it ran successfully.

So, can you please suggest whether I am doing anything wrong, or whether a different word encoding needs to be loaded?
Thanks for the help.
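
A gentler workaround than rescaling (which scrambles word identities) is to remap any out-of-range id to a single in-range slot; a sketch, assuming index 0 serves as the unknown-word bucket:

import numpy as np

max_words = 15000  # size of the pretrained model's embedding table

# Replace any token id outside [0, max_words) with 0 instead of dividing,
# so in-vocabulary words keep their original ids.
x_test = np.where(x_test < max_words, x_test, 0)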

Using pretrained model

Hi, I want to use the pre-trained model to classify my sentences, but I am not that familiar with deep learning.
Here I have some questions:

  1. Is tensorflow==2.4.0 necessary?
  2. I have some sentences stored as .txt files; can I use them as inputs? If not, what should the input be when using the pre-trained model?
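
A sketch of one way to classify sentences from a .txt file, reusing the loading snippet and encode_data call shown in the previous issue (the helper names and argument format are assumptions about the repo, not guarantees):

from keras.preprocessing import sequence

with open('my_sentences.txt') as f:
    sentences = [line.strip() for line in f if line.strip()]

# Dummy labels: encode_data expects a category list, but the values are
# unused when we only want predictions (the label format is assumed).
x, _, _, _ = encode_data(sentences, [0] * len(sentences),
                         data_split=1.0,
                         embedding_name=embedding_name,
                         add_pos_tags_flag=pos_tags_flag)
x = sequence.pad_sequences(x, maxlen=maxlen)

predictions = model.predict(x).argmax(axis=-1)  # one class index per sentence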

Counts printed by gen_test_comments include duplicates

The values printed by gen_test_comments are

-------------------------
command 1672
statement 80993
question 131219
-------------------------

However, the actual values are 1111, 80167, and 131001 for commands, statements, and questions respectively. The running variables such as command_count include duplicate sentences: while the tagged_comments dict takes care of duplicates, the printed counts still include them.
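
A sketch of one possible fix: derive the printed counts from the already-deduplicated tagged_comments dict rather than the running *_count variables (tagged_comments is assumed to map each unique sentence to its label):

from collections import Counter

# Each key in tagged_comments is a unique sentence, so counting its values
# yields duplicate-free totals per label.
for label, count in Counter(tagged_comments.values()).items():
    print(label, count)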
