Question classification

To run this script you need the GloVe pre-trained word vectors in text format in the data folder. For now, the script looks for the file glove.6B.50d.txt. The first line of the file must contain two numbers separated by a space: the number of lines in the file and the number of dimensions of the vectors. The downloaded file does not have this line, so you need to prepend it before running the script. For this task you can use the script:

./prepare-file.sh data/glove.6B.50d.txt 50

The first parameter of the script is the file name and the second is the number of dimensions of the vectors.
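The contents of prepare-file.sh are not shown here, so the following is only a rough Python sketch of the same preparation step. The function name prepend_header is hypothetical, and the sketch assumes the count on the header line is the number of vector lines before the header is added.

```python
import os
import tempfile


def prepend_header(path, dims):
    """Prepend '<line count> <dims>' as the first line of a vector file."""
    # Count the vectors, assuming one word and its vector per line.
    with open(path, "r", encoding="utf-8") as src:
        count = sum(1 for _ in src)

    # Write the header and the original content to a temporary file in the
    # same directory, then swap it in, so a failure cannot leave the
    # original file half-written.
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=directory, delete=False
    ) as dst:
        dst.write(f"{count} {dims}\n")
        with open(path, "r", encoding="utf-8") as src:
            for line in src:
                dst.write(line)
    os.replace(dst.name, path)
    return count
```

Calling prepend_header("data/glove.6B.50d.txt", 50) would mirror the shell command above. Running it twice would add a second header, so it should be applied only to a freshly downloaded file.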

To run the tests, type the command:

ipython notebook

and then select the classify-questions.ipynb notebook.

Results

The accuracy on the test data set is around 72%. The question vector is computed as the average of the vectors of the first two words in the question. If the first two words are "what is", the question vector is instead computed as the average over all words. Questions starting with "what" or "what is" are hard to classify because the classifier would need more information about which word in the question is relevant to its type.
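The averaging rule described above can be sketched in a few lines of plain Python. This is an illustration, not the notebook's code: the function names, the whitespace tokenization, and the choice to skip words missing from the embeddings are all assumptions, and the toy vectors are made up.

```python
def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def question_vector(question, embeddings):
    """Average the first two words, or all words for "what is" questions."""
    # Assumption: lowercase whitespace tokenization, trailing "?" stripped.
    tokens = question.lower().rstrip("?").split()
    if tokens[:2] != ["what", "is"]:
        tokens = tokens[:2]
    # Assumption: words missing from the embeddings are skipped.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known words in question")
    return average(vectors)


# Toy 2-dimensional embeddings, made up for illustration only.
toy = {
    "who": [1.0, 0.0],
    "wrote": [0.0, 1.0],
    "hamlet": [4.0, 4.0],
    "what": [2.0, 0.0],
    "is": [0.0, 2.0],
    "gold": [4.0, 4.0],
}

print(question_vector("Who wrote Hamlet?", toy))  # [0.5, 0.5]
print(question_vector("What is gold?", toy))      # [2.0, 2.0]
```

In the first call only "who" and "wrote" contribute, so "hamlet" is ignored. In the second call the "what is" prefix switches to averaging every word.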

Classification using LAT features

With question LAT (lexical answer type) features, the average accuracy with cross-validation is around 82%. The notebook that tests this type of classification is classify-from-features.ipynb. The accuracy on the test data set is around 86.4%.

A classifier combining the sparse feature vector, four word vectors (first word, second word, support verb, average over LAT features), and a support verb presence flag reached 89.8% accuracy.
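One plausible way to combine these inputs is simple concatenation. The sketch below assumes that approach. The function name combine_features, the ordering of the parts, and the zero-vector encoding of a missing support verb are hypothetical and are not taken from the notebook.

```python
def combine_features(sparse, first, second, verb, lat_avg, dims):
    """Concatenate sparse features, four word vectors, and a verb flag."""
    # Assumption: a question without a support verb passes verb=None; it is
    # encoded as a zero vector and the presence flag is set to 0.0.
    has_verb = verb is not None
    verb_vec = list(verb) if has_verb else [0.0] * dims
    parts = [list(first), list(second), verb_vec, list(lat_avg)]
    for vec in parts:
        if len(vec) != dims:
            raise ValueError("word vector has wrong dimension")
    combined = list(sparse)
    for vec in parts:
        combined.extend(vec)
    combined.append(1.0 if has_verb else 0.0)
    return combined
```

With 50-dimensional word vectors the result has len(sparse) + 4 * 50 + 1 entries, and the flag lets the classifier tell a genuine zero vector from a missing verb.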

Fine class labels

Training the question classifier with fine labels instead of coarse ones resulted in 75% accuracy on the test data set. Combining the sparse feature vector with the four word vectors resulted in 80.4% accuracy.
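The difference between the two label sets can be illustrated with a small parser. It assumes each line of the question data set starts with a label of the form COARSE:fine followed by the question text. The function names are hypothetical and the sample line is only an illustration of that format.

```python
def parse_line(line):
    """Split a 'COARSE:fine question text' line into its three parts."""
    label, _, question = line.strip().partition(" ")
    coarse, _, fine = label.partition(":")
    return coarse, fine, question


def target_label(line, fine_grained=False):
    """Return the label a classifier would be trained on."""
    coarse, fine, _ = parse_line(line)
    # Fine labels keep the coarse prefix so they stay unique across classes.
    return f"{coarse}:{fine}" if fine_grained else coarse


sample = "NUM:date When did Hawaii become a state ?"
print(target_label(sample))                     # NUM
print(target_label(sample, fine_grained=True))  # NUM:date
```

Training on fine labels means many more classes with fewer examples each, which is consistent with the lower accuracy reported above.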

Datasets

The pretrained word vectors can be downloaded from this site:

http://nlp.stanford.edu/projects/glove/

In this project, we use the 50-dimensional vectors trained on Wikipedia 2014 and Gigaword 5 (glove.6B).

The training and testing data sets can be downloaded from here:

http://cogcomp.cs.illinois.edu/Data/QA/QC/

Contributors

pasky, pichljan
