This project forked from cosmmb/dcnn
This is the code for our ACL 2015 paper "Dependency-based Convolutional Neural Networks for Sentence Embedding".

Thanks to Yoon Kim for sharing his code and giving us suggestions for this project. Our project extends the code for his paper "Convolutional Neural Networks for Sentence Classification" (EMNLP 2014). We post this project with Yoon's permission.

You are welcome to adapt and optimize our project, but please do not use our code for commercial purposes.

This code runs on python 2.66 and Theano 0.7.

Our model is based purely on words; it includes no POS tag information. There are many ways to improve performance by including tag information. The simplest is to treat tags as words and include them in the convolution. Another is to use different convolution filters (w in the paper) for words with different tags. You are welcome to discuss or collaborate with us on these extensions.

The paper can be found at: http://people.oregonstate.edu/~mam/pdf/papers/DCNN.pdf

This version contains only the tree+sib model, which can easily be extended to the tree+sib+seq model.

File description:

1. The folder "TREC" contains the TREC dataset with 6 categories. The data is from here: http://cogcomp.cs.illinois.edu/Data/QA/QC/ . "TREC_all.txt" is the original data. Parsing it with the Stanford parser gives "TREC_all_parsed.txt". "label_all.txt" holds the label for each sentence in "TREC_all.txt".
2. "preindex.py" converts each sentence in the parse file into a tree format.
3. "process_TREC.py" does the text preprocessing.
4. "conv_net_classes.py" contains some basic functions for the CNN.
5. "conv_sib_gpu.py" is our main function.
6. The folder "data" is where you should put the word2vec binary file so that "process_TREC.py" works. You can find the file here: https://code.google.com/p/word2vec/
7. "log_170.txt" records the accuracy on the training, dev and test sets for each epoch. These results were generated on a GPU. 170 means the batch size was 170. The other training settings can be found in "conv_sib_gpu.py".

Instructions:

First step: download the word2vec file and save it in the "data" folder.
Second step: run "python process_TREC.py" ("preindex.py" is run from within this file).
Third step: run
    THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python conv_sib_gpu.py 170

This code uses a GPU (Tesla K80), but it also works on a CPU. To test on a CPU, change device=gpu to device=cpu and floatX=float32 to floatX=float64 in the command above. Because the GPU and CPU differ in precision, the results will be slightly different in some cases.

Compared with the other hyperparameters, the performance of the model is relatively sensitive to batch_size and lr_decay. I would suggest tuning these two hyperparameters first.

In our implementation, we use 10% of the training data as the dev set. We do not recycle the dev set to train the model again. Some people do this, and I believe it would improve the performance.

Mingbo Ma
[email protected]
EECS, Oregon State University
Sep 25 2015
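
Note on the dev split: the 10% dev split mentioned above could be sketched roughly as follows. This is only an illustration, not the code in "conv_sib_gpu.py". The function name split_train_dev, its parameters, and the fixed seed are assumptions made for this example, and the actual script may shuffle and split differently.

```python
import random


def split_train_dev(examples, dev_fraction=0.1, seed=0):
    """Hold out a fraction of the training examples as a dev set.

    Hypothetical helper for illustration; not part of this repository.
    """
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be strictly between 0 and 1")

    # Shuffle a copy so the caller's list is left untouched.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)

    # The first n_dev shuffled examples become the dev set.
    n_dev = int(round(len(shuffled) * dev_fraction))
    dev = shuffled[:n_dev]
    train = shuffled[n_dev:]
    return train, dev


if __name__ == "__main__":
    # Toy data: (sentence, label) pairs with 6 label values, as in TREC.
    data = [("sentence %d" % i, i % 6) for i in range(100)]
    train, dev = split_train_dev(data)
    print("%d train examples, %d dev examples" % (len(train), len(dev)))
```

The dev examples are kept out of the returned training list, which matches the choice above not to recycle the dev set for training.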