abhishekkrthakur / is_that_a_duplicate_quora_question

machine-learning deep-learning quora classification

is_that_a_duplicate_quora_question

All the code for the article https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur will be available here.

How To

Install Required Libraries

pip install pandas
pip install numpy
pip install scikit-learn
pip install nltk
pip install tqdm
pip install keras
pip install tensorflow
pip install pyemd
pip install fuzzywuzzy
pip install python-levenshtein
pip install --upgrade gensim

Download Required Data and Language Resources

mkdir data
cd data
wget http://www-nlp.stanford.edu/data/glove.840B.300d.zip
unzip glove.840B.300d.zip
rm glove.840B.300d.zip
wget http://qim.ec.quoracdn.net/quora_duplicate_questions.tsv
wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
sudo python -m nltk.downloader stopwords
cd ..

python feature_engineering.py
python deepnet.py

is_that_a_duplicate_quora_question's Issues

Not the final code?

This does not look to be the final version of your code.

The features file is created, but is not used by deepnet.py, for example.

Also, I've got deepnet.py generally running, but the validation loss drifts upward even as accuracy rises. That pattern usually points to overfitting: the data has high variance, and the model keeps fitting the training set while generalizing worse.

I just noticed that the data is not shuffled during prep.
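One way to address the missing shuffle is to permute all inputs with a single shared ordering before any train/validation split. The following is a minimal sketch, assuming the two question columns and the labels are held as parallel lists; the function name and the seed are hypothetical and not taken from the repository.

```python
import random


def shuffle_together(q1, q2, labels, seed=42):
    """Shuffle three parallel sequences with one shared permutation."""
    if not (len(q1) == len(q2) == len(labels)):
        raise ValueError("inputs must have the same length")
    order = list(range(len(labels)))
    # A seeded generator keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(order)
    return (
        [q1[i] for i in order],
        [q2[i] for i in order],
        [labels[i] for i in order],
    )
```

Because the same permutation is applied to every list, each question pair stays aligned with its label.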

Shows error on tensorflow backend

I am running deepnet.py, but the following error comes up:

Build model...

TypeError Traceback (most recent call last)
in ()
80 model5 = Sequential()
81 model5.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2))
---> 82 model5.add(LSTM(300, dropout=0.2, recurrent_dropout=0.2))
83
84 model6 = Sequential()

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/keras/models.pyc in add(self, layer)
453 output_shapes=[self.outputs[0]._keras_shape])
454 else:
--> 455 output_tensor = layer(self.outputs[0])
456 if isinstance(output_tensor, list):
457 raise TypeError('All layers in a Sequential model '

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/keras/layers/recurrent.pyc in __call__(self, inputs, initial_state, **kwargs)
250 else:
251 kwargs['initial_state'] = initial_state
--> 252 return super(Recurrent, self).__call__(inputs, **kwargs)
253
254 def call(self, inputs, mask=None, initial_state=None, training=None):

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/keras/engine/topology.pyc in __call__(self, inputs, **kwargs)
552
553 # Actually call the layer, collecting output(s), mask(s), and shape(s).
--> 554 output = self.call(inputs, **kwargs)
555 output_mask = self.compute_mask(inputs, previous_mask)
556

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/keras/layers/recurrent.pyc in call(self, inputs, mask, initial_state, training)
288 'or batch_shape argument to your Input layer.')
289 constants = self.get_constants(inputs, training=None)
--> 290 preprocessed_input = self.preprocess_input(inputs, training=None)
291 last_output, outputs, states = K.rnn(self.step,
292 preprocessed_input,

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/keras/layers/recurrent.pyc in preprocess_input(self, inputs, training)
1031 self.dropout, input_dim, self.units,
1032 timesteps, training=training)
-> 1033 return K.concatenate([x_i, x_f, x_c, x_o], axis=2)
1034 else:
1035 return inputs

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.pyc in concatenate(tensors, axis)
1525 return tf.sparse_concat(axis, tensors)
1526 else:
-> 1527 return tf.concat([to_dense(x) for x in tensors], axis)
1528
1529

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.pyc in concat(concat_dim, values, name)
1073 ops.convert_to_tensor(concat_dim,
1074 name="concat_dim",
-> 1075 dtype=dtypes.int32).get_shape(
1076 ).assert_is_compatible_with(tensor_shape.scalar())
1077 return identity(values[0], name=scope)

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/ops.pyc in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype)
667
668 if ret is None:
--> 669 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
670
671 if ret is NotImplemented:

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.pyc in _constant_tensor_conversion_function(v, dtype, name, as_ref)
174 as_ref=False):
175 _ = as_ref
--> 176 return constant(v, dtype=dtype, name=name)
177
178

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.pyc in constant(value, dtype, shape, name, verify_shape)
163 tensor_value = attr_value_pb2.AttrValue()
164 tensor_value.tensor.CopyFrom(
--> 165 tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
166 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)
167 const_tensor = g.create_op(

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.pyc in make_tensor_proto(values, dtype, shape, verify_shape)
365 nparray = np.empty(shape, dtype=np_dt)
366 else:
--> 367 _AssertCompatible(values, dtype)
368 nparray = np.array(values, dtype=np_dt)
369 # check to them.

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.pyc in _AssertCompatible(values, dtype)
300 else:
301 raise TypeError("Expected %s, got %s of type '%s' instead." %
--> 302 (dtype.name, repr(mismatch), type(mismatch).__name__))
303
304

TypeError: Expected int32, got list containing Tensors of type '_Message' instead.

I am using keras 2.0.1 on tensorflow-gpu 0.12.1.
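A likely cause, offered as an assumption rather than a confirmed diagnosis: TensorFlow 1.0 changed the argument order of tf.concat from (concat_dim, values) to (values, axis). The Keras backend frame in the traceback passes the list of tensors first, so TensorFlow 0.12.1 tries to convert that list to an int32 axis and fails. The sketch below is a hypothetical pre-flight check on a version string; the helper names are not part of Keras or TensorFlow.

```python
def parse_version(version):
    """Return the leading numeric components, e.g. '0.12.1' -> (0, 12, 1)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def concat_takes_values_first(tf_version):
    """True when tf.concat expects (values, axis), i.e. TensorFlow >= 1.0."""
    return parse_version(tf_version) >= (1, 0)
```

In practice the string would come from tensorflow.__version__; for the setup reported above the check returns False, which suggests upgrading TensorFlow or using an older Keras release.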

Why is the embedding matrix not passed as the embedding layer weights in models 5 & 6?

Why are we learning the embeddings in model 5 & 6 and not passing the glove embeddings?
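For reference, pretrained vectors are usually supplied to an embedding layer as a matrix whose row i holds the vector for the word with index i. The sketch below builds such a matrix with plain lists so that it runs without NumPy. The names build_embedding_matrix, vectors, and dim are hypothetical, and whether the repository's models 5 and 6 should use it is left open.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Build a (len(word_index) + 1) x dim matrix of pretrained vectors.

    Row 0 (reserved for padding) and rows for words that have no
    pretrained vector stay all zeros.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, index in word_index.items():
        vector = vectors.get(word)
        if vector is None:
            continue  # unknown word: keep the zero row
        if len(vector) != dim:
            raise ValueError("vector for %r has wrong dimension" % word)
        matrix[index] = [float(x) for x in vector]
    return matrix
```

The result would typically be converted to a NumPy array before being handed to Keras, for example through the Embedding layer's weights argument.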

Can't achieve your base line

Hello,
Thanks for your notes.
I'm trying to reproduce your baseline (the features with a simple logistic regression classifier), using the fs3-2 part of the features.

I can't find any line in your code that calculates tf-idf for the words in the questions.
Whether I calculate it myself or leave it out, I can't reach the accuracy mentioned in your report (you said it is 0.804 with logistic regression). Can you share your classifier code, or at least its parameters?

I'm trying to find my mistake.
Thank you in advance.

How do you make predictions?

Thanks for sharing this repo. I've learned a great deal from it!

I'm new to deep learning and using word embeddings in a neural net, and I'm having issues making predictions once I've trained the model. I keep getting a no element found in index or some such error, which I assume is caused by a failed lookup because some words in the test dataset may not exist in the embedding matrix. I arrived at this assumption because predictions work fine if I use the training dataset.

I can provide the exact error message, but I'm just wondering if you can think of something off the top of your head.

Thanks.
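If the failure really is caused by words that were never seen during training, one option is to encode test text with the vocabulary fitted on the training data and drop unseen words. This is a minimal sketch under that assumption. The function name is hypothetical, the default length of 40 matches the input_length in the traceback quoted elsewhere on this page, and the left padding and left truncation imitate the defaults of Keras' pad_sequences.

```python
def encode_question(text, word_index, maxlen=40):
    """Convert text to a fixed-length list of word ids.

    Words missing from word_index are skipped, so the result never
    contains an id outside the embedding matrix.
    """
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[-maxlen:]  # keep the last maxlen tokens
    return [0] * (maxlen - len(ids)) + ids  # left-pad with zeros
```

The key point is that word_index must be the one built on the training set; fitting a new vocabulary on the test set produces ids that the trained embedding layer has no rows for.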

How to load the DNN weights file?

Hello!
While studying the Keras code in deepnet.py, I got confused about the file 'weights.h5'. I don't know how to reload it when I want to use the model again.
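A weights file stores parameters only, not the architecture, so the usual pattern is to rebuild the same network and then load the weights into it. The sketch below assumes a hypothetical build_model function that recreates the model exactly as deepnet.py defines it; load_weights is the method Keras models provide for this.

```python
import os


def restore_model(build_model, weights_path="weights.h5"):
    """Rebuild the network and load previously saved weights into it."""
    if not os.path.isfile(weights_path):
        raise FileNotFoundError(weights_path)
    model = build_model()  # must match the architecture used in training
    model.load_weights(weights_path)
    return model
```

After this, model.predict can be called on inputs prepared the same way as the training data.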

Without a neural network, what log loss can we achieve?

And how many feature dimensions does the machine learning model without a neural network use?
Thank you.
@abhishekkrthakur

load_word2vec_format error

@abhishekkrthakur
Thanks for sharing. However, the code below raises an error [OSError: Not a gzipped file (b've')]:
model=gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
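The message shows that the first two bytes read were b've', not the gzip signature, so the file on disk is probably not the archive its name suggests (an incomplete or redirected download is one possibility, stated here as a guess). A quick way to check before calling gensim is to inspect the magic bytes; the helper name below is hypothetical.

```python
GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def looks_like_gzip(path):
    """Return True if the file begins with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

If the check fails, downloading the file again and comparing its size with the published one is a reasonable next step.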

Keras 2.0 compatibility

The code doesn't work with Keras 2 API. It needs to be updated to support the new API.
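Much of a Keras 1 to Keras 2 port consists of renamed keyword arguments. The table below lists a few renames documented for Keras 2; which of them deepnet.py actually needs has not been checked here, and the helper upgrade_kwargs is a hypothetical illustration rather than a Keras utility.

```python
# Keras 1 keyword -> Keras 2 keyword (a partial list).
KERAS2_RENAMES = {
    "nb_epoch": "epochs",                # Model.fit
    "nb_filter": "filters",              # Convolution1D -> Conv1D
    "filter_length": "kernel_size",      # Convolution1D -> Conv1D
    "border_mode": "padding",            # Convolution1D -> Conv1D
    "subsample_length": "strides",       # Convolution1D -> Conv1D
    "dropout_W": "dropout",              # LSTM input dropout
    "dropout_U": "recurrent_dropout",    # LSTM recurrent dropout
}


def upgrade_kwargs(kwargs):
    """Return a copy of kwargs with Keras 1 names replaced by Keras 2 names."""
    return {KERAS2_RENAMES.get(key, key): value for key, value in kwargs.items()}
```

Layer classes that were removed or restructured still need manual changes; a keyword rename alone does not cover them.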

Where to use quora_features from feature_engineering?

quora_features.csv is written on the last line of feature_engineering.py.
But deepnet.py does not use it.
Thank you!
@abhishekkrthakur

What kind of feature is suitable for this scenario? (dealing with synonyms)

Training data: "Who is Michael Jordan?" and "Who is Air Jordan?" are a correct match.
Test data: "Who is Jumpman Jordan?" and "Who is Air Jordan?"

What feature should I add for training? Can the model capture the match in this case?

Thank you!!!
@abhishekkrthakur

Share Build settings

Hi @abhishekkrthakur

Can you please share the Keras version you used to run this Python file? I tried Keras 1.2.2, and it gives me the following warnings while executing feature_engineering.py.

administrators-MacBook-Pro:is_that_a_duplicate_quora_question-master priyadhingra$ python feature_engineering.py
feature_engineering.py:24: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
s1 = [w for w in s1 if w not in stop_words]
feature_engineering.py:25: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
s2 = [w for w in s2 if w not in stop_words]
feature_engineering.py:33: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
s1 = [w for w in s1 if w not in stop_words]
feature_engineering.py:34: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
s2 = [w for w in s2 if w not in stop_words]
235it [00:01, 115.30it/s]feature_engineering.py:51: RuntimeWarning: invalid value encountered in double_scalars
return v / np.sqrt((v ** 2).sum())
404351it [02:21, 2862.68it/s]
404351it [02:17, 2932.63it/s]
/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py:702: RuntimeWarning: invalid value encountered in double_scalars
dist = 1.0 - uv / np.sqrt(uu * vv)

Please help here
Thanks in advance!!
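The two RuntimeWarnings are consistent with a division by a zero norm, which can happen when a question yields an all-zero vector (for instance, if every word is a stop word or missing from the embedding vocabulary; this cause is an assumption). The sketch below shows the guard using plain lists so that it runs without NumPy; the same check applies to the NumPy expression in the log.

```python
import math


def unit_vector(values):
    """Scale a vector to length 1, returning zeros for a zero vector."""
    norm = math.sqrt(sum(x * x for x in values))
    if norm == 0.0:
        # Avoid 0/0, which would produce NaN and the warning shown above.
        return [0.0] * len(values)
    return [x / norm for x in values]
```

Rows that come out as all zeros can then be handled explicitly downstream, instead of propagating NaN into the distance features.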

How to run/use

Hey,

First off, awesome article and idea.
I have been trying to get your demo up and running but have been struggling.
So far I have installed the following packages but still get the following errors when trying to run either of the files:

python deepnet.py

  File "deepnet.py", line 19, in <module>
    y = data.is_duplicate.values
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2744, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'is_duplicate'

python feature_engineering.py
/usr/local/lib/python2.7/dist-packages/fuzzywuzzy/fuzz.py:35: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Traceback (most recent call last):
  File "feature_engineering.py", line 58, in <module>
    data['len_q1'] = data.question1.apply(lambda x: len(str(x)))
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2744, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'question1'

What's the setup for the environment, and how do you run and use the tooling?
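Both AttributeErrors say that the loaded table lacks the expected columns, which usually means the file was not parsed as tab-separated text or is not the dataset at all. The following is a small sketch for checking the header before running either script; the helper name is hypothetical, and the column names are those of the Quora question pairs file.

```python
import csv

REQUIRED_COLUMNS = {"question1", "question2", "is_duplicate"}


def missing_columns(path, delimiter="\t"):
    """Return the required column names that the file's header lacks."""
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), [])
    return REQUIRED_COLUMNS - set(header)
```

An empty result means the header looks right; anything else points to a wrong path, a wrong separator, or a failed download.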
