is_that_a_duplicate_quora_question's Introduction

is_that_a_duplicate_quora_question

All the code for the article https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur will be available here.

How To

  1. Install Required Libraries
pip install pandas
pip install numpy
pip install scikit-learn
pip install nltk
pip install tqdm
pip install keras
pip install tensorflow
pip install pyemd
pip install fuzzywuzzy
pip install python-levenshtein
pip install --upgrade gensim
  2. Download Required Language libraries
mkdir data
cd data
wget http://www-nlp.stanford.edu/data/glove.840B.300d.zip
unzip glove.840B.300d.zip
rm glove.840B.300d.zip
wget http://qim.ec.quoracdn.net/quora_duplicate_questions.tsv
wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
sudo python -m nltk.downloader stopwords
cd ..
  3. Run
python feature_engineering.py
python deepnet.py

is_that_a_duplicate_quora_question's Issues

Not the final code?

This does not look to be the final version of your code.

The features file is created, but is not used by deepnet.py, for example.

Also, I've got deepnet.py generally running, but the validation loss drifts upward even as accuracy rises. This is the mark of a model that is not big enough to store its information while the data has high variance, so the model overfits.

I just noticed that the data is not shuffled during prep.
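
A note on the shuffling point: if the validation set is carved off as a contiguous slice of the file (Keras's validation_split, for instance, takes the last fraction of the samples before any shuffling), an unshuffled file can give a validation set that differs systematically from the training set. Below is a minimal sketch of shuffling the rows before splitting, in plain Python. The helper names, the 10% fraction, and the seed are assumptions for illustration and are not taken from deepnet.py.

```python
import random


def shuffle_rows(rows, seed=42):
    """Return a shuffled copy of rows; the input list is left untouched."""
    shuffled = list(rows)
    # A private Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    return shuffled


def train_validation_split(rows, validation_fraction=0.1, seed=42):
    """Shuffle first, then cut off the validation slice."""
    shuffled = shuffle_rows(rows, seed)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical rows shaped like (question1, question2, is_duplicate).
rows = [("q%d" % i, "r%d" % i, i % 2) for i in range(100)]
train, val = train_validation_split(rows)
print(len(train), len(val))
```

With pandas the same idea is usually a single call that samples the whole frame in random order before the features are built.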

Shows error on tensorflow backend

I am running deepnet.py, but the following error comes up:

Build model...

TypeError Traceback (most recent call last)
in ()
80 model5 = Sequential()
81 model5.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2))
---> 82 model5.add(LSTM(300, dropout=0.2, recurrent_dropout=0.2))
83
84 model6 = Sequential()

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/keras/models.pyc in add(self, layer)
453 output_shapes=[self.outputs[0]._keras_shape])
454 else:
--> 455 output_tensor = layer(self.outputs[0])
456 if isinstance(output_tensor, list):
457 raise TypeError('All layers in a Sequential model '

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/keras/layers/recurrent.pyc in call(self, inputs, initial_state, **kwargs)
250 else:
251 kwargs['initial_state'] = initial_state
--> 252 return super(Recurrent, self).call(inputs, **kwargs)
253
254 def call(self, inputs, mask=None, initial_state=None, training=None):

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/keras/engine/topology.pyc in call(self, inputs, **kwargs)
552
553 # Actually call the layer, collecting output(s), mask(s), and shape(s).
--> 554 output = self.call(inputs, **kwargs)
555 output_mask = self.compute_mask(inputs, previous_mask)
556

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/keras/layers/recurrent.pyc in call(self, inputs, mask, initial_state, training)
288 'or batch_shape argument to your Input layer.')
289 constants = self.get_constants(inputs, training=None)
--> 290 preprocessed_input = self.preprocess_input(inputs, training=None)
291 last_output, outputs, states = K.rnn(self.step,
292 preprocessed_input,

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/keras/layers/recurrent.pyc in preprocess_input(self, inputs, training)
1031 self.dropout, input_dim, self.units,
1032 timesteps, training=training)
-> 1033 return K.concatenate([x_i, x_f, x_c, x_o], axis=2)
1034 else:
1035 return inputs

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.pyc in concatenate(tensors, axis)
1525 return tf.sparse_concat(axis, tensors)
1526 else:
-> 1527 return tf.concat([to_dense(x) for x in tensors], axis)
1528
1529

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.pyc in concat(concat_dim, values, name)
1073 ops.convert_to_tensor(concat_dim,
1074 name="concat_dim",
-> 1075 dtype=dtypes.int32).get_shape(
1076 ).assert_is_compatible_with(tensor_shape.scalar())
1077 return identity(values[0], name=scope)

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/ops.pyc in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype)
667
668 if ret is None:
--> 669 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
670
671 if ret is NotImplemented:

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.pyc in _constant_tensor_conversion_function(v, dtype, name, as_ref)
174 as_ref=False):
175 _ = as_ref
--> 176 return constant(v, dtype=dtype, name=name)
177
178

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.pyc in constant(value, dtype, shape, name, verify_shape)
163 tensor_value = attr_value_pb2.AttrValue()
164 tensor_value.tensor.CopyFrom(
--> 165 tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
166 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)
167 const_tensor = g.create_op(

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.pyc in make_tensor_proto(values, dtype, shape, verify_shape)
365 nparray = np.empty(shape, dtype=np_dt)
366 else:
--> 367 _AssertCompatible(values, dtype)
368 nparray = np.array(values, dtype=np_dt)
369 # check to them.

/home/nafizh/anaconda3/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.pyc in _AssertCompatible(values, dtype)
300 else:
301 raise TypeError("Expected %s, got %s of type '%s' instead." %
--> 302 (dtype.name, repr(mismatch), type(mismatch).__name__))
303
304

TypeError: Expected int32, got list containing Tensors of type '_Message' instead.

I am using keras 2.0.1 on tensorflow-gpu 0.12.1.
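
The traceback itself hints at a version mismatch: Keras calls tf.concat(tensors, axis), while the installed TensorFlow defines concat(concat_dim, values, name), so the list of tensors lands in the slot where an int32 axis is expected. A small sketch of a pre-flight version check follows. The helper names are hypothetical, and the (1, 0, 0) minimum is an assumption based on that argument order, not a documented requirement of this repo.

```python
import re


def parse_version(text):
    """Extract (major, minor, patch) from a version string such as '0.12.1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("unrecognised version string: %r" % text)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def backend_is_new_enough(tf_version, minimum=(1, 0, 0)):
    """True if tf_version is at least the assumed minimum."""
    return parse_version(tf_version) >= minimum


# The version reported in this issue.
print(backend_is_new_enough("0.12.1"))  # False
```

In a real script the string would come from the installed backend's version attribute and the check would run before the model is built.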

Can't achieve your baseline

Hello,
Thanks for your notes.
I'm trying to reproduce your baseline (the features with a simple logistic regression classifier), using the fs3-2 part of the features.

  1. I can't find any line in your code that calculates the tf-idf of the questions' words.
  2. Whether I calculate it myself or leave it out, I can't reach the accuracy you mention in your report (you said it is 0.804 with logistic regression). Can you share your classifier code, or at least its parameters?

I'm trying to find out what my mistake is.
Thank you in advance.

How do you make predictions?

Thanks for sharing this repo. I've learned a great deal from it!

I'm new to deep learning and using word embeddings in a neural net, and I'm having issues making predictions once I've trained the model. I keep getting a no element found in index or some such error, which I assume is caused by a failed lookup because some words in the test dataset may not exist in the embedding matrix. I arrived at this assumption because predictions work fine if I use the training dataset.

I can provide the exact error message, but I'm just wondering if you can think of something off the top of your head.

Thanks.
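
One plausible cause, offered as a guess without the exact error message: if the tokenizer is fitted again on the test data, new words receive indices beyond the rows of the embedding layer, which was sized as len(word_index) + 1 from the training vocabulary. The sketch below shows the usual remedy in plain Python: build the word index once from the training text, skip unseen words at prediction time, and pad to the same length of 40 used by the Embedding layers. The function names are hypothetical stand-ins, not the repo's code.

```python
def build_word_index(texts):
    """Map each training word to a positive integer; 0 is kept for padding."""
    index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in index:
                index[word] = len(index) + 1
    return index


def to_padded_sequence(text, word_index, max_len=40):
    """Encode text with the training index, dropping words it has never seen."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[-max_len:]                      # keep at most max_len tokens
    return [0] * (max_len - len(ids)) + ids   # left-pad with zeros


word_index = build_word_index(["how are you", "how old are you"])
sequence = to_padded_sequence("how are unseenword you", word_index, max_len=5)
print(sequence)  # [0, 0, 1, 2, 3]
```

Every id produced this way stays within the range the embedding matrix was built for, so the lookup cannot go out of bounds.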

How to load the DNN weights file?

Hello!
While studying the Keras code in "deepnet.py", I got confused by the file 'weights.h5'. I don't know how to reload that file when I want to use the model again.

load_word2vec_format error

@abhishekkrthakur
Thanks for sharing. However, the code below fails with the error [OSError: Not a gzipped file (b've')]:
model=gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
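
The message says the first two bytes read from the file were "ve", not the gzip magic number 0x1f 0x8b, so the file on disk is probably not a gzip archive at all (an incomplete or substituted download is one possibility, though that is a guess). A minimal standard-library sketch for checking this before handing the path to gensim; looks_gzipped is a hypothetical helper name:

```python
GZIP_MAGIC = b"\x1f\x8b"  # the two bytes every gzip stream starts with


def looks_gzipped(path):
    """Return True if the file at path begins with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

If the check returns False for GoogleNews-vectors-negative300.bin.gz, downloading the file again is likely a better next step than changing the loading code.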

Share Build settings

Hi @abhishekkrthakur

Can you please share the Keras version with which you ran this Python file? I have tried Keras 1.2.2, and it gives me the following warnings while executing feature_engineering.py.

administrators-MacBook-Pro:is_that_a_duplicate_quora_question-master priyadhingra$ python feature_engineering.py
feature_engineering.py:24: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
s1 = [w for w in s1 if w not in stop_words]
feature_engineering.py:25: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
s2 = [w for w in s2 if w not in stop_words]
feature_engineering.py:33: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
s1 = [w for w in s1 if w not in stop_words]
feature_engineering.py:34: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
s2 = [w for w in s2 if w not in stop_words]
235it [00:01, 115.30it/s]feature_engineering.py:51: RuntimeWarning: invalid value encountered in double_scalars
return v / np.sqrt((v ** 2).sum())
404351it [02:21, 2862.68it/s]
404351it [02:17, 2932.63it/s]
/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py:702: RuntimeWarning: invalid value encountered in double_scalars
dist = 1.0 - uv / np.sqrt(uu * vv)

Please help here
Thanks in advance!!
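
The two RuntimeWarning lines both come from dividing by a zero norm: the expression v / np.sqrt((v ** 2).sum()) evaluates 0/0 when v is all zeros, which would happen, for example, if none of a question's words had a vector (that cause is an assumption, not something verified against the data). A minimal sketch of a guarded normalisation, written with plain lists so it runs without NumPy; safe_unit_vector is a hypothetical name:

```python
import math


def safe_unit_vector(v):
    """Scale v to unit length, returning zeros when v has zero norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        # Nothing to normalise; avoid the 0/0 that produces NaN.
        return [0.0] * len(v)
    return [x / norm for x in v]


print(safe_unit_vector([3.0, 4.0]))
print(safe_unit_vector([0.0, 0.0]))
```

The same guard applies to the NumPy version: check the norm before dividing, and decide explicitly what a question with no known words should map to.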

How to run/use

Hey,

First off, awesome article and idea.
I have been trying to get your demo up and running but have been struggling.
So far I have installed the packages, but I still get the following errors when trying to run either of the files:

python deepnet.py

  File "deepnet.py", line 19, in <module>
    y = data.is_duplicate.values
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2744, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'is_duplicate'
python feature_engineering.py
/usr/local/lib/python2.7/dist-packages/fuzzywuzzy/fuzz.py:35: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Traceback (most recent call last):
  File "feature_engineering.py", line 58, in <module>
    data['len_q1'] = data.question1.apply(lambda x: len(str(x)))
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2744, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'question1'

What's the setup for the environment, and how do you run/use the tooling?
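
Both tracebacks say the loaded DataFrame lacks the expected columns, which usually means the file was parsed with the wrong separator or is not the expected TSV. A minimal sketch, in Python 3 with the standard library only, that checks the header row before running the scripts; missing_columns is a hypothetical helper, and the tab delimiter is an assumption based on the .tsv extension:

```python
import csv

# question1 and is_duplicate appear in the tracebacks above;
# question2 is the dataset's matching second-question column.
REQUIRED_COLUMNS = {"question1", "question2", "is_duplicate"}


def missing_columns(path, delimiter="\t"):
    """Return the required column names absent from the file's header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), [])
    return sorted(REQUIRED_COLUMNS - set(header))
```

An empty list means the header looks right; anything else points at the download or the separator and not at the Python environment.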
