

SiameseSentenceSimilarity

SiameseSentenceSimilarity: a personal implementation of a similar-sentence classification model based on a Siamese BiLSTM network, with training and test datasets provided.

Project Introduction

Sentence similarity computation is an important technique in natural language processing. There are two main approaches:
1. Traditional unsupervised methods. There are many classic ways to compute sentence similarity; if you are interested, see my project on traditional similarity computation:
https://github.com/liuhuanyong/SentenceSimilarity

2. Supervised similarity computation over labeled data. The general idea is to recast sentence similarity as a classification problem over sentence pairs; the classic approach for this is the Siamese network, which is the motivation for this project.

Data

The dataset comes from the WeBank customer question matching competition of the CCKS 2018 evaluation, with 100,000 sentence pairs in total. Sample format (sentence1, sentence2, label, tab-separated):

'''
	怎么我开不了微利貸	怎么开不了户  录制不了 提示上传失败	0
	亲为什么我的审批不通过的	为什么还款及时会提示综合评估未通过	1
	你好,我借款的验证码发到我以前用的那个手机号码了,我该怎么设置呢	手机号码换了	1
	“如何获得微粒贷资格”	为什么没微粒贷啊	1
	为什么没接到电话	两天了,怎么还没有给我打电话审核?	1
	我的电话已改为	绑定的手机号码能不能更改	1
	借贷下来时间	10月国庆期间能借钱不	0
	什么时候才邀请?	什么时候才能申请	1
	上边可借56000元为什么申请不成功	为什么可借一万五,却借不出来	1
	1万利息是多少	10个月利息多少	1
	没经过审批	如何能通过微众银行审批要求	1
	延期3天还款收取逾期利息是多少?	14号还款日,逾期两天手续费是多少?	1
	申请的额度能取现吗	取现一次性取完可以吗	0
	利息与罚息如何计算	咱这个利息多高啊	1
	如何申请货款	怎样开通我微粒贷	1
	多久才有贷款	凌晨以后的申请何时到账	1
	你好 我要换卡怎么换 我卡掉了	换卡失败	0

	'''
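
A minimal sketch of how such a tab-separated file might be loaded; the function name load_pairs and the path data/train.txt are illustrative, not the repository's actual loader:

def load_pairs(path):
    # Each line: sentence1 <tab> sentence2 <tab> label (1 = similar, 0 = dissimilar)
    left, right, labels = [], [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) != 3:
                continue  # skip malformed lines
            left.append(parts[0])
            right.append(parts[1])
            labels.append(int(parts[2]))
    return left, right, labels

left_sents, right_sents, labels = load_pairs('data/train.txt')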

Model

Model design: a standard Siamese network. The two sentences are fed in as a left and a right branch, each encoded by two stacked bidirectional LSTM layers (four directional LSTMs in total, with weights shared between the branches); the distance between the two encodings is then computed and used for the final classification.

1. Encoding layer: two stacked bidirectional LSTMs with shared weights

from keras.layers import Input, LSTM, Bidirectional, Dropout
from keras.models import Model

'''Build the shared encoder network (weights are shared between the two branches)'''
def create_base_network(self, input_shape):
    # Two stacked bidirectional LSTMs with dropout; the last layer returns a fixed-size sentence encoding
    input = Input(shape=input_shape)
    lstm1 = Bidirectional(LSTM(128, return_sequences=True))(input)
    lstm1 = Dropout(0.5)(lstm1)
    lstm2 = Bidirectional(LSTM(32))(lstm1)
    lstm2 = Dropout(0.5)(lstm2)
    return Model(input, lstm2)
 
2. Similarity between the left and right sentence encodings
from keras import backend as K

'''Similarity between the two sentence encodings, based on the Manhattan distance'''
def exponent_neg_manhattan_distance(self, sent_left, sent_right):
    # exp(-L1 distance): identical encodings give 1.0, distant encodings approach 0
    return K.exp(-K.sum(K.abs(sent_left - sent_right), axis=1, keepdims=True))
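
Putting the two pieces together, here is a minimal sketch of how the shared encoder and the Manhattan distance can be wired into the full Siamese model. The attribute names (VOCAB_SIZE, embedding_matrix, etc.), the frozen embedding, and the optimizer choice are assumptions for illustration, not necessarily identical to the repository's bilstm_siamese_model:

from keras import backend as K
from keras.layers import Input, Embedding, Lambda
from keras.models import Model

def bilstm_siamese_model(self):
    # Assumed attributes: self.TIME_STAMPS (max sentence length), self.EMBEDDING_DIM,
    # self.VOCAB_SIZE and self.embedding_matrix (pretrained word vectors).
    left_input = Input(shape=(self.TIME_STAMPS,), dtype='int32')
    right_input = Input(shape=(self.TIME_STAMPS,), dtype='int32')

    # One embedding layer and one encoder applied to both branches -> shared weights
    embedding = Embedding(self.VOCAB_SIZE, self.EMBEDDING_DIM,
                          weights=[self.embedding_matrix], trainable=False)
    shared_lstm = self.create_base_network(
        input_shape=(self.TIME_STAMPS, self.EMBEDDING_DIM))
    left_encoded = shared_lstm(embedding(left_input))
    right_encoded = shared_lstm(embedding(right_input))

    # Similarity in (0, 1] via the exponentiated negative Manhattan distance;
    # a plain lambda (rather than a bound method) keeps the model JSON-serializable
    distance = Lambda(
        lambda t: K.exp(-K.sum(K.abs(t[0] - t[1]), axis=1, keepdims=True)),
        output_shape=lambda shapes: (shapes[0][0], 1))([left_encoded, right_encoded])

    model = Model([left_input, right_input], distance)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model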

Training

Model               Train set   Test set   Train accuracy   Test accuracy   Notes
Question matching   80,000      20,000     0.8125           0.7956          20 epochs
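
For reference, a minimal training sketch under the same assumptions as above; the attribute names (self.left_x, self.right_x, self.y) and the batch size are hypothetical, and the repository's train_model may split, batch, and save differently:

def train_model(self):
    # Hypothetical attributes: self.left_x and self.right_x hold the padded index
    # sequences, self.y the 0/1 labels for the 100,000 pairs.
    model = self.bilstm_siamese_model()
    model.fit([self.left_x, self.right_x], self.y,
              validation_split=0.2,   # roughly the 80,000 / 20,000 split reported above
              batch_size=512,
              epochs=20)
    # Saving only the weights sidesteps the "Not JSON Serializable" issue reported below
    model.save_weights(self.model_path)
    return model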

Summary

1. Sentence similarity computation is an important technique in natural language processing; this project is a simple implementation of a Siamese similarity network.
2. With bidirectional LSTM encoding and the Manhattan distance as the similarity measure, the network reaches an accuracy of 0.8125 on the training set and 0.7956 on the test set.
3. There are many other networks for similarity computation; this project is a baseline, and other architectures will be studied and tried later.
4. Combining traditional similarity measures with deep learning networks may be a direction worth exploring.

Contact

For questions or collaboration on natural language processing, knowledge graphs, event graphs, social computing, or language resource construction, please contact me:
Email: [email protected]
CSDN: https://blog.csdn.net/lhy2014
My NLP projects: https://liuhuanyong.github.io/
Liu Huanyong, Institute of Software, Chinese Academy of Sciences


Issues

save error

Traceback (most recent call last):
File "siamese_model.py", line 242, in
handler.train_model()
File "siamese_model.py", line 195, in train_model
model.save(self.model_path)
File "/root/anaconda3/lib/python3.6/site-packages/keras/engine/network.py", line 1090, in save
save_model(self, filepath, overwrite, include_optimizer)
File "/root/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 382, in save_model
_serialize_model(model, f, include_optimizer)
File "/root/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 84, in _serialize_model
model_config = json.dumps(model_config, default=get_json_type)
File "/root/anaconda3/lib/python3.6/json/init.py", line 238, in dumps
**kw).encode(obj)
File "/root/anaconda3/lib/python3.6/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/root/anaconda3/lib/python3.6/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/root/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 74, in get_json_type
raise TypeError('Not JSON Serializable: %s' % (obj,))
TypeError: Not JSON Serializable: <__main__.SiameseNetwork object at 0x7feec034c710>
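
This failure most likely happens because the Lambda layer in the model closes over the SiameseNetwork instance (for example by calling self.exponent_neg_manhattan_distance), which Keras then tries and fails to serialize to JSON when saving the full model. A possible workaround sketch, saving and restoring only the weights:

# Instead of model.save(self.model_path):
model.save_weights(self.model_path)

# Later, rebuild the same architecture in code and load the weights back:
model = handler.bilstm_siamese_model()
model.load_weights(handler.model_path)

Alternatively, defining the distance function at module level (so the Lambda does not capture self) should also let model.save succeed.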

Keras and TensorFlow version issue

With Keras <= 2.1.2 and TensorFlow 1.0.0, the following exception is thrown:
passed an input_mask: Tensor("embedding_1/NotEqual:0", shape=(?, 25), dtype=bool)
With Keras 2.2.4 and TensorFlow 1.0.0, the following exception is thrown:
TypeError: while_loop() got an unexpected keyword argument 'maximum_iterations'
With Keras 2.2.4 and TensorFlow 1.13.1, the following exception is thrown:
AttributeError: module '_pywrap_tensorflow_internal' has no attribute 'TFE_DEVICE_PLACEMENT_EXPLICIT_swigconstant'

Which TensorFlow version should be used?
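
For reference: the maximum_iterations keyword that Keras 2.2.4 passes to while_loop only exists in newer TensorFlow 1.x releases, so TensorFlow 1.0.0 is too old for it, while the 1.13.1 error looks like a broken TensorFlow installation rather than an API mismatch. A combination along these lines may work, though it is untested and not confirmed by the author:

pip install tensorflow==1.12.0 keras==2.2.4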

Question about sentence length

Hello. From your code, all sentences are processed to a fixed length, which is chosen as follows: build a counter with collections.Counter over sentence lengths, traverse it in descending order of frequency, divide each length's count by the total number of sentences, and keep accumulating this ratio until it exceeds a given threshold, i.e. until most sentences are covered; the length reached at the threshold is taken as the final fixed length. Because the counter is traversed in descending order of frequency and the threshold is close to 1, the chosen length, say 10, is one that occurs relatively rarely in the training set, meaning most sentences are either longer or shorter than 10. Since every sentence is forced to length 10, could the following happen: if many sentences are longer than 10 and get truncated to 10, won't a lot of information be lost?
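
For reference, a minimal sketch of the kind of frequency-coverage length selection described above; this is not the repository's exact code, the 0.95 threshold is an assumption, and the whitespace split assumes pre-tokenized text:

import collections

def select_max_length(sentences, coverage=0.95):
    # Count how many sentences have each token length
    length_counts = collections.Counter(len(s.split()) for s in sentences)
    total = len(sentences)
    covered = 0.0
    # Walk lengths from most frequent to least frequent, accumulating coverage
    for length, count in length_counts.most_common():
        covered += count / total
        if covered >= coverage:
            return length
    return max(length_counts)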

TypeError: while_loop() got an unexpected keyword argument 'maximum_iterations'

ub16hp@UB16HP:/media/ub16hp/WINDOWS/ub16_prj/liuhuanyong/SiameseSentenceSimilarity$ sudo pip3.5 install keras==2.2.4
The directory '/home/ub16hp/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/ub16hp/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Looking in indexes: https://mirrors.aliyun.com/pypi/simple
Requirement already satisfied: keras==2.2.4 in /usr/local/lib/python3.5/dist-packages (2.2.4)
Requirement already satisfied: keras-applications>=1.0.6 in /usr/local/lib/python3.5/dist-packages (from keras==2.2.4) (1.0.6)
Requirement already satisfied: numpy>=1.9.1 in /usr/local/lib/python3.5/dist-packages (from keras==2.2.4) (1.14.0)
Requirement already satisfied: keras-preprocessing>=1.0.5 in /usr/local/lib/python3.5/dist-packages (from keras==2.2.4) (1.0.5)
Requirement already satisfied: scipy>=0.14 in /usr/local/lib/python3.5/dist-packages (from keras==2.2.4) (1.0.0)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.5/dist-packages (from keras==2.2.4) (3.13)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.5/dist-packages (from keras==2.2.4) (1.11.0)
Requirement already satisfied: h5py in /usr/local/lib/python3.5/dist-packages (from keras==2.2.4) (2.8.0)
ub16hp@UB16HP:/media/ub16hp/WINDOWS/ub16_prj/liuhuanyong/SiameseSentenceSimilarity$ python3.5 siamese_model.py
Using TensorFlow backend.
/home/ub16hp/.local/lib/python3.5/site-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.24.1) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
100000 100000
average_length: 11.68037
max_length: 25
Found 20028 word vectors.
[[0]
[0]
[0]
...
[1]
[1]
[0]]
Traceback (most recent call last):
File "siamese_model.py", line 242, in
handler.train_model()
File "siamese_model.py", line 186, in train_model
model = self.bilstm_siamese_model()
File "siamese_model.py", line 166, in bilstm_siamese_model
shared_lstm = self.create_base_network(input_shape=(self.TIME_STAMPS, self.EMBEDDING_DIM))
File "siamese_model.py", line 145, in create_base_network
lstm1 = Bidirectional(LSTM(128, return_sequences=True))(input)
File "/usr/local/lib/python3.5/dist-packages/keras/layers/wrappers.py", line 427, in call
return super(Bidirectional, self).call(inputs, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/base_layer.py", line 457, in call
output = self.call(inputs, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/keras/layers/wrappers.py", line 522, in call
y = self.forward_layer.call(inputs, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/keras/layers/recurrent.py", line 2194, in call
initial_state=initial_state)
File "/usr/local/lib/python3.5/dist-packages/keras/layers/recurrent.py", line 649, in call
input_length=timesteps)
File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 3011, in rnn
maximum_iterations=input_length)
TypeError: while_loop() got an unexpected keyword argument 'maximum_iterations'
ub16hp@UB16HP:/media/ub16hp/WINDOWS/ub16_prj/liuhuanyong/SiameseSentenceSimilarity$

Keras version

TypeError: Layer input_3 does not support masking, but was passed an input_mask: Tensor("embedding_1/NotEqual:0", shape=(?, 25), dtype=bool)

Hi, which Keras version is used? I looked up the error above and it appears to be a version issue.
