percent4 / spo_extract_platform

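This project is an attempt to use deep learning to build knowledge graphs. As open-domain relation extraction, it is something of an innovation on the author's part; few articles and projects currently exist in this area.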

Python 96.28% Shell 0.26% Perl 2.20% HTML 1.26%

spo_extract_platform's Introduction

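An attempt at open-domain relation extraction

Platform components

Annotation platform (front-end web page), in the directory spo_tagging_platform;
Annotated content: S, P, O, is_tagging, the original text, and the relation among S, P and O.
Models:

S, P, O: sequence labeling (ALBERT+BiLSTM+CRF), in the directory sequence_labeling; F1 on the test set is about 81%;
Relation extraction: binary text classification (ALBERT+BiGRU+ATT), in the directory text_classification; accuracy on the test set is about 96%.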
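As a rough illustration of how a sequence-labeling output can be turned into S, P and O spans, the sketch below decodes a BIO-style tag sequence. The tag names (`B-SUBJ`, `I-PRED`, and so on) are assumptions made for illustration; the project's actual tag set lives in sequence_labeling and may differ. The output keys `subjs`, `preds` and `objs` follow the service output quoted in the issues.

```python
from typing import Dict, List

# Hypothetical tag suffixes; the repository's real tag set may differ.
TAG_TO_KEY = {"SUBJ": "subjs", "PRED": "preds", "OBJ": "objs"}


def decode_bio(chars: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Group characters into subject/predicate/object spans using BIO tags."""
    result: Dict[str, List[str]] = {"subjs": [], "preds": [], "objs": []}
    current_type = None  # type of the span being built, e.g. "SUBJ"
    buffer: List[str] = []

    def flush() -> None:
        # Store the finished span (if any) under its key and reset the state.
        nonlocal current_type, buffer
        if current_type is not None and buffer:
            result[TAG_TO_KEY[current_type]].append("".join(buffer))
        current_type, buffer = None, []

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-") and tag[2:] in TAG_TO_KEY:
            flush()
            current_type, buffer = tag[2:], [ch]
        elif tag.startswith("I-") and tag[2:] == current_type:
            buffer.append(ch)
        else:
            # "O", an unknown tag, or an I- tag that does not continue the span.
            flush()
    flush()
    return result


chars = list("黄震考察了明略科技")
tags = ["B-SUBJ", "I-SUBJ", "B-PRED", "I-PRED", "O",
        "B-OBJ", "I-OBJ", "I-OBJ", "I-OBJ"]
print(decode_bio(chars, tags))
```

Closing a span whenever the tag changes keeps the decoder tolerant of malformed sequences, which matters when a CRF is not used or its constraints are loose.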

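The annotated corpus comes from news and novel text.

Applications of the project to extraction from novels, news, and other unstructured text are in the directory extract_example.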

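Data

The sequence-labeling model currently has 3,211 samples, and relation extraction has 9,279 annotated examples covering 1,365 relations. The 20 most frequent relations are shown in the figure below: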
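A ranking of this kind can be computed with `collections.Counter`. The sketch below assumes the annotated relations are available as a plain list of strings, which is an assumption about the data layout rather than the repository's actual file format; the sample labels are invented.

```python
from collections import Counter
from typing import List, Tuple


def top_relations(relations: List[str], k: int = 20) -> List[Tuple[str, int]]:
    """Return the k most frequent relation labels with their counts."""
    return Counter(relations).most_common(k)


# Toy labels for illustration only; not taken from the project's data.
sample = ["director", "starring", "director", "CEO", "director", "starring"]
print(top_relations(sample, k=2))  # [('director', 3), ('starring', 2)]
```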

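Preparation before use

The platform is developed in Python 3; see requirements.txt for the modules to install.

How to use the platform?

The sequence-labeling and binary text-classification models are already trained, so the repository can be used directly after cloning.

Run sequence_labeling/run.py; this HTTP service listens on port 12306;
Run text_classification/extract_server.py; this HTTP service listens on port 12308;

In Postman, enter the following (the input is a single sentence; it should not be too long, ideally no more than 128 characters):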
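The same kind of request can be prepared from Python instead of Postman. The sketch below only builds the request and enforces the 128-character recommendation. The route `/extract`, the JSON field name `text`, the JSON encoding, and the choice of port 12308 as the entry point are all assumptions for illustration; check extract_server.py for the real route and parameter names.

```python
import json
import urllib.request

MAX_CHARS = 128  # recommended upper bound on sentence length


def build_request(sentence: str, port: int = 12308,
                  path: str = "/extract",   # hypothetical route
                  field: str = "text"       # hypothetical field name
                  ) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying one sentence."""
    if len(sentence) > MAX_CHARS:
        raise ValueError(
            f"sentence has {len(sentence)} characters; keep it within {MAX_CHARS}"
        )
    body = json.dumps({field: sentence}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"http://localhost:{port}{path}",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


req = build_request("3月5日下午，上海市政协副主席黄震考察了明略科技集团上海办公室。")
# To send it once both services are running:
#     with urllib.request.urlopen(req) as resp:
#         print(resp.read().decode("utf-8"))
```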

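Results

Most of the annotated content covers personal titles, relationships between people, relationships between companies and people, and the lead actors and directors of films and TV series.

Results are relatively good when a sentence contains only one triple.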
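One plausible reason single-triple sentences work best: in a pipeline, every combination of extracted subject, predicate and object has to be judged by the binary classifier, so the number of candidates grows multiplicatively. The sketch below enumerates candidates under that assumption. `is_valid` is a hypothetical stand-in for the ALBERT+BiGRU+ATT classifier, and the names in the toy data are placeholders.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


def candidate_triples(spans: Dict[str, List[str]]) -> List[Triple]:
    """Enumerate every (subject, predicate, object) combination."""
    return list(product(spans["subjs"], spans["preds"], spans["objs"]))


def extract(spans: Dict[str, List[str]],
            is_valid: Callable[[Triple], bool]) -> List[Triple]:
    """Keep only the candidates that the classifier accepts."""
    return [t for t in candidate_triples(spans) if is_valid(t)]


# Placeholder spans; two subjects and two objects already give four candidates.
spans = {"subjs": ["PersonA", "PersonB"], "preds": ["directed"],
         "objs": ["FilmX", "FilmY"]}
accepted = {("PersonA", "directed", "FilmX")}  # toy classifier decisions
print(len(candidate_triples(spans)))           # 4
print(extract(spans, lambda t: t in accepted))
```

With one subject, one predicate and one object there is a single candidate, so a classifier error cannot introduce a spurious pairing.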

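The extract_example directory contains extraction results, including results on several novels and some news articles. For a demonstration, see another project: https://github.com/percent4/knowledge_graph_demo .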

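Some sentences also yield useless triples, so the system leans toward recall at the expense of precision. This is because the project targets open-domain triple extraction, so results are not as good as one might expect. Ways to improve extraction:

Increase the amount of annotated data; the sequence-labeling model currently has only 3,211 samples;
Model: the current design is a pipeline; each stage performs reasonably on its own, but overall it is not as good as a joint model;
For other kinds of triples you want to extract, add annotations of that kind;
Text prediction was slow (this problem has been resolved).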

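Contact

This project is the author's attempt at open-domain triple extraction. Few articles or projects existed in this area beforehand, so it should be regarded as exploratory.

The source code and data are provided in the project.

For further discussion, send a message to [email protected], or leave a comment directly on GitHub.

The author's WeChat official account is Python爬虫与算法; feel free to follow.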


spo_extract_platform's Issues

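In text classification, why are the S, P and O in the original text replaced with letters?

A question about annotation

Is the SPO tagging scheme custom? Can I use a different tagging scheme?

The sample Postman input... the image is missing... accessing the service directly returns 400: Bad Request

404Notfound

The page does not come up when running; could someone point me in the right direction?

There seems to be a problem with the TensorFlow version

tensorflow.contrib in the code belongs to TensorFlow 1.x; after downgrading TF, even more errors appeared.

A problem running run

After running run, the service did not start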

WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\albert_zh\tokenization.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From f:/代码/spo_extract_platform/spo_extract_platform-master/sequence_labeling/runServer.py:34: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From f:/代码/spo_extract_platform/spo_extract_platform-master/sequence_labeling/runServer.py:39: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-09-15 14:49:14.849269: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:28: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\albert_zh\modeling.py:171: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

embedding_lookup_factorized. factorized embedding parameterization is used.
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\albert_zh\modeling.py:482: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.     

ln_type: postln
old structure of transformer.use: transformer_model,which use post-LN
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\albert_zh\modeling.py:750: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:45: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:136: bidirectional_dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
WARNING:tensorflow:From E:\anaconda\anaconda\envs\spo_extract\lib\site-packages\tensorflow\python\ops\rnn.py:464: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
WARNING:tensorflow:From E:\anaconda\anaconda\envs\spo_extract\lib\site-packages\tensorflow\python\ops\init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From E:\anaconda\anaconda\envs\spo_extract\lib\site-packages\tensorflow\python\ops\rnn.py:244: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:153: The name tf.nn.xw_plus_b is deprecated. Please use tf.compat.v1.nn.xw_plus_b instead.

WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:66: The name tf.train.init_from_checkpoint is deprecated. Please use tf.compat.v1.train.init_from_checkpoint instead.

**** Trainable Variables ****
  name = %s, shape = %s%s bert/embeddings/word_embeddings:0 (21128, 128) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/embeddings/word_embeddings_2:0 (128, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/embeddings/token_type_embeddings:0 (2, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/embeddings/position_embeddings:0 (512, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/embeddings/LayerNorm/beta:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/embeddings/LayerNorm/gamma:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/self/query/kernel:0 (312, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/self/query/bias:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/self/key/kernel:0 (312, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/self/key/bias:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/self/value/kernel:0 (312, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/self/value/bias:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/output/dense/kernel:0 (312, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/output/dense/bias:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/output/LayerNorm/beta:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/output/LayerNorm/gamma:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/intermediate/dense/kernel:0 (312, 1248) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/intermediate/dense/bias:0 (1248,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/output/dense/kernel:0 (1248, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/output/dense/bias:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/output/LayerNorm/beta:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/output/LayerNorm/gamma:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/pooler/dense/kernel:0 (312, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/pooler/dense/bias:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_xi:0 (312, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_hi:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_ci:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_xo:0 (312, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_ho:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_co:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_xc:0 (312, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_hc:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_b_i:0 (100,)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_b_c:0 (100,)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_b_o:0 (100,) 
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_xi:0 (312, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_hi:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_ci:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_xo:0 (312, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_ho:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_co:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_xc:0 (312, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_hc:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_b_i:0 (100,)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_b_c:0 (100,)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_b_o:0 (100,)
  name = %s, shape = %s%s project/hidden/W:0 (200, 100)
  name = %s, shape = %s%s project/hidden/b:0 (100,)
  name = %s, shape = %s%s project/logits/W:0 (100, 9)
  name = %s, shape = %s%s project/logits/b:0 (9,)
  name = %s, shape = %s%s crf_loss/transitions:0 (10, 10)
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:81: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.     

WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:95: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\utils.py:177: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.

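Hello, and thank you for open-sourcing the code. After getting it running I still hit a problem: the SPO predictions all come out as comma-separated single characters. Is this a configuration problem somewhere?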

{'subjs': [], 'objs': ['5', '下', '，', '考', '了', '略', '集', '上', '办', '室', '该', '情'], 'preds': ['团', '海', '公', '产']}
请求: {'原文': '3月5日下午，上海市政协副主席黄震考察了明略科技集团上海办公室，了解该集团防疫和复工复产情况。', '抽取结果': []}

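Hello, can this triple extraction be applied to English text?

keras.load_model raises an error

It reports AttributeError: 'str' object has no attribute 'decode'
This is because the requirements here pin h5py to 3.1.0, which is too new; it only ran successfully after I switched to 2.10.0.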
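This matches a known incompatibility: h5py 3.x returns `str` in places where older Keras model-loading code expects `bytes` and calls `.decode()`. A small guard like the sketch below could flag the problem before a model is loaded. The function name is hypothetical, and the threshold simply reflects the report above (3.1.0 fails, 2.10.0 works).

```python
def h5py_is_pre_3(version: str) -> bool:
    """Return True if the given h5py version string has a major version below 3."""
    major = int(version.split(".")[0])
    return major < 3


# In a real environment the string would come from `h5py.__version__`.
for v in ("2.10.0", "3.1.0"):
    status = "OK" if h5py_is_pre_3(v) else "likely to trigger the 'decode' error"
    print(v, status)
```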
