
spo_extract_platform's Introduction

An attempt at open-domain relation extraction.

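Platform components

  1. Annotation platform (front-end web page), in the directory spo_tagging_platform
  2. Annotated content: S, P, O, is_tagging, the original text, and the relation among S, P and O.
  3. Models:
  • S, P, O: sequence labeling (ALBERT+BiLSTM+CRF), in the directory sequence_labeling; F1 on the test set is about 81%;
  • Relation extraction: binary text classification (ALBERT+BiGRU+ATT), in the directory text_classification; accuracy on the test set is about 96%.

The annotated corpus comes from news and novel text.

  4. Applications of the project to extracting triples from novels, news, and other unstructured text, in the directory extract_example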
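The two models listed above form a pipeline: sequence labeling proposes S, P and O spans, and the binary classifier decides whether a candidate (S, P, O) combination is a real triple. The following is a minimal sketch of that flow, not the project's actual code. The BIO tag names (`B-SUBJ`, `I-PRED`, ...), the function names, and the `is_valid` callback standing in for the classifier are all assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def decode_spans(chars: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Group per-character BIO tags (hypothetical label set) into text spans."""
    spans: Dict[str, List[str]] = {"SUBJ": [], "PRED": [], "OBJ": []}
    current_type: Optional[str] = None
    buffer: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag.startswith("I-") and tag[2:] == current_type:
            buffer.append(ch)  # the span in progress continues
            continue
        # Any other tag closes the span in progress, if there is one.
        if current_type in spans:
            spans[current_type].append("".join(buffer))
        if tag.startswith("B-"):
            current_type, buffer = tag[2:], [ch]
        else:
            current_type, buffer = None, []
    if current_type in spans:
        spans[current_type].append("".join(buffer))
    return spans


def extract_triples(
    chars: List[str], tags: List[str], is_valid: Callable[[Triple], bool]
) -> List[Triple]:
    """Enumerate every S x P x O combination and keep those the classifier accepts."""
    spans = decode_spans(chars, tags)
    candidates = product(spans["SUBJ"], spans["PRED"], spans["OBJ"])
    return [triple for triple in candidates if is_valid(triple)]


if __name__ == "__main__":
    sentence = list("马云创办了阿里巴巴")
    labels = ["B-SUBJ", "I-SUBJ", "B-PRED", "I-PRED", "O",
              "B-OBJ", "I-OBJ", "I-OBJ", "I-OBJ"]
    # A stand-in classifier that accepts everything.
    print(extract_triples(sentence, labels, lambda triple: True))
```

Enumerating all combinations is what makes the second stage necessary: a sentence with several subjects or objects produces many candidates, and only the classifier separates real triples from accidental pairings.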

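Data

  At present there are 3,211 samples for the sequence labeling model and 9,279 annotated samples for relation extraction, covering 1,365 distinct relations.

  [Figure: the 20 most frequent relations]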
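A ranking of the most frequent relations like the one described above can be computed with a simple frequency count. This is a small sketch that assumes the relation labels have already been loaded into a Python list; the function name and the sample labels are hypothetical, and the project's real data format is not shown here.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_relations(relations: Iterable[str], k: int = 20) -> List[Tuple[str, int]]:
    """Return the k most frequent relation labels with their counts."""
    return Counter(relations).most_common(k)


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    sample = ["妻子", "导演", "妻子", "主演", "导演", "妻子"]
    print(top_relations(sample, k=2))
```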

Preparation before using the platform

  • The platform is developed in Python 3; see requirements.txt for the modules to install.

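How do I use the platform?

The sequence labeling model and the binary text classification model are already trained, so you can clone the repository and use them directly.

  1. Run sequence_labeling/run.py; this HTTP service listens on port 12306;

  2. Run text_classification/extract_server.py; this HTTP service listens on port 12308;

Then send a request with Postman. The input is a single sentence; keep it short, ideally no more than 128 characters.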
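The same request can be prepared from Python instead of Postman. The sketch below only builds the request and enforces the 128-character recommendation. The two port numbers come from the steps above, but the URL path `/predict`, the JSON field name `text`, and the choice of sending to port 12308 are assumptions, since the request format is not documented here. Check the server scripts for the real route and parameter names before using it.

```python
import json
import urllib.request

MAX_SENTENCE_LENGTH = 128      # recommended upper bound from the README
SEQUENCE_LABELING_PORT = 12306
EXTRACT_SERVER_PORT = 12308


def build_request(
    sentence: str,
    port: int = EXTRACT_SERVER_PORT,
    path: str = "/predict",    # hypothetical route
    field: str = "text",       # hypothetical JSON field name
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying one sentence."""
    if len(sentence) > MAX_SENTENCE_LENGTH:
        raise ValueError(
            f"sentence has {len(sentence)} characters; "
            f"keep it within {MAX_SENTENCE_LENGTH}"
        )
    body = json.dumps({field: sentence}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"http://localhost:{port}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_request("马云创办了阿里巴巴")
    print(request.get_method(), request.full_url)
    # With both services running, the request could then be sent with:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
```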

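Results

  When the data was annotated, most of the annotated content covered personal titles, relationships between people, relationships between companies and people, and the lead actors and directors of films and TV series.

  Extraction works relatively well when a sentence contains only one triple.

  The extract_example directory contains extraction results on several novels and some news articles. For a demonstration, see another project: https://github.com/percent4/knowledge_graph_demo

  Some sentences also yield useless triples, so recall runs high while precision suffers. This is because the project targets open-domain triple extraction, so the results are not as good as one might hope. Ways to improve extraction:

  • Annotate more data; the sequence labeling model currently has only 3,211 or so samples;
  • Model: the current design is a pipeline, and each stage performs reasonably on its own, but overall it is not as good as a joint model;
  • If you want to extract other kinds of triples, add annotations for those cases;
  • Prediction on text used to be slow (this issue has been resolved).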
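The effect of extracting useless triples can be made concrete with set-based precision, recall and F1 over triples. This is a generic sketch, not the evaluation script used by the project; the function name and the example triples are made up for illustration.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def precision_recall_f1(
    predicted: Iterable[Triple], gold: Iterable[Triple]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between two collections of triples."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [("马云", "创办", "阿里巴巴")]
    # One correct triple plus one useless extra triple.
    predicted = [("马云", "创办", "阿里巴巴"), ("阿里巴巴", "创办", "马云")]
    print(precision_recall_f1(predicted, gold))
```

In this example the extra triple leaves recall at 1.0 but halves precision, which is the pattern described above for over-extraction.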

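Contact

  This project is the author's attempt at open-domain triple extraction. Before it, there were few articles or projects on the topic, so it is fair to call it exploratory.

  The source code and data are included in the project.

  For a more in-depth discussion, please send a message to [email protected], or leave a comment directly on GitHub.

  The author's WeChat official account is Python爬虫与算法; feel free to follow it.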

spo_extract_platform's People

Contributors

jclian91

spo_extract_platform's Issues

404 Not Found

The page does not load when I run the project. Any pointers would be appreciated.

Problem when running run

After running run, the service does not start.

WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\albert_zh\tokenization.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From f:/代码/spo_extract_platform/spo_extract_platform-master/sequence_labeling/runServer.py:34: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From f:/代码/spo_extract_platform/spo_extract_platform-master/sequence_labeling/runServer.py:39: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-09-15 14:49:14.849269: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:28: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\albert_zh\modeling.py:171: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

embedding_lookup_factorized. factorized embedding parameterization is used.
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\albert_zh\modeling.py:482: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.     

ln_type: postln
old structure of transformer.use: transformer_model,which use post-LN
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\albert_zh\modeling.py:750: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:45: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:136: bidirectional_dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
WARNING:tensorflow:From E:\anaconda\anaconda\envs\spo_extract\lib\site-packages\tensorflow\python\ops\rnn.py:464: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
WARNING:tensorflow:From E:\anaconda\anaconda\envs\spo_extract\lib\site-packages\tensorflow\python\ops\init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From E:\anaconda\anaconda\envs\spo_extract\lib\site-packages\tensorflow\python\ops\rnn.py:244: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:153: The name tf.nn.xw_plus_b is deprecated. Please use tf.compat.v1.nn.xw_plus_b instead.

WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:66: The name tf.train.init_from_checkpoint is deprecated. Please use tf.compat.v1.train.init_from_checkpoint instead.

**** Trainable Variables ****
  name = %s, shape = %s%s bert/embeddings/word_embeddings:0 (21128, 128) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/embeddings/word_embeddings_2:0 (128, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/embeddings/token_type_embeddings:0 (2, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/embeddings/position_embeddings:0 (512, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/embeddings/LayerNorm/beta:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/embeddings/LayerNorm/gamma:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/self/query/kernel:0 (312, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/self/query/bias:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/self/key/kernel:0 (312, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/self/key/bias:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/self/value/kernel:0 (312, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/self/value/bias:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/output/dense/kernel:0 (312, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/output/dense/bias:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/output/LayerNorm/beta:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/attention/output/LayerNorm/gamma:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/intermediate/dense/kernel:0 (312, 1248) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/intermediate/dense/bias:0 (1248,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/output/dense/kernel:0 (1248, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/output/dense/bias:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/output/LayerNorm/beta:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/encoder/layer_shared/output/LayerNorm/gamma:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/pooler/dense/kernel:0 (312, 312) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s bert/pooler/dense/bias:0 (312,) , *INIT_FROM_CKPT*
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_xi:0 (312, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_hi:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_ci:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_xo:0 (312, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_ho:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_co:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_xc:0 (312, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_w_hc:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_b_i:0 (100,)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_b_c:0 (100,)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/fw/coupled_input_forget_gate_lstm_cell/_b_o:0 (100,) 
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_xi:0 (312, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_hi:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_ci:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_xo:0 (312, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_ho:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_co:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_xc:0 (312, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_w_hc:0 (100, 100)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_b_i:0 (100,)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_b_c:0 (100,)
  name = %s, shape = %s%s char_BiLSTM/bidirectional_rnn/bw/coupled_input_forget_gate_lstm_cell/_b_o:0 (100,)
  name = %s, shape = %s%s project/hidden/W:0 (200, 100)
  name = %s, shape = %s%s project/hidden/b:0 (100,)
  name = %s, shape = %s%s project/logits/W:0 (100, 9)
  name = %s, shape = %s%s project/logits/b:0 (9,)
  name = %s, shape = %s%s crf_loss/transitions:0 (10, 10)
WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:81: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.     

WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\model.py:95: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From f:\代码\spo_extract_platform\spo_extract_platform-master\sequence_labeling\utils.py:177: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.

keras.load_model raises an error

It fails with AttributeError: 'str' object has no attribute 'decode'
This is because requirements.txt here asks for h5py 3.1.0, which is too new. After switching to 2.10.0 it ran successfully.
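A quick way to check whether an environment is affected is to look at the installed h5py version before loading the model. The sketch below treats any 3.x release as suspect, which is an assumption generalized from this one report (3.1.0 failed, 2.10.0 worked); the function names are hypothetical.

```python
from importlib import metadata
from typing import Optional


def h5py_needs_downgrade(version: str) -> bool:
    """Return True for h5py 3.x and later (assumed threshold, see note above)."""
    major = int(version.split(".")[0])
    return major >= 3


def installed_h5py_version() -> Optional[str]:
    """Return the installed h5py version string, or None if it is not installed."""
    try:
        return metadata.version("h5py")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_h5py_version()
    if version is None:
        print("h5py is not installed")
    elif h5py_needs_downgrade(version):
        print(f"h5py {version} found; the report above succeeded with 2.10.0")
    else:
        print(f"h5py {version} found; no change suggested")
```

If the check flags the environment, the fix reported in this thread corresponds to `pip install h5py==2.10.0`.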
