yongzhuo / nlg-yongzhuo

Chinese text-generation (NLG) toolkit for text summarization, with corpus data. Extractive summarization via Lead-3, keyword, TextRank, TextTeaser, word significance, LDA, LSI, and NMF. (graph-, feature-, and topic-model-based summarization toolkit)

Home Page: https://blog.csdn.net/rensihui

License: MIT License

Python 100.00%
nlg text-summarization lead3 textrank textteaser lda lsi nmf word-significance tookit

nlg-yongzhuo's Introduction

PyPI Build Status PyPI_downloads Stars Forks Join the chat at https://gitter.im/yongzhuo/nlg-yongzhuo

Install(安装)

pip install nlg-yongzhuo

API (combined call: merge several algorithms)

from nlg_yongzhuo import *

doc = "PageRank算法简介。" \
      "是上世纪90年代末提出的一种计算网页权重的算法!" \
      "当时,互联网技术突飞猛进,各种网页网站爆炸式增长。" \
      "业界急需一种相对比较准确的网页重要性计算方法。" \
      "是人们能够从海量互联网世界中找出自己需要的信息。" \
      "百度百科如是介绍他的思想:PageRank通过网络浩瀚的超链接关系来确定一个页面的等级。" \
      "Google把从A页面到B页面的链接解释为A页面给B页面投票。" \
      "Google根据投票来源甚至来源的来源,即链接到A页面的页面。" \
      "和投票目标的等级来决定新的等级。简单的说," \
      "一个高等级的页面可以使其他低等级页面的等级提升。" \
      "具体说来就是,PageRank有两个基本思想,也可以说是假设。" \
      "即数量假设:一个网页被越多的其他页面链接,就越重要。" \
      "质量假设:一个网页越是被高质量的网页链接,就越重要。" \
      "总的来说就是一句话,从全局角度考虑,获取重要的信息。"

# fs can be any subset of: text_pronouns, text_teaser, mmr, text_rank, lead3, lda, lsi, nmf
res_score = text_summarize(doc, fs=[text_pronouns, text_teaser, mmr, text_rank, lead3, lda, lsi, nmf])
for rs in res_score:
    print(rs)
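text_summarize weights and merges the scores returned by each algorithm listed in fs. The exact fusion is internal to the library; a plausible minimal sketch, where fuse_scores and the min-max normalization are illustrative assumptions rather than the library's code:

```python
def fuse_scores(per_algo_scores):
    """Merge sentence scores from several summarization algorithms.

    per_algo_scores: list of dicts mapping sentence -> raw score.
    Each algorithm's scores are min-max normalized so no single
    algorithm dominates, then summed per sentence.
    """
    fused = {}
    for scores in per_algo_scores:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on constant scores
        for sent, s in scores.items():
            fused[sent] = fused.get(sent, 0.0) + (s - lo) / span
    # highest fused score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranked = fuse_scores([
    {"s1": 0.9, "s2": 0.1, "s3": 0.5},   # e.g. one algorithm's raw scores
    {"s1": 2.0, "s2": 4.0, "s3": 8.0},   # e.g. another algorithm's raw scores
])
```

After normalization, "s3" scores 0.5 + 1.0 = 1.5 and comes out on top, even though neither algorithm ranked it first on its own.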

Usage (see the /test/ directory for more examples)

# feature_base
from nlg_yongzhuo import word_significance
from nlg_yongzhuo import text_pronouns
from nlg_yongzhuo import text_teaser
from nlg_yongzhuo import mmr
# graph_base
from nlg_yongzhuo import text_rank
# topic_base
from nlg_yongzhuo import lda
from nlg_yongzhuo import lsi
from nlg_yongzhuo import nmf
# nous_base
from nlg_yongzhuo import lead3


docs = "和投票目标的等级来决定新的等级。简单的说," \
       "是上世纪90年代末提出的一种计算网页权重的算法!" \
       "当时,互联网技术突飞猛进,各种网页网站爆炸式增长。" \
       "业界急需一种相对比较准确的网页重要性计算方法。" \
       "是人们能够从海量互联网世界中找出自己需要的信息。" \
       "百度百科如是介绍他的思想:PageRank通过网络浩瀚的超链接关系来确定一个页面的等级。" \
       "Google把从A页面到B页面的链接解释为A页面给B页面投票。" \
       "Google根据投票来源甚至来源的来源,即链接到A页面的页面。" \
       "一个高等级的页面可以使其他低等级页面的等级提升。" \
       "具体说来就是,PageRank有两个基本思想,也可以说是假设。" \
       "即数量假设:一个网页被越多的其他页面链接,就越重要。" \
       "质量假设:一个网页越是被高质量的网页链接,就越重要。" \
       "总的来说就是一句话,从全局角度考虑,获取重要的信息。"
# 1. word_significance
sums_word_significance = word_significance.summarize(docs, num=6)
print("word_significance:")
for sum_ in sums_word_significance:
    print(sum_)

# 2. text_pronouns
sums_text_pronouns = text_pronouns.summarize(docs, num=6)
print("text_pronouns:")
for sum_ in sums_text_pronouns:
    print(sum_)

# 3. text_teaser
sums_text_teaser = text_teaser.summarize(docs, num=6)
print("text_teaser:")
for sum_ in sums_text_teaser:
    print(sum_)
# 4. mmr
sums_mmr = mmr.summarize(docs, num=6)
print("mmr:")
for sum_ in sums_mmr:
    print(sum_)
# 5.text_rank
sums_text_rank = text_rank.summarize(docs, num=6)
print("text_rank:")
for sum_ in sums_text_rank:
    print(sum_)
# 6. lda
sums_lda = lda.summarize(docs, num=6)
print("lda:")
for sum_ in sums_lda:
    print(sum_)
# 7. lsi
sums_lsi = lsi.summarize(docs, num=6)
print("lsi:")
for sum_ in sums_lsi:
    print(sum_)
# 8. nmf
sums_nmf = nmf.summarize(docs, num=6)
print("nmf:")
for sum_ in sums_nmf:
    print(sum_)
# 9. lead3
sums_lead3 = lead3.summarize(docs, num=6)
print("lead3:")
for sum_ in sums_lead3:
    print(sum_)
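The sample text above describes PageRank, and text_rank applies the same idea to a sentence-similarity graph. A minimal, self-contained sketch, not the package's actual implementation; it uses whitespace tokenization for brevity, whereas real Chinese text needs a segmenter such as jieba:

```python
import math

def textrank(sentences, d=0.85, iters=50):
    """Rank sentences by PageRank over a word-overlap similarity graph."""
    words = [set(s.split()) for s in sentences]
    n = len(sentences)
    # similarity(i, j) = |overlap| / (log|i| + log|j|), as in TextRank
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and len(words[i]) > 1 and len(words[j]) > 1:
                overlap = len(words[i] & words[j])
                sim[i][j] = overlap / (math.log(len(words[i])) + math.log(len(words[j])))
    # damped power iteration: each sentence "votes" for similar sentences
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - d) + d * sum(
                sim[j][i] / sum(sim[j]) * scores[j]
                for j in range(n) if sum(sim[j]) > 0 and sim[j][i] > 0
            )
            for i in range(n)
        ]
    return sorted(zip(sentences, scores), key=lambda t: t[1], reverse=True)

ranked = textrank([
    "the cat sat on the mat",
    "the cat ate the fish",
    "dogs bark loudly outside",
])
```

The third sentence shares no words with the others, so it receives only the damping floor (1 - d) and ranks last.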

nlg_yongzhuo

- text_summary
- text_augment(todo)
- text_generation(todo)
- text_translation(todo)

run (e.g. text_teaser)

- 1. cd into the module directory, e.g.: nlg_yongzhuo/text_summary/feature_base/
- 2. run: python text_teaser.py

nlg_yongzhuo/data

Models, papers, and links

References / Acknowledgements

Hope this helps!

nlg-yongzhuo's People

Contributors: moyongzhuo, yongzhuo

nlg-yongzhuo's Issues

Question about model acceleration

Can text_summary_merge run on a GPU? I need to process data in large batches, and running it directly is quite slow, so GPU acceleration would help a lot.

Import error: cannot import name '_get_n_jobs'

ImportError                               Traceback (most recent call last)
<ipython-input-11-bb77d769d63d> in <module>
     15 from bs4 import BeautifulSoup
     16 from nltk.stem import WordNetLemmatizer
---> 17 from nlg_yongzhuo import *

~/anaconda3/envs/python36/lib/python3.6/site-packages/nlg_yongzhuo/__init__.py in <module>
     15 from nlg_yongzhuo.text_summarization.extractive_sum.nous_base.lead_3.lead_3 import Lead3Sum
     16 
---> 17 from nlg_yongzhuo.text_summarization.extractive_sum.topic_base.topic_lda import LDASum
     18 from nlg_yongzhuo.text_summarization.extractive_sum.topic_base.topic_lsi import LSISum
     19 from nlg_yongzhuo.text_summarization.extractive_sum.topic_base.topic_nmf import NMFSum

~/anaconda3/envs/python36/lib/python3.6/site-packages/nlg_yongzhuo/text_summarization/extractive_sum/topic_base/topic_lda.py in <module>
     13 # sklearn
     14 from sklearn.feature_extraction.text import CountVectorizer
---> 15 from sklearn.decomposition import LatentDirichletAllocation
     16 import numpy as np
     17 

~/anaconda3/envs/python36/lib/python3.6/site-packages/sklearn/decomposition/__init__.py in <module>
      9 from .incremental_pca import IncrementalPCA
     10 from .kernel_pca import KernelPCA
---> 11 from .sparse_pca import SparsePCA, MiniBatchSparsePCA
     12 from .truncated_svd import TruncatedSVD
     13 from .fastica_ import FastICA, fastica

~/anaconda3/envs/python36/lib/python3.6/site-packages/sklearn/decomposition/sparse_pca.py in <module>
     11 from ..linear_model import ridge_regression
     12 from ..base import BaseEstimator, TransformerMixin
---> 13 from .dict_learning import dict_learning, dict_learning_online
     14 
     15 

~/anaconda3/envs/python36/lib/python3.6/site-packages/sklearn/decomposition/dict_learning.py in <module>
     18 from ..externals.joblib import Parallel, delayed, cpu_count
     19 from ..externals.six.moves import zip
---> 20 from ..utils import (check_array, check_random_state, gen_even_slices,
     21                      gen_batches, _get_n_jobs)
     22 from ..utils.extmath import randomized_svd, row_norms

ImportError: cannot import name '_get_n_jobs'

Feedback

It would be great to have a single end-to-end entry point: given an article, output the summary directly.

The following error occurs during pip install; how can it be fixed?

error: subprocess-exited-with-error

python setup.py egg_info did not run successfully.
exit code: 1

[15 lines of output]
C:\Users\hs6\AppData\Local\Temp\pip-install-mw7z_iai\pandas_fd94fc07bd02449c82bdc4a3c67de894\setup.py:12: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
import pkg_resources
C:\Users\hs6\anaconda3\Lib\site-packages\setuptools\__init__.py:84: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
!!

      ********************************************************************************
      Requirements should be satisfied by a PEP 517 installer.
      If you are using pip, you can try `pip install --use-pep517`.
      ********************************************************************************

!!
dist.fetch_build_eggs(dist.setup_requires)
error in pandas setup command: 'install_requires' must be a string or list of strings containing valid project/version requirement specifiers; Expected end or semicolon (after version specifier)
pytz >= 2011k
~~~~~~~^
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Encountered error while generating package metadata.

See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Question about sklearn's lda result

topic_lda.py:

    ...
    res_lda_v = lda.components_
    ...

lda.components_ represents the word distribution of each topic. However:

    else:
        ### option 2 (方案二): take sentences by their highest topic probability, regardless of topic
        res_combine = {}
        for i in range(len_sentences_cut):
            res_row_i = res_lda_v[:, i]
            res_row_i_argmax = np.argmax(res_row_i)
            res_combine[self.sentences[i]] = res_row_i[res_row_i_argmax]

So res_lda_v[:, i] retrieves word probabilities, yet the final result is treated as a per-sentence probability. How are the sentences actually being extracted here? Were the sentences themselves used as the documents when fitting the LDA model?
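One way to see the concern: lda.components_ has shape (n_topics, n_vocab), so res_lda_v[:, i] indexes the i-th vocabulary term, while the per-sentence topic distribution would instead come from lda.transform(doc_term_matrix), shaped (n_sentences, n_topics). A toy sketch of scoring sentences from a hypothetical, randomly generated doc-topic matrix (the data here is illustrative, not from the library):

```python
import random
random.seed(0)

n_sentences, n_topics = 4, 3
# doc_topic[i][k]: probability of topic k in sentence i (rows sum to 1),
# i.e. what sklearn's lda.transform() returns; by contrast lda.components_
# is (n_topics, n_vocab) topic-word weights, indexed by vocabulary term.
doc_topic = []
for _ in range(n_sentences):
    raw = [random.random() for _ in range(n_topics)]
    total = sum(raw)
    doc_topic.append([r / total for r in raw])

# Score each sentence by the probability of its strongest topic.
scores = [max(row) for row in doc_topic]
```

With this matrix, each sentence's score is a genuine topic probability for that sentence, which is what the quoted code appears to intend.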

Lead-3 question

Two questions about the Lead-3 implementation:

  1. In Lead3Sum's summarize, is the num_min computation necessary?
  2. What is the rationale behind the scoring formula? The computed scores descend toward the last sentence, then jump up again at the end; as far as I can tell this is no different from mix.
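For comparison, the plain Lead-N baseline simply keeps the first sentences. A minimal sketch; the naive regex splitter is an assumption, and this omits whatever weighting Lead3Sum applies:

```python
import re

def lead_n(text, num=3):
    """Lead-N baseline: return the first `num` sentences as the summary."""
    # naive split on Chinese/Western sentence terminators
    sentences = [s for s in re.split(r"[。!?!?]", text) if s.strip()]
    return sentences[:num]

summary = lead_n("第一句。第二句!第三句?第四句。", num=3)
# → ["第一句", "第二句", "第三句"]
```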

Data question

Hi, which of the datasets in the Baidu Cloud share did you use for summarization?

pip installation problem

Thank you for your contribution.
On Ubuntu 18.04 I tried Python 3.7 and 3.8; installation fails, same as #13.
With Python 3.6 the bug is:

File "/home/xxx/Software/yes/envs/nlg/lib/python3.6/site-packages/smart_open/s3.py", line 9
    from __future__ import annotations
    ^
SyntaxError: future feature annotations is not defined

Question about text preprocessing

Hi, thanks for open-sourcing the code!
Two questions:

  1. How efficient is the LDA model if used on mobile devices?
  2. Is there a usable C/C++ text-preprocessing toolkit, for word segmentation, stemming, and the like?
