miso-belica / sumy
Module for automatic summarization of text documents and HTML pages.
Home Page: https://miso-belica.github.io/sumy/
License: Apache License 2.0
I have tried to use the LexRank algorithm to summarize articles. However, it does not work when the article is very short, e.g. fewer than 20 sentences. It simply extracts sentences in the order they appear in the article and uses them as the summary.
I don't know if this can work for multidocument summary.
I tried using a folder (where all the txt files were) but it didn't work. It worked for a single document (I used LexRank and TextRank), but is there a way to feed in multiple text files and get just one summary?
Thank you
Hi,
thank you for the wonderful module!
I'm working in Scala, trying to implement an Edmundson summarisation technique like the one used in your module. Can you please give me a reference, paper, or pseudo-code implementation?
Have you got a list of academic papers as references for formulas?
Hi,
I checked the code for the Edmundson summarizer. As far as I can tell, it doesn't do anything for English. It is supposed to extract cue words, significant words, and title words, and rank the sentences based on these scores plus location. But when the input is a raw text file, the summarizer works only on the location of the sentence. Is that right? There is no method to extract the cue words, significant words, or title words from the text, so I suppose the implementation is wrong. Let me know if I misunderstood your code or I'm making a mistake. Thanks.
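For context, Edmundson's method ranks sentences by a weighted linear combination of cue-word, title-word, key-word, and location features. A minimal sketch of the idea (not sumy's actual code; the function name, weights, and the omission of the key-word feature are all my simplifications):

```python
def edmundson_rating(sentence_words, cue_words, title_words, position, total,
                     w_cue=1.0, w_title=1.0, w_location=1.0):
    # Hypothetical sketch of Edmundson's weighted feature combination.
    # The "key" (significant-word) feature is omitted for brevity.
    words = [w.lower() for w in sentence_words]
    cue = sum(1 for w in words if w in cue_words) / len(words)
    title = sum(1 for w in words if w in title_words) / len(words)
    location = 1.0 - position / total  # earlier sentences score higher
    return w_cue * cue + w_title * title + w_location * location
```

With empty cue and title word lists, only the location term contributes, which matches the behavior described above for raw text input.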
This is a great library.
But it raises an AttributeError in the SumBasic summarizer. I looked at the code: it turns out the stop_words attribute is not defined in the summarizer class. I guess that is an error. If so, could you please look into it?
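For reference, the SumBasic algorithm itself is simple: score sentences by average word probability, pick the best one, then square the probabilities of the words it used to reduce redundancy. A self-contained sketch (my own simplification, not sumy's implementation; sentences are lists of lowercase words):

```python
from collections import Counter

def sumbasic(sentences, count):
    # Minimal SumBasic sketch: pick `count` sentences greedily.
    words = [w for s in sentences for w in s]
    prob = {w: c / len(words) for w, c in Counter(words).items()}
    chosen = []
    while len(chosen) < count and len(chosen) < len(sentences):
        # Rank remaining sentences by average word probability.
        best = max((s for s in sentences if s not in chosen),
                   key=lambda s: sum(prob[w] for w in s) / len(s))
        chosen.append(best)
        # Down-weight already-covered words to reduce redundancy.
        for w in best:
            prob[w] = prob[w] ** 2
    return chosen
```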
Hi, I would like you to add Spanish support to the project.
In my town there is a research group very interested in this project with Spanish support.
I was wondering if e.g. the Brown corpus could be utilised to achieve better LSA results?
Or are the corpora only utilised for tagging?
Hi!
I will be brief, so as not to distract.
When run, the algorithm simply returns the first sentences of the text, in order.
For example, LexRankSummarizer: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
(the numbers are the positions of the selected sentences in the source text.)
An alternative implementation, https://github.com/TafadzwaPasipanodya/Nutshell, gives the following result on the same text:
Results LexRank: [94, 57, 42, 76, 66, 86, 83, 63]
Regards Alexander
Hi,
I have nltk 2.0.4 installed and the latest version of sumy. I'm running into the following error, any idea why?
>>> from sumy.nlp.tokenizers import Tokenizer
>>> from sumy.parsers.plaintext import PlaintextParser
>>> PlaintextParser.from_string("foo", Tokenizer('english'))
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "C:\project\libs\sumy\nlp\tokenizers.py", line 26, in __init__
self._sentence_tokenizer = self._sentence_tokenizer(tokenizer_language)
File "C:\project\libs\sumy\nlp\tokenizers.py", line 37, in _sentence_tokenizer
return nltk.data.load(to_string("file:") + file_path)
File "c:\Python27\lib\site-packages\nltk\data.py", line 605, in load
resource_val = pickle.load(_open(resource_url))
ImportError: No module named copy_reg
There are some bugs in the tf computation.
content_word_tf = dict((k, float(v) / content_words_count) for (k, v) in content_words_freq.items())
for w in summary_freq:
sum_val += doc_freq[w] * math.log(doc_freq[w] / summary_freq[w])
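For comparison, here is one way to normalize both distributions consistently before computing the KL divergence between document and summary word frequencies. This is a sketch under my own assumptions (function and variable names invented), not a patch for sumy's code:

```python
import math
from collections import Counter

def kl_divergence(summary_words, doc_words):
    # KL(doc || summary) over the words the two distributions share.
    doc_freq = Counter(doc_words)
    sum_freq = Counter(summary_words)
    # Normalize counts into probabilities with matching denominators.
    doc_tf = {w: c / len(doc_words) for w, c in doc_freq.items()}
    sum_tf = {w: c / len(summary_words) for w, c in sum_freq.items()}
    # Only words present in both distributions contribute a finite term.
    return sum(doc_tf[w] * math.log(doc_tf[w] / sum_tf[w])
               for w in doc_tf if w in sum_tf)
```

The divergence is zero when the two distributions match, which gives a quick sanity check for any fix.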
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\readme.txt -> build\lib.win32-3.5\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
running build_ext
building 'lxml.etree' extension
error: Unable to find vcvarsall.bat
Sumy is run using a Java ProcessBuilder (essentially a command-line call) to a Python script, which I posted here: http://pastebin.com/9JDbPFVH. A test text exemplifying the issue can be found here: http://pastebin.com/gD65sS22
An example would be running the sumyapi.py file with the arguments lex_rank, english, 3 and the provided text. The output is:
"Viral videos have become a staple of the social Web. The term refers to videos that are uploaded to video sharing sites such as YouTube, Vimeo, or Blip.tv and more or less quickly gain the attention of millions of people. Viral videos mainly contain humorous content such as bloopers in television shows (e.g. "
As is apparent, the algorithm has determined that "e.g." ends a sentence. This is not the case. The "(" in front of it has no effect, and replacing "e.g." with the less formal "eg." yields the same result as well.
I've also checked against the German language, in which the same construct "z.B." is correctly ignored by sumy in terms of sentence endings. I am told the Python NLTK should have taken care of this. I am not nearly proficient enough with Python to attempt to fix this.
Hi, Miso Belica! :) I was running sumy with the LexRank summarizer and I saw that the power_method always ends after one iteration. I think there is a bug: if you update p_vector with next_p before computing the new lambda_val, then the power_method ends after one iteration (that way lambda_val is always zero after the first iteration).
In addition, I tried to inspect the values of the scores variable in the __call__ function. There are usually very few values even though the document has a lot of sentences (maybe there is another bug, but I am not sure of this).
I found your project very useful! Thank you! :)
Marco
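To illustrate the bug Marco describes: in power iteration, the convergence delta must be measured against the vector from the previous step before it is overwritten. A minimal pure-Python sketch (my own formulation, not sumy's code; `matrix` is a row-stochastic transition matrix given as nested lists):

```python
def power_method(matrix, epsilon=1e-4, max_iter=10_000):
    # Power iteration for the stationary distribution of a
    # row-stochastic matrix M, i.e. the fixed point p = M^T p.
    n = len(matrix)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        next_p = [sum(matrix[j][i] * p[j] for j in range(n))
                  for i in range(n)]
        # Measure the change BEFORE overwriting p; overwriting first
        # would make the delta zero and stop after a single iteration.
        delta = sum(abs(a - b) for a, b in zip(next_p, p))
        p = next_p
        if delta < epsilon:
            break
    return p
```

For the 2x2 chain [[0.5, 0.5], [0.2, 0.8]] the stationary distribution is (2/7, 5/7), which the iteration reaches only after several steps; an implementation that stops after one iteration would return the uniform starting vector instead.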
I see the documentation for the LSA summarizer and how to use it in Python. I was wondering if you could also add examples of how to use the other types of summarization in Python?
Thanks,
Sam
When I try to pass any URL with the host name "techcrunch.com", I get an error in the code
parser = HtmlParser.from_url(url, Tokenizer(settings.LANGUAGE))
saying: SSLError: hostname 'techcrunch.com' doesn't match either of '*.wordpress.com', 'wordpress.com'
from sumy.parsers.html import PlaintextParser
ImportError: cannot import name PlaintextParser (or Tokenizer etc)
I've already tried pip uninstall and reinstalling, and no, my script name is not sumy.
Python 2.7, on mac. Thanks for helping!
readme.rst does not exist although setup.py tries to open it, resulting in this error.
$ sudo pip install git+git://github.com/miso-belica/sumy.git
Collecting git+git://github.com/miso-belica/sumy.git
Cloning git://github.com/miso-belica/sumy.git to ./pip-UqFNmA-build
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-UqFNmA-build/setup.py", line 14, in
with open("README.rst") as readme:
IOError: [Errno 2] No such file or directory: 'README.rst'
Here is the errant line:
https://github.com/miso-belica/sumy/blob/dev/setup.py#L14
Here is the pull request to fix this:
No matter what the sentences are, the scores returned by LexRank are the same for all of them,
i.e. 1/count(sentences)
file: lex_rank.py
...
matrix = self._create_matrix(sentences_words, self.threshold, tf_metrics, idf_metrics)
scores = self.power_method(matrix, self.epsilon)
print scores
>>> [0.083333333333333329, 0.083333333333333329, 0.083333333333333329, 0.083333333333333329, 0.083333333333333329, 0.083333333333333329, 0.083333333333333329, 0.083333333333333329, 0.083333333333333329, 0.083333333333333329, 0.083333333333333329, 0.083333333333333329]
Add a package for evaluation of summaries with some basic methods. Possible algorithms below:
I use NLTK to tokenize text into sentences & words. But that's a big package. Maybe something smaller would be better. Something like https://bitbucket.org/trebor74hr/text-sentence/overview
After installing, when I run the code for using sumy in my project, I am getting the following error:
ImportError: No module named parsers.plaintext
Nice library.
Using a temporary file is a little hassle, but it would be nice if you could just plug the text in as a command-line option or pipe it.
Hi
My Pip install threw the following error:
Installing collected packages: sumy, breadability, docopt
Compiling /private/var/folders/f4/hyql3hq17mdcnpxwdg9hn5g80000gn/T/pip_build_SamPetulla/sumy/sumy/evaluation/__main__.py ...
File "/private/var/folders/f4/hyql3hq17mdcnpxwdg9hn5g80000gn/T/pip_build_SamPetulla/sumy/sumy/evaluation/__main__.py", line 86
dtext-rank build_text_rank(parser, language):
^
SyntaxError: invalid syntax
Not sure if that mattered. I cannot use the module, though. I am getting some lxml2 errors in IPython:
ImportError Traceback (most recent call last)
<ipython-input-5-3bffe08c3c10> in <module>()
2 from __future__ import division, print_function, unicode_literals
3
----> 4 from sumy.parsers.html import HtmlParser
5 from sumy.parsers.plaintext import PlaintextParser
6 from sumy.nlp.tokenizers import Tokenizer
/Users/SamPetulla/anaconda/lib/python2.7/site-packages/sumy/parsers/html.py in <module>()
4 from __future__ import division, print_function, unicode_literals
5
----> 6 from breadability.readable import Article
7 from .._compat import urllib
8 from ..utils import cached_property
/Users/SamPetulla/anaconda/lib/python2.7/site-packages/breadability/readable.py in <module>()
8 from operator import attrgetter
9 from pprint import PrettyPrinter
---> 10 from lxml.html.clean import Cleaner
11 from lxml.etree import tounicode, tostring
12 from lxml.html import fragment_fromstring, fromstring
/Users/SamPetulla/anaconda/lib/python2.7/site-packages/lxml/html/__init__.py in <module>()
40 from urllib.parse import urljoin
41 import copy
---> 42 from lxml import etree
43 from lxml.html import defs
44 from lxml.html._setmixin import SetMixin
ImportError: dlopen(/Users/SamPetulla/anaconda/lib/python2.7/site-packages/lxml/etree.so, 2): Library not loaded: libxml2.2.dylib
Referenced from: /Users/SamPetulla/anaconda/lib/python2.7/site-packages/lxml/etree.so
Reason: Incompatible library version: etree.so requires version 12.0.0 or later, but libxml2.2.dylib provides version 10.0.0
In LexRank, _cosine_distance is calculated as follows:
the numerator does not count repeated words,
while the denominator does count repeated words.
So for a sentence with words occurring more than once, with tf-idf vector v:
v . v < 1.
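A consistent cosine similarity counts term frequencies the same way in both the numerator and the denominator, so that a sentence compared with itself always scores exactly 1.0. A self-contained sketch (my own formulation over raw term frequencies, not sumy's tf-idf code):

```python
import math
from collections import Counter

def cosine_similarity(words1, words2):
    # Count repeated words identically in dot product and norms,
    # so that cosine_similarity(s, s) == 1.0 for any sentence s.
    tf1, tf2 = Counter(words1), Counter(words2)
    dot = sum(tf1[w] * tf2[w] for w in set(tf1) & set(tf2))
    norm1 = math.sqrt(sum(v * v for v in tf1.values()))
    norm2 = math.sqrt(sum(v * v for v in tf2.values()))
    return dot / (norm1 * norm2)
```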
Do you use a particular dataset?
Hi, I have a question about the TextRank module. As I understand it, TextRank is based on the PageRank algorithm. However, in the text_rank.py file, I only see code that builds edges between sentences; it doesn't seem to use an iterative solution to compute the ranks. I don't know if I understand correctly; I am looking forward to your answer. Thx!
The following string will throw an error when you try to summarize with any sentence count.
string = "YO! LONDON! WE'RE PLAYING THE ELECTRIC BALLBAG TONIGHT! COME HANG!"
To fix this, add the following to your repo:
string = string.lower() if string.isupper() else string
In lsa.py:
try:
import numpy
except ImportError:
numpy = None
try:
from numpy.linalg import svd as singular_value_decomposition
except ImportError:
singular_value_decomposition = None
and somewhere below, this dependency is checked in the __call__ of the summarizer.
I understand that this way one doesn't have to install numpy if one isn't interested in numpy-based summarizers. But most Python distributions people use have numpy installed already anyway.
Why not just add numpy as a dependency? Then we could remove that _ensure_dependencies_installed() thing entirely.
It would also let us remove some tests like test_numpy_not_installed, which essentially test Python itself.
I could open a pull request for it, if you want.
Hi,
I've had a mess around with sumy and it seems to be perfect for the small project I've been working on. However, I've only been able to work with URLs. What code would I need to summarize a block of text, either saved in a variable or loaded from a .txt file?
Regards.
Not a problem, but more a cheeky question to leech some of your knowledge. Playing around with this text summarisation, it appears to deal OK with something like a wiki article, but falls apart when given a forum thread. I assume it's the difference between an article, which is a 100% 'final understanding' of one or more topics, and a forum thread, where understanding hopefully builds towards some sort of conclusion (so presumably the software has to actually understand what is being talked about to summarise it), and in a much looser format.
I'm guessing that actual understanding of what is discussed, and then summarising it, is light years beyond the current capability of sumy and of libre text summarisation software in general; is this correct?
Thanks
I am getting a crash during the singular value decomposition in lsa.py:
u, sigma, v = singular_value_decomposition(matrix, full_matrices=False)
The exception is LinAlgError: SVD did not converge
I saved the input file here: http://pastebin.com/s0RNZ2J2
To Reproduce:
# "text" is the input text saved at http://pastebin.com/s0RNZ2J2
parser = PlaintextParser.from_string(text, LawTokenizer("english"))
# I'm using the porter2 stemmer, but I don't think that matters
summarizer = sumy.summarizers.lsa.LsaSummarizer(stem_word)
# This is the standard stop words list
summarizer.stop_words = get_stop_words("english")
# We get the crash here when it calls down to lsa.py
summaries = summarizer(parser.document, sentences)
Hey, does this package offer multi-document summarization? If yes, is there an example available somewhere?
Hey
So I am reading in a text file of sentences,
and when I do --stopwords=english.txt
it does not remove the stopwords. Some are lower case and some are upper case.
Any ideas what I am doing wrong?
I get this error when I use HtmlParser.
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/sumy/parsers/html.py", line 6, in
from readability.readable import Article
ImportError: No module named readable
Was there a change in the readability module recently? Can you please check? Thanks
Can we make it GAE-ready? When I execute it under GAE, it shows the following error.
mary/lib/nltk/tag/hunpos.py", line 16, in <module>
from subprocess import Popen, PIPE
ImportError: cannot import name Popen
I'm running into this assertion when using the lsa summarizer. Does anyone have some advice for avoiding it?
When an upper-case string is passed, the tokenizer does not seem to work correctly and returns zero sentences; therefore, inside the algorithm, a division-by-zero exception is encountered. The issue can be fixed by lower-casing the string prior to passing it to the algorithm.
Hi,
pretty new to Python and the field of automated summarization, but your sumy was a great introduction.
Currently I still lack a bit of usage info on sum_eval. Can you shed some light on this topic and its proper usage? (I didn't really get it from sum_eval --help.)
Upgraded Sumy and get this error upon running it.
The debugged program raised the unhandled exception TypeError:
"__new__() takes exactly 2 arguments (1 given)"
File: /usr/local/lib/python2.7/dist-packages/sumy/nlp/stemmers/german.py, Line: 9
Break here?
I tried to set the variable LANGUAGE = "chinese", but it does not work; it gives an error.
Dear Mišo,
after installing sumy via pip (version 0.3.0), I tried to run your first usage example:
$ sumy lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization
Traceback (most recent call last):
File "/usr/local/bin/sumy", line 9, in <module>
load_entry_point('sumy==0.3.0', 'console_scripts', 'sumy')()
File "/usr/local/lib/python2.7/dist-packages/sumy/__main__.py", line 65, in main
summarizer, parser, items_count = handle_arguments(args)
File "/usr/local/lib/python2.7/dist-packages/sumy/__main__.py", line 102, in handle_arguments
stemmer = Stemmer(language)
File "/usr/local/lib/python2.7/dist-packages/sumy/nlp/stemmers/__init__.py", line 28, in __init__
raise LookupError("Stemmer is not available for language %s." % language)
LookupError: Stemmer is not available for language english.
Any ideas?
Kind regards,
Arne
Hi - do you have any implementations that do sentence compression?
Python 2.7 on Win32. The problem with UnicodeDecodeError requires generic handling, I suppose. The input text is plain text, so there is no reason for the error message.
In lex_rank.py, what is the motivation for calculating IDF for a word, w, as:
IDF(w) = N/k
where
N = total number of sentences in the document
k = the number of sentences containing the word
It is my understanding that:
IDF(w) = log (D/k)
where
D = the number of documents in the corpus
k = the number of documents containing w.
Are you treating each sentence as its own document? If so, shouldn't there be a log thrown in front?
I love this project by the way, thank you so much!
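The textbook definition discussed above, with each sentence treated as its own "document", can be sketched as follows (my own helper, not sumy's code):

```python
import math

def idf(sentences, word):
    # Textbook IDF over sentences-as-documents:
    # idf(w) = log(N / k), N = sentence count, k = sentences containing w.
    n = len(sentences)
    k = sum(1 for words in sentences if word in words)
    return math.log(n / k) if k else 0.0
```

Without the log, the N/k ratio still orders words the same way, but it weights rare words far more aggressively, which may be the practical difference at stake here.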
Hi,
First of all, thanks for this wonderful module.
I have installed sumy locally and ran the following command in my terminal
from the sumy-master folder:
/home/dev001/projects_new/demos/sumy-master$ sumy lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization
Then I tried document summarization:
I created a simple test.txt and also a simple success.pdf file in the sumy-master folder and ran the following command:
sumy_eval edmundson test.txt --language=english --file=http://192.168.1.86/dev001/projects_new/pivotv5/file/document/2013/09/c514e708cbbc198fc89111333d0ce53b.docx --format=docx
But I am getting an invalid syntax error. I also tried with a relative path to the file as --file=home/dev001/projects_new/demos/sumy-master/success.pdf
Then I tried the following command:
sumy edmundson --language=english --file=/home/dev001/projects_new/pivotv5/file/document/2013/09/c514e708cbbc198fc89111333d0ce53b.docx --format=docx
I am getting the error UnicodeDecodeError: 'utf8' codec can't decode byte 0x87 in position 16: invalid start byte
Please help me. Kindly give me an example of document summarization.
It would be a great relief for me.
Thanks in advance
I've tried adding a Punkt tokenizer in Russian (from https://github.com/mhq/train_punkt) and a stopwords list from http://www.ranks.nl/stopwords/russian
It seems to be working fine; is this a correct approach?
Michael, hello.
A few days ago I started using your library sumy, but there is no support for the Russian language in it. I had to make some changes: I took czech.py as a basis and replaced the implementation of stem_word to use the Pymorphy2 library.
But today I found http://stackoverflow.com/questions/5479333/summarize-text-or-simplify-text where you write "Feel free to open an issue or send a pull request if there is something you are missing." So I decided to contact you personally and ask you to add support for the Russian language to sumy. No one can do it better than the author.
I would also like to know why not make the library capable of learning, for example as in this implementation: https://github.com/vighneshbirodkar/summarize/blob/master/summarize/Base/DocumentClass.py If yes, then write to me; I have some ideas on this subject.
Regards, Alexander, [email protected]
Hi. Firstly, I'm very grateful for this Python implementation, but I didn't understand how summary-level ROUGE-L works in the code.
With the sentence-level type of ROUGE, we can use more than one sentence in the candidate summary. After that, we compute the LCS between the reference summary and Candidate Sentence_1 and Candidate Sentence_2, respectively.
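For reference, the longest common subsequence computation at the heart of ROUGE-L can be sketched as a standard dynamic program over word sequences (a generic textbook version, not sumy's evaluation code):

```python
def lcs_length(a, b):
    # Dynamic-programming longest common subsequence length,
    # the core quantity in ROUGE-L recall and precision.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]
```

Summary-level ROUGE-L then combines, per reference sentence, the union of its LCS matches against all candidate sentences, rather than scoring each candidate sentence independently.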
When I try to use it at the command prompt, how do I write it?
And how should my candidate sentences be structured (in the local.txt file)?
Just one line, or one line per sentence?
Thanks.