potara's Introduction

What is this?

Potara is a multi-document summarization system that relies on Integer Linear Programming (ILP) and sentence fusion.

Its goal is to summarize a set of related documents in a few sentences. It proceeds by fusing similar sentences in order to create sentences that are either shorter or more informative than those found in the documents. It then uses ILP in order to choose the best set of sentences, fused or not, that will compose the resulting summary.

It relies on state-of-the-art (as of 2014) approaches introduced by Gillick and Favre for the ILP strategy, and Filippova for the sentence fusion.
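To make the selection step more concrete, here is a minimal sketch of the kind of objective a concept-based ILP optimizes: cover as much weighted bigram mass as possible within a length budget. It is not potara's code. It replaces the ILP solver with a brute-force search over subsets, which is only practical for a handful of sentences, and the names `bigrams`, `best_subset`, `weights` and `max_words` are hypothetical.

```python
from itertools import combinations


def bigrams(sentence):
    """Return the set of lowercase word bigrams in a sentence."""
    words = sentence.lower().split()
    return set(zip(words, words[1:]))


def best_subset(sentences, weights, max_words):
    """Exhaustively pick the subset of sentences that maximizes the total
    weight of covered bigrams while staying within max_words.
    (Brute force stand-in for an ILP solver; exponential in len(sentences).)"""
    best, best_score = (), 0.0
    for r in range(1, len(sentences) + 1):
        for subset in combinations(range(len(sentences)), r):
            length = sum(len(sentences[i].split()) for i in subset)
            if length > max_words:
                continue
            # Each bigram counts once, however many chosen sentences contain it.
            covered = set().union(*(bigrams(sentences[i]) for i in subset))
            score = sum(weights.get(b, 0) for b in covered)
            if score > best_score:
                best, best_score = subset, score
    return [sentences[i] for i in best], best_score


# Illustrative data only.
sentences = [
    "the rocket launched on monday",
    "the rocket launched from florida",
    "engineers reuse the booster",
]
weights = {
    ("the", "rocket"): 3,
    ("rocket", "launched"): 3,
    ("reuse", "the"): 2,
    ("the", "booster"): 2,
}
summary, score = best_subset(sentences, weights, max_words=10)
print(summary, score)
```

Because a bigram is only counted once, picking the two near-duplicate "rocket" sentences gains nothing over picking one of them, so the search prefers one "rocket" sentence plus the "booster" sentence.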

It is compatible with, and tested on, Python 3.5 and 3.6.

Install

The easy way

You should be able to install potara and its dependencies with pip:

pip install potara

You can also clone this repo and use the requirements.txt file to install the dependencies.

Further requirements

You will also need GLPK, which is used to obtain an optimal summary (example for Debian-based distros):

$ sudo apt-get install glpk

For Ubuntu-based distros you can use:

$ sudo apt-get install libglpk40

You can check that the install ran successfully by cloning the repo and running:

$ python setup.py test

If you have issues with the install, you can check the .travis.yml file of the repo, which corresponds to a working build.

How To

Basically, you can use the following:

from potara.summarizer import Summarizer
from potara.document import Document

s = Summarizer()

# Adding docs, preprocessing them and computing some infos for the summarizer
s.setDocuments([Document('data/' + str(i) + '.txt')
                for i in range(1,10)])
       
# Summarizing, where the actual work is done
s.summarize()

# You can then print the summary
print(s.summary)
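If your files are not numbered consecutively, one possible way to build the list of paths is to collect every .txt file in a directory. This is a sketch using only the standard library; `collect_txt_paths` is a hypothetical helper, not part of potara, and it assumes `Document` accepts a path string as in the example above.

```python
from pathlib import Path


def collect_txt_paths(directory):
    """Return the .txt files found directly in `directory` as sorted path strings.

    Sorting is lexicographic, so "10.txt" comes before "2.txt".
    """
    return sorted(str(p) for p in Path(directory).glob("*.txt"))


# Possible usage with the objects from the example above (assumption):
# paths = collect_txt_paths("data")
# s.setDocuments([Document(p) for p in paths])
```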

There's some preprocessing involved and a sentence fusion step, but I made it easily tunable. Preprocessing may take a while (a few minutes) since there is a lot going on under the hood. Default parameters are currently set for summarizing ~10 documents. You can summarize a smaller number of documents by tweaking the "minbigramcount" parameter of the summarizer:

s = Summarizer(minbigramcount=2)

Summarizing fewer than 4 documents would probably yield a bad summary.
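To see why the number of documents matters, here is a rough sketch of a bigram frequency threshold. It assumes that "minbigramcount" acts as a minimum number of documents in which a bigram must appear before it counts as a concept; that reading is an assumption, so check the potara source for the exact behaviour. The function names are hypothetical.

```python
from collections import Counter


def document_bigrams(text):
    """Return the set of lowercase word bigrams occurring in one document."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def frequent_bigrams(documents, min_count):
    """Count in how many documents each bigram occurs and keep only the
    bigrams seen in at least `min_count` documents."""
    counts = Counter()
    for text in documents:
        counts.update(document_bigrams(text))
    return {bigram: n for bigram, n in counts.items() if n >= min_count}


# Illustrative data only.
docs = [
    "spacex plans to produce fuel",
    "spacex plans a new launch",
    "nasa plans to produce fuel",
]
print(frequent_bigrams(docs, 2))      # bigrams shared by at least two documents
print(frequent_bigrams(docs, 3))      # nothing is shared by all three
print(frequent_bigrams(docs[:1], 2))  # a single document can never reach 2
```

Under this assumption, a threshold higher than the number of input documents leaves no concepts at all, which would leave the optimizer with nothing to select.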

Similarity models

Potara relies on similarity scores between sentences. These scores can be shallow, using cosine similarity, or "deep", using gensim's Word2Vec semantic representations of words. For the second use case, you'll want to train your own model or use pretrained models. You may contact me if you want to use potara that way, and I may create a tutorial on the matter for the occasion.
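As an illustration of the shallow option, here is a minimal bag-of-words cosine similarity between two sentences. It is a generic sketch, not potara's implementation: it uses whitespace tokenization and raw term counts, whereas potara's preprocessing (stemming, stop words) may differ.

```python
import math
from collections import Counter


def cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity between the word-count vectors of two sentences."""
    a = Counter(sentence_a.lower().split())
    b = Counter(sentence_b.lower().split())
    # Dot product over the words the two sentences share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    # An empty sentence has a zero norm; treat it as having no similarity.
    return dot / norm if norm else 0.0


print(cosine_similarity("the cat sat", "the cat ran"))
print(cosine_similarity("the cat sat", "dogs bark loudly"))
```

Sentences that share no words score 0, which is the limitation a Word2Vec-based score is meant to address.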

potara's People

Contributors

dependabot[bot], jbargu, sildar

potara's Issues

Need Help on Importing Docs

Hi. I'm interested in trying potara. I've cloned it & imported stop words. Can you elaborate on exactly where the documents go? Do I have to create a particular directory? Looks like they have to be in a .txt format. What about inside that txt file...any formatting requirements? What about the title & article executive summary that usually comes with an article? Can you provide an example with real documents?

Thanks! Ty

Paper

Can you please send me the reference paper for this code?
Thanks

ModuleNotFoundError UserDict

Hi,
I tried to use this package but when I run the sample code I get this error:

Traceback (most recent call last):
  File "C:\Users\xxx\scripts\multi_summarizer.py", line 4, in <module>
    from summarizer import Summarizer
  File "C:\Users\xxx\scripts\summarizer.py", line 22, in <module>
    import gensim
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python37\lib\site-packages\gensim\__init__.py", line 7, in <module>
    from gensim import utils, matutils, interfaces, corpora, models, similarities
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python37\lib\site-packages\gensim\corpora\__init__.py", line 12, in <module>
    from .dictionary import Dictionary
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python37\lib\site-packages\gensim\corpora\dictionary.py", line 22, in <module>
    import UserDict
ModuleNotFoundError: No module named 'UserDict'

How can I solve this?

Python version used:
Python 3.7.0b4

Not printing the summary.

I have tried using the summarizer with the following text saved in the file data/1.txt:
https://gist.github.com/JafferWilson/50f3cb96beb5e78a6d731abc8ba1bc82

I have used the following command for running the program:

>>> from summarizer import Summarizer
>>> import document
>>> s = Summarizer()
>>> s.setDocuments([document.Document('data/1.txt')])
>>> s.summarize()
GLPSOL: GLPK LP/MIP Solver, v4.57
Parameter(s) specified in the command line:
 --cpxlp /tmp/19846-pulp.lp -o /tmp/19846-pulp.sol
Reading problem data from '/tmp/19846-pulp.lp'...
11 rows, 7 columns, 20 non-zeros
6 integer variables, all of which are binary
25 lines were read
GLPK Integer Optimizer, v4.57
11 rows, 7 columns, 20 non-zeros
6 integer variables, all of which are binary
Preprocessing...
4 hidden packing inequaliti(es) were detected
8 rows, 6 columns, 16 non-zeros
6 integer variables, all of which are binary
Scaling...
 A: min|aij| =  1.000e+00  max|aij| =  1.000e+00  ratio =  1.000e+00
Problem data seem to be well scaled
Constructing initial basis...
Size of triangular part is 8
Solving LP relaxation...
GLPK Simplex Optimizer, v4.57
8 rows, 6 columns, 16 non-zeros
*     0: obj =  -0.000000000e+00 inf =   0.000e+00 (0)
OPTIMAL LP SOLUTION FOUND
Integer optimization begins...
+     0: mip =     not found yet <=              +inf        (1; 0)
+     0: >>>>>   0.000000000e+00 <=   0.000000000e+00   0.0% (1; 0)
+     0: mip =   0.000000000e+00 <=     tree is empty   0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
Time used:   0.0 secs
Memory used: 0.0 Mb (51924 bytes)
Writing MIP solution to '/tmp/19846-pulp.sol'...
>>> print s.summary
[]

As you can see, the summary is empty. Please help me if you have a solution for this problem. Anyone?

Loading documents

When I create a list of documents from the Document class and pass it to the Summarizer, it takes a long time, running into minutes. Furthermore, when I call summarize, it throws an error saying that the documents in the summarizer are actually strings, not instances of the Document class. What's more, for the same number of documents passed to it, len(summarizer.documents) returns different answers.
This is my code:

from potara.summarizer import Summarizer
from potara.document import Document

s = Summarizer()
# my articles are: article1, article2, article3
s.setDocuments([Document('tests/testdata/mytest/article' + str(i))
                for i in range(1, 3)])
s.summarize()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/kenneth/PycharmProjects/potara/potara/summarizer.py", line 405, in summarize
    self.clusterSentences()
  File "/home/kenneth/PycharmProjects/potara/potara/summarizer.py", line 274, in clusterSentences
    for sentence in doc.stemTokens]
AttributeError: 'str' object has no attribute 'stemTokens'

OSError: [WinError 6] The handle is invalid

Hello,
I tried to test the library on 4 documents ["1.txt", "2.txt", "3.txt", "4.txt"] using this code:

from potara.summarizer import Summarizer
from potara.document import Document
import nltk
from nltk import word_tokenize, sent_tokenize

s = Summarizer(minbigramcount=4)

# Adding docs, preprocessing them and computing some infos for the summarizer
s.setDocuments([Document("data/" + str(i) + '.txt')
                for i in range(1, 5)])

# Summarizing, where the actual work is done
s.summarize()

# You can then print the summary
print(s.summary)

But then I got this:

C:\Users\iZeid\Desktop\hatemylife\hate\lib\site-packages\gensim\utils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
C:\Users\iZeid\Desktop\hatemylife\hate\lib\site-packages\pulp\pulp.py:1199: UserWarning: Spaces are not permitted in the name. Converted to '_'
  warnings.warn("Spaces are not permitted in the name. Converted to '_'")
['The biggest news with Elon Musk ’ s tweet is that SpaceX eventually plans to produce fuel this way on Earth.']
Exception ignored in: <function Pool.__del__ at 0x0000017FFF66A550>
Traceback (most recent call last):
  File "c:\python38\lib\multiprocessing\pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "c:\python38\lib\multiprocessing\queues.py", line 365, in put
    self._writer.send_bytes(obj)
  File "c:\python38\lib\multiprocessing\connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "c:\python38\lib\multiprocessing\connection.py", line 280, in _send_bytes
    ov, err = _winapi.WriteFile(self._handle, buf, overlapped=True)
OSError: [WinError 6] The handle is invalid

What should I do to solve this problem?

Notes:

  • I use Windows 10

  • I get the same result in a virtual environment.

  • Data folder contains:

  1. english-left3words-distsim.tagger
  2. stanford-postagger.jar
  3. english.pickle
  4. The 4 documents ["1.txt", "2.txt", "3.txt", "4.txt"]

running the code

Hello,
Can you please tell me how I should run this code? I have installed all the dependencies, but the instructions are not very clear. Should I run the commands in Python?

Regards
Shiju
