
Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages

License: Other


python-boilerpipe's Introduction

python-boilerpipe

A Python wrapper for Boilerpipe, an excellent Java library for boilerplate removal and fulltext extraction from HTML pages.

Configuration

Dependencies:

jpype
chardet

The boilerpipe jar files will get fetched and included automatically when building the package.

Installation

Check out the code:

git clone https://github.com/misja/python-boilerpipe.git
cd python-boilerpipe

virtualenv

virtualenv env
source env/bin/activate
pip install -r requirements.txt
python setup.py install

Fedora

sudo dnf install -y python2-jpype
sudo python setup.py install

Usage

Be sure to have set JAVA_HOME properly since jpype depends on this setting.

The constructor takes a keyword argument, extractor, which must be one of the available boilerpipe extractor types:

DefaultExtractor
ArticleExtractor
ArticleSentencesExtractor
KeepEverythingExtractor
KeepEverythingWithMinKWordsExtractor
LargestContentExtractor
NumWordsRulesExtractor
CanolaExtractor

If no extractor is passed, DefaultExtractor is used. The input is supplied through one additional keyword argument: either html (the HTML text) or url.

from boilerpipe.extract import Extractor
extractor = Extractor(extractor='ArticleExtractor', url=your_url)

Then, to extract relevant content:

extracted_text = extractor.getText()

extracted_html = extractor.getHTML()

For KeepEverythingWithMinKWordsExtractor you can also pass the kMin parameter, which currently defaults to 1:

extractor = Extractor(extractor='KeepEverythingWithMinKWordsExtractor', url=your_url, kMin=20)
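
A small sketch of how the arguments above fit together, written so that it runs without a JVM. `build_extractor_kwargs` and `EXTRACTORS` are hypothetical names, not part of python-boilerpipe, and the rule that exactly one of `url` or `html` is given is an assumption drawn from the description above.

```python
# Extractor names as listed in the Usage section above.
EXTRACTORS = (
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "KeepEverythingExtractor",
    "KeepEverythingWithMinKWordsExtractor",
    "LargestContentExtractor",
    "NumWordsRulesExtractor",
    "CanolaExtractor",
)


def build_extractor_kwargs(extractor="DefaultExtractor", url=None, html=None, kMin=None):
    """Hypothetical helper: validate arguments and return kwargs for Extractor()."""
    if extractor not in EXTRACTORS:
        raise ValueError("unknown extractor: %r" % extractor)
    # Assumption: exactly one input source is given. An empty HTML string
    # still counts as "given", so compare against None rather than truthiness.
    if (url is None) == (html is None):
        raise ValueError("pass exactly one of url= or html=")
    kwargs = {"extractor": extractor}
    if url is not None:
        kwargs["url"] = url
    else:
        kwargs["html"] = html
    # kMin is only documented for KeepEverythingWithMinKWordsExtractor.
    if kMin is not None:
        if extractor != "KeepEverythingWithMinKWordsExtractor":
            raise ValueError("kMin is only used by KeepEverythingWithMinKWordsExtractor")
        kwargs["kMin"] = int(kMin)
    return kwargs


if __name__ == "__main__":
    print(build_extractor_kwargs("ArticleExtractor", url="http://example.com/"))
```

The resulting dictionary could then be passed on as `Extractor(**kwargs)`.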


python-boilerpipe's Issues

Hi,
I am trying to use boilerpipe.extract in Canopy.
I already followed these instructions to set the JAVA_HOME (https://confluence.atlassian.com/display/DOC/Setting+the+JAVA_HOME+Variable+in+Windows).
Also I installed boilerpipe, charade and jpype1 using the Canopy command.

Nevertheless, when I run the code I get the error: “ImportError: No module named jpype”
I have tried installing jpype but get this message:

Is it safe to allow such installation?
Is there another way to get the boilerpipe.extract to run on Canopy?

Does anyone use KeepEverythingWithMinKWordsExtractor ?

Hi and thanks for your work !

All extractors work perfectly except KeepEverythingWithMinKWordsExtractor, which seems expected because I believe we should pass an extra argument for the minimum keyword length. Here is the error thrown:

Traceback (most recent call last):
  File "python-boilerpipe.py", line 129, in <module>
    main(sys.argv[1:])
  File "python-boilerpipe.py", line 49, in main
    text = extract_text_html(html, extractor, outputtype)
  File "python-boilerpipe.py", line 102, in extract_text_html
    extractor = Extractor(extractor=extractor, html=html)
  File "/usr/local/lib/python2.7/dist-packages/boilerpipe/extract/__init__.py", line 67, in __init__
    "de.l3s.boilerpipe.extractors."+extractor).INSTANCE
AttributeError: type object 'de.l3s.boilerpipe.extractors.KeepEverythingWithMin' has no attribute 'INSTANCE'

Does anyone have the syntax to launch this extractor? Thanks!

"not a gzip file" error

I just get a ReadError when I install this package through pip. Can anyone help? Thanks!

$ pip install -r requirements.txt
Collecting JPype1 (from -r requirements.txt (line 1))
  Downloading JPype1-0.6.2.tar.gz (147kB)
    100% |████████████████████████████████| 153kB 1.1MB/s 
Collecting chardet (from -r requirements.txt (line 2))
  Using cached chardet-3.0.4-py2.py3-none-any.whl
Building wheels for collected packages: JPype1
  Running setup.py bdist_wheel for JPype1 ... done
  Stored in directory: 
Successfully built JPype1
Installing collected packages: JPype1, chardet
Successfully installed JPype1-0.6.2 chardet-3.0.4
$ python setup.py install
Traceback (most recent call last):
  File "setup.py", line 26, in <module>
    download_jars(datapath=DATAPATH)
  File "setup.py", line 20, in download_jars
    tar = tarfile.open(tgz_name, mode='r:gz')
  File "/users/pxie1/miniconda2/lib/python2.7/tarfile.py", line 1693, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/users/pxie1/miniconda2/lib/python2.7/tarfile.py", line 1751, in gzopen
    raise ReadError("not a gzip file")
tarfile.ReadError: not a gzip file

I wonder how to fix it. I tried installing some other Python packages such as tqdm, and they installed smoothly.
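
One way to narrow this down is to look at what was actually downloaded: `tarfile` reports "not a gzip file" when the file does not begin with the gzip magic bytes. The sketch below uses only the standard library. `looks_like_gzip` and `describe_download` are hypothetical helper names, and the idea that the download returned something other than the archive (for example an error page) is an assumption, not something the log above confirms.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def describe_download(path, preview=80):
    """Summarise a downloaded file: kind, size, and its first few bytes."""
    with open(path, "rb") as fh:
        head = fh.read(preview)
    kind = "gzip" if head[:2] == GZIP_MAGIC else "not gzip"
    return "%s (%d bytes): %r" % (kind, os.path.getsize(path), head)
```

Running `describe_download` on the archive that setup.py fetched shows whether it is a real tarball or, say, an HTML page.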

Image Extraction does not work

Hello, the command:

extractor.getImages() leads to the following error:

Traceback (most recent call last):
  File "", line 1, in
  File "/Library/Python/2.7/site-packages/boilerpipe-1.2.0-py2.7.egg/boilerpipe/extract/__init__.py", line 74, in getImages
    "de.l3s.boilerpipe.sax.ImageExtractor").INSTANCE
  File "/Library/Python/2.7/site-packages/jpype/_jclass.py", line 54, in JClass
    raise _RUNTIMEEXCEPTION.PYEXC("Class %s not found" % name)
jpype._jexception.ExceptionPyRaisable: java.lang.Exception: Class de.l3s.boilerpipe.sax.ImageExtractor not found

Any ideas?

some urls will not work with celery

Hi,

I have a rather urgent problem, for which I hope you can help me,
I'm trying to parse urls/html via boilerpipe and celery. Straightforward stuff, giving a task to a celery worker. However some links work, some don't.
If I call call_txt_extr, url: 'http://t.co/XIDUuUIjPi' will not work and disappears in a "soft" followed by a "hard" timeout in celery.
If I do the same thing with url 'http://www.rezmanagement.nl' it works perfectly.

code:

from celery import Celery

from boilerpipe.extract import Extractor
from harvest.celery import app
app.config_from_object('harvest.celeryconfig')

def call_txt_extr():
    Extract_Text.soft_time_limit = 10
    Extract_Text.time_limit = 15
    Extract_Text.apply_async()

@app.task
def Extract_Text():
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl/'
    extractorType = "DefaultExtractor"
    # Extractor(extractor=extractorType, url=URL)
    print Extractor(extractor=extractorType, url=URL).getText()
    return

I've tried everything short of editing the Java code and found the following:

The task (boilerpipe) stops working at around line 70 of the Extractor (__init__.py):
"self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()"
It simply never returns the parsed text, and then the task times out.
Please understand that it works perfectly with some URLs within celery, while others time out.
If I remove the celery decorator (so the task is no longer executed by celery), it works perfectly, so the URL itself is fine (Extractor can deal with the HTML, etc.).
If I define a celery class, configure the task to inherit from that class, and run the Extractor call from the class, it works in celery.
However, this is not the way the Extractor should be called. Furthermore, since the Extractor needs input, I would be polling the same URL at every function call, which is highly unwanted and not how it is supposed to work.

So: this works, but is not good code and highly unwanted I think:

class taskclass(celery.Task):
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl'
    extractorType = "DefaultExtractor"
    print Extractor(extractor=extractorType, url=URL).getText()

def call_txt_extr():
    Extract_Text.soft_time_limit = 10
    Extract_Text.time_limit = 15
    Extract_Text.apply_async()

@app.task(base=taskclass)
def Extract_Text():
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl/'
    extractorType = "DefaultExtractor"
    # Extractor(extractor=extractorType, url=URL)
    print Extractor(extractor=extractorType, url=URL).getText()
    return

updated JPype1
updated nekohtml
cannot find any other instance of this on the internet.

I hope you can help me,

Kindest regards,

Roland Zoet

Empty html causes exception from Extractor

I've noticed an inconsistency in the argument checking of the Extractor class that causes an exception when an empty string is passed as the html argument. The offending code is kwargs.get('html') (https://github.com/misja/python-boilerpipe/blob/master/src/boilerpipe/extract/__init__.py#L48). It does not merely check that the argument is present; it also evaluates the value as a bool, and a zero-length string is falsy. An exception is therefore raised even though the argument was supplied.
I think the correct way to test for the kwarg is 'html' in kwargs, and an empty html should be valid.
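
The difference between the two checks can be shown in isolation. This is a minimal illustration with hypothetical function names; it does not reproduce the actual Extractor code.

```python
def has_html_truthy(**kwargs):
    # Pattern described in the issue: an empty string is falsy, so it is
    # treated as if no html argument had been passed at all.
    return bool(kwargs.get("html"))


def has_html_membership(**kwargs):
    # Proposed check: only asks whether the key was supplied.
    return "html" in kwargs


if __name__ == "__main__":
    print(has_html_truthy(html=""))      # False: empty input looks "missing"
    print(has_html_membership(html=""))  # True: empty input is still input
```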

python-boilerpipe setup.py fails using Python3

setup.py fails due to changes in urllib package and unicode() function.
With the following changes, build succeeds on Windows8.1/Cygwin and Mac OS X 10.9.5.
Other software: Oracle JDK 8u20, latest build of JPype1-py3.
Python 3 changes to python-boilerpipe-master/src/boilerpipe/extract/__init__.py:
import urllib.request # line 2
request = urllib.request.Request(kwargs['url'], headers=self.headers) # line 35
connection = urllib.request.urlopen(request) # line 36
self.data = str(self.data, encoding) # line 41
self.data = str(self.data, charade.detect(self.data)['encoding']) # line 45

Import Error in boilerpipe python 2.7 . JVM isn't starting

My Java path is correct and all packages required by boilerpipe are installed, but I still can't understand the issue. Here is the error:

import boilerpipe

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in
    import boilerpipe
  File "C:\Python27\lib\site-packages\boilerpipe\__init__.py", line 10, in
    jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=%s" % os.pathsep.join(jars))
  File "C:\Python27\Lib\site-packages\JPype1-0.6.0-py2.7-win32.egg\jpype\_core.py", line 50, in startJVM
    _jpype.startup(jvm, tuple(args), True)
RuntimeError: Unknown Exception

difference with java version

Hi,
I have a question: if I extract text from "http://www.ilsole24ore.com/art/finanza-e-mercati/2013-04-14/borsa-luce-telecom-attesa-165343.shtml?uuid=AbWL47mH" with the Java version of boilerpipe and compare it with the result of the Python version, the two differ. Why? I used http://boilerpipe-web.appspot.com/ as the Java implementation. Thanks

Segmentation fault (core dumped) on import

I get a "Segmentation fault (core dumped)" whenever I try to import boilerpipe. It suddenly started happening in all my installations, which were working fine just a few hours earlier. The segmentation fault occurs just by typing "import boilerpipe" on Ubuntu 16.04.2 LTS machines. It seems to work fine on MacOS 10.11. I tried doubling the JVM heap memory, but that did not help. Any suggestions?

konlpy & boilerpipe mutually creates exceptions

Hi @misja.
I'm the maintainer of konlpy, a Korean NLP package for Python.

A konlpy issue recently reported that using konlpy together with python-boilerpipe raises exceptions (konlpy/konlpy#66), and I figured out that it is because both packages use JPype1.
Specifically, boilerpipe initializes the JVM on import while konlpy initializes it on class method calls, so the reported case failed in konlpy because the JVM did not have the classpaths konlpy needed. (Conversely, if you run konlpy first and then import boilerpipe, boilerpipe raises the exception.)

I saw on the project README that you no longer wanted to maintain python-boilerpipe, but was wondering if we could discuss how this problem could be easily fixed. Then maybe I can create a patch for both libraries. Thanks.
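
One possible direction, sketched under assumptions: start the JVM lazily through a single guard instead of at import time, and report whether this call was the one that started it. `ensure_jvm` is a hypothetical helper, not code from either library. It relies only on the jpype calls that already appear in boilerpipe's own startup code (`isJVMStarted`, `startJVM`, `getDefaultJVMPath`), and it accepts a stand-in object so the logic can be exercised without Java installed.

```python
import os


def ensure_jvm(jars, jvm=None):
    """Start the JVM once with `jars` on the classpath.

    Returns True if this call started the JVM, False if it was already
    running. `jvm` is any object exposing isJVMStarted(), startJVM() and
    getDefaultJVMPath(); it defaults to the jpype module.
    """
    if jvm is None:
        import jpype as jvm  # imported lazily, only when really needed
    if jvm.isJVMStarted():
        # Another package started the JVM first, so the classpath given
        # here was not applied; the caller has to handle that case.
        return False
    classpath = os.pathsep.join(jars)
    jvm.startJVM(jvm.getDefaultJVMPath(), "-Djava.class.path=%s" % classpath)
    return True
```

A `False` return tells the caller that the required jars may be missing from the running JVM, which is the situation described above. How both libraries should agree on a shared classpath is left open here.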

Error: Process finished with exit code -1073741819 (0xC0000005)

I'm getting the following error while trying the provided sample code in PyCharm CE. Please advise how to resolve this.

Process finished with exit code -1073741819 (0xC0000005)

During debugging I found that execution gets stuck in __init__.py at line 13:

InputSource        = jpype.JClass('org.xml.sax.InputSource')

I'm using the following sample code:

from boilerpipe.extract import Extractor

extractor = Extractor(extractor='ArticleExtractor', url='http://edition.cnn.com/2017/05/31/asia/kabul-explosion-hits-diplomatic-area/index.html')
extractor = Extractor(extractor='KeepEverythingWithMinKWordsExtractor', url=your_url, kMin=20)

extracted_text = extractor.getText()
extracted_html = extractor.getHTML()

Thanks.

kernel crashes

The moment we import Extractor, the python 2.7 kernel crashes.

I use python-boilerpipe on win10 but it doesn't work

Hello! I installed the dependencies jpype and chardet in my Anaconda Python (version 3.6), and I also installed python-boilerpipe there.
My JAVA_HOME is C:\Program Files (x86)\Java\jdk1.7.0_55
But when I run code such as:
from boilerpipe.extract import Extractor

print("Hello world")

my program crashes:
D:\anaconda\python.exe "D:\PyCharm 2017.1.5\helpers\pydev\pydevd.py" --multiproc --qt-support=auto --client 127.0.0.1 --port 16219 --file D:/python/testboilerpipe/test.py
pydev debugger: process 8500 is connecting

Connected to pydev debugger (build 172.3968.37)

Process finished with exit code -1073741819 (0xC0000005)

I don't understand why. Can anyone tell me the reason? Thank you!

extractor.getImage raises an exception

Arch Linux, OpenJDK7, Python 2.7.5

(Is this an issue only with OpenJDK 7?)

extractor = Extractor(extractor='ArticleExtractor', url='http://www.faz.net/aktuell/wissen/physik-chemie/digitale-vernetzung-die-masse-macht-s-11916683.html')
html = extractor.getHTML()
images = extractor.getImages()

java.lang.ExceptionPyRaisable Traceback (most recent call last)
in ()
----> 1 images = extractor.getImages()

/usr/lib/python2.7/site-packages/boilerpipe/extract/__init__.pyc in getImages(self)
72 def getImages(self):
73 extractor = jpype.JClass(
---> 74 "de.l3s.boilerpipe.sax.ImageExtractor").INSTANCE
75 images = extractor.process(self.source, self.data)
76 jpype.java.util.Collections.sort(images)

/usr/lib/python2.7/site-packages/jpype/_jclass.pyc in JClass(name)
51 jc = _jpype.findClass(name)
52 if jc is None :
---> 53 raise _RUNTIMEEXCEPTION.PYEXC("Class %s not found" % name)
54
55 return _getClassFor(jc)

java.lang.ExceptionPyRaisable: java.lang.Exception: Class de.l3s.boilerpipe.sax.ImageExtractor not found

LookupError when giving url as one that is already saved on the disk (file:///)

Hello,
Firstly thank you for python-boilerpipe.
When I use wget to get the page http://www.flipkart.com/dell-xps-13-laptop-2nd-gen-ci7-4gb-256gb-ssd-win7-hp/p/itmdg387gmhzhx3m and save it to disk, then try to open it with python-boilerpipe using the code

from boilerpipe.extract import Extractor
extractor = Extractor(extractor='DefaultExtractor', url="file:///home/code/code/opinion/boilerpipe-binary/itmdg387gmhzhx3m")
#extractor = Extractor(extractor='DefaultExtractor', url="http://www.flipkart.com/dell-xps-13-laptop-2nd-gen-ci7-4gb-256gb-ssd-win7-hp/p/itmdg387gmhzhx3m")
extracted_text = extractor.getText()
extracted_html = extractor.getHTML()
print extracted_html

I get the following error

Traceback (most recent call last):
  File "htmlExtractor.py", line 2, in <module>
    extractor = Extractor(extractor='DefaultExtractor', url="file:///home/code/code/opinion/boilerpipe-binary/itmdg387gmhzhx3m")
  File "/usr/local/lib/python2.7/dist-packages/boilerpipe/extract/__init__.py", line 41, in __init__
    self.data = unicode(self.data, encoding)
LookupError: unknown encoding: text/plain

I have already setup a spider with scrapy, so processing files on the disk is very important for me.

Warm regards,
Harish Badrinath
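
The traceback suggests that the value used as the encoding came from the response's Content-Type, which for a local file carries no charset. A defensive sketch, using only the standard library, is to accept a charset only if the header really names one and Python knows the codec. `charset_from_content_type` is a hypothetical helper, and how the Extractor derives its encoding is an assumption here.

```python
import codecs
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Return a usable codec name from a Content-Type value, or `default`."""
    msg = Message()
    msg["Content-Type"] = content_type or ""
    charset = msg.get_param("charset")  # None when there is no charset=
    if not isinstance(charset, str) or not charset:
        return default
    try:
        # Normalise the name and make sure Python can actually decode with it.
        return codecs.lookup(charset).name
    except LookupError:
        return default


if __name__ == "__main__":
    print(charset_from_content_type("text/plain", default="utf-8"))
    print(charset_from_content_type("text/html; charset=UTF-8"))
```

With a fallback such as `default="utf-8"`, a header of plain `text/plain` no longer ends up being passed to the decoder as if it were a codec name.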

Cannot run `python setup.py`

I am using Anaconda with Python 2.7.

I also tried pip install setup.py after reading this:

http://stackoverflow.com/questions/8295644/pypi-userwarning-unknown-distribution-option-install-requires

But got:

Could not find a version that satisfies the requirement setup.py (from versions: ) No matching distribution found for setup.py

Encoding Issues - UnicodeDecodeError: 'utf8' codec can't decode byte

Hey guys,

First of all thanks for python-boilerpipe

Trying to use Boilerpipe but can't extract properly some documents...

from boilerpipe.extract import Extractor
extractorType="DefaultExtractor"
sourceUrl = 'http://www.indiatimes.com/news/india/arvind-kejriwal-to-seek-political-sanyas-127620.html'
extractor = Extractor(extractor=extractorType, url=sourceUrl)
Traceback (most recent call last):
  File "", line 1, in
  File "/Library/Python/2.7/site-packages/boilerpipe/extract/__init__.py", line 41, in __init__
    self.data = unicode(self.data, encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 53647: invalid start byte

The document seems to contain some non-UTF-8 bytes that cannot be decoded. Is there any workaround for the problem?
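
A possible workaround sketch: decode the bytes yourself and pass the result through the html argument instead of url. Byte 0x91 is a left single quotation mark in Windows-1252, so the example tries that codec as a fallback; whether this particular page is really Windows-1252 is an assumption. `decode_html` is a hypothetical helper.

```python
def decode_html(data, declared="utf-8", fallbacks=("cp1252",)):
    """Decode bytes, trying `declared` first and then each fallback codec."""
    for name in (declared,) + tuple(fallbacks):
        try:
            return data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: never raise; undecodable bytes become U+FFFD.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    raw = b"He said \x91hello\x92"
    print(decode_html(raw))
```

The decoded text could then be handed over as `Extractor(extractor='DefaultExtractor', html=text)`.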

java.lang.OutOfMemoryError: Java heap space after multiple getHTML calls

I need to extract article bodies from raw htmls. My code is as simple as:

from boilerpipe.extract import Extractor

for html in htmls:
    extractor = Extractor(extractor='ArticleExtractor', html=html)
    extractor.getHTML()

After calling it many times (e.g. 10K), I get a java.lang.OutOfMemoryError:

Traceback (most recent call last):
  File "test.py", line 228, in <module>
    extractor.getHTML()
  File "/Users/macuser/.virtualenvs/bro/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 70, in getHTML
    return highlighter.process(self.source, self.data)
jpype._jexception.OutOfMemoryErrorPyRaisable: java.lang.OutOfMemoryError: Java heap space

I looked into the code and it looks like creating BoilerpipeSAXInput, HTMLHighlighter and other java instances causes this problem. Is there a way to fix this issue?

To reproduce this without 10K articles, simply reduce the heap size in boilerpipe.__init__:

import imp
import os

import jpype

MAX_JVM_HEAP_SIZE_MBYTES = 4

if not jpype.isJVMStarted():
    # Collect every jar shipped in the package's data directory.
    jars = []
    for top, dirs, files in os.walk(imp.find_module('boilerpipe')[1]+'/data'):
        for nm in files:
            jars.append(os.path.join(top, nm))

    # Cap the heap size and put the collected jars on the classpath.
    jvm_args = [
        '-Xmx%dM' % MAX_JVM_HEAP_SIZE_MBYTES,
        "-Djava.class.path=%s" % os.pathsep.join(jars)
    ]
    jpype.startJVM(jpype.getDefaultJVMPath(), *jvm_args)

Implementing the JSON method?

This could be pretty cool, and I see it's available in boilerpipe itself.

JRE sez: SIGSEGV (0xb)

I'm getting this when running my python project that uses DefaultExtractor

spz@terra-mob:~$ export JAVA_HOME=/usr/lib/jvm/jdk1.7.0/
spz@terra-mob:~$ anchorbot/anchorbot.linux -v
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f30f3bf7506, pid=13428, tid=139848018867968
#
# JRE version: 7.0-b147
# Java VM: Java HotSpot(TM) 64-Bit Server VM (21.0-b17 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [_jpype.so+0x56506]  JPJavaEnv::FindClass(char const*)+0x36
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/spz/hs_err_pid13428.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#

Running it in python's REPL works fine though.

Issue python-boilerpipe on docker

I am trying to use python-boilerpipe in Docker, but the code blocks at this line:

extractor = Extractor(extractor='ArticleExtractor',url=link, headers=self.headers)

It never returns, whereas outside Docker it works fine.

My Dockerfile looks like this:

FROM tiangolo/uwsgi-nginx-flask:python3.6

RUN pip3 install --upgrade pip
# copy over our requirements.txt file
COPY requirements.txt /tmp/
WORKDIR /tmp/

# Install OpenJDK-11
# Install "software-properties-common" (for the "add-apt-repository")
RUN apt-get update
RUN apt-get install -y software-properties-common 
RUN add-apt-repository ppa:openjdk/ppa
RUN apt-get install -y openjdk-11-jdk && \
    apt-get install -y ant && \
    apt-get clean;

# Fix certificate issues
RUN apt-get install ca-certificates-java && \
    apt-get clean && \
    update-ca-certificates -f;

# Setup JAVA_HOME -- useful for docker commandline
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64/
RUN export JAVA_HOME
RUN export PATH=$PATH:/usr/lib/jvm/java-11-openjdk-amd64/bin
#Check java
RUN echo $JAVA_HOME

# boilerpipe
RUN git --version
RUN git config --global http.sslverify false
RUN git clone https://github.com/misja/python-boilerpipe.git
WORKDIR /tmp/python-boilerpipe/
RUN pip3 install -r requirements.txt
RUN python3 setup.py install



RUN pip3 install -r /tmp/requirements.txt


# copy over our app code
WORKDIR /app
COPY ./app /app


EXPOSE 80/tcp

After some debugging I found that the line that causes this is:

python-boilerpipe/src/boilerpipe/extract/__init__.py

Line 78 in ab3694d

self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()

any idea how to solve this problem ?

getImages no longer works

@misja :
The boilerpipe library 1.2.0 no longer has the ImageExtractor class [de.l3s.boilerpipe.sax.ImageExtractor]
It might be worth removing this feature or commenting it out for now (even though I know you have not included it in the documentation in the Readme.rst).

If you know of another way to retrieve it using python-boilerpipe, I'd be happy to hear because I would like the images.

Cheers

Chardet -> Charade?

Chardet was created by the digitally dead Mark Pilgrim. An active fork of this software continues to exist under the name Charade.

Import problem

Apologies for this somewhat novice question but when I type:

from boilerpipe.extract import Extractor

I get the following output:

Traceback (most recent call last):
  File "", line 1, in
  File "build\bdist.win32\egg\boilerpipe\extract\__init__.py", line 12, in
  File "C:\Python26\lib\site-packages\jpype\_jclass.py", line 54, in JClass
    raise _RUNTIMEEXCEPTION.PYEXC("Class %s not found" % name)
jpype._jexception.ExceptionPyRaisable: java.lang.Exception: Class de.l3s.boilerpipe.sax.HTMLHighlighter not found

What might be causing this error and how could I fix it?

UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 20: ordinal not in range(128)

Hi,
when running this code on my Ubuntu 12.04 micro-instance:

#!/usr/bin/python

import boilerpipe

from boilerpipe.extract import Extractor
extractor = Extractor(extractor='ArticleExtractor', url="http://europe.wsj.com/home-page")
extracted_text = extractor.getText()
print extracted_text
extracted_html = extractor.getHTML()

I get this error:
python boilerpipeTrial.py
Traceback (most recent call last):
  File "boilerpipeTrial.py", line 9, in
    print extracted_text
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 20: ordinal not in range(128)

where line 9 is: print extracted_text

Would you please give me some hints on how to solve it?

Kind regards.
Marco
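
The error comes from printing text that contains a non-ASCII character (here u'\xbb') to a stream whose encoding is ASCII or unknown. A common workaround is to encode explicitly before writing. The sketch below uses a hypothetical helper, `encode_for_stream`, and should behave the same on Python 2.7 and Python 3 as long as `text` is a unicode string.

```python
import sys


def encode_for_stream(text, stream=None, fallback="utf-8"):
    """Encode `text` for `stream` without raising on unencodable characters."""
    stream = stream if stream is not None else sys.stdout
    # Piped or redirected streams may report no encoding at all.
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding, errors="replace")


if __name__ == "__main__":
    data = encode_for_stream(u"Europe \xbb Home")
    # Write bytes directly so no implicit ASCII conversion can happen.
    getattr(sys.stdout, "buffer", sys.stdout).write(data + b"\n")
```

In the script above, the simplest form of the same idea would be `print extracted_text.encode('utf-8')`.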

Import Error

I am having problems importing boilerpipe:

In [2]: from boilerpipe.extract import Extractor

RuntimeError Traceback (most recent call last)
in ()
----> 1 from boilerpipe.extract import Extractor

/usr/local/lib/python2.7/dist-packages/boilerpipe/__init__.py in ()
8 for nm in files:
9 jars.append(os.path.join(top, nm))
---> 10 jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=%s" % os.pathsep.join(jars))

/usr/local/lib/python2.7/dist-packages/jpype/_core.pyc in startJVM(jvm, *args)
42
43 def startJVM(jvm, *args) :
---> 44 _jpype.startup(jvm, tuple(args), True)
45 _jclass._initialize()
46 _jarray._initialize()

RuntimeError: Unable to load DLL [/usr/java/jre1.5.0_05/lib/i386/client/libjvm.so], error = /usr/java/jre1.5.0_05/lib/i386/client/libjvm.so: cannot open shared object file: No such file or directory at src/native/common/include/jp_platform_linux.h:45

It seems to have this path to JRE 1.5 hardcoded... I have OpenJDK 7 installed.

jpype._jclass.java.lang.NoClassDefFoundError: java.lang.NoClassDefFoundError: de/l3s/boilerpipe/sax/ImageExtractor

There is no ImageExtractor. How can this be solved?

Installation via pip fails

I am attempting to install boilerpipe on a machine running Ubuntu 12.04 via pip install boilerpipe.

I get the following output:

Downloading/unpacking boilerpipe
Downloading boilerpipe-1.2.0.0.tar.gz (1.3MB): 1.3MB downloaded
Running setup.py egg_info for package boilerpipe

Downloading/unpacking JPype1 (from boilerpipe)
Downloading JPype1-0.5.5.2.tar.gz (143kB): 143kB downloaded
Running setup.py egg_info for package JPype1

Downloading/unpacking charade (from boilerpipe)
Downloading charade-1.0.3.tar.gz (168kB): 168kB downloaded
Running setup.py egg_info for package charade

Installing collected packages: boilerpipe, JPype1, charade
Running setup.py install for boilerpipe

Running setup.py install for JPype1
building '_jpype' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -fPIC -Isrc/native/common/include -Isrc/native/python/include -I/usr/lib/jvm/java-1.7.0-openjdk-i386/include -I/usr/lib/jvm/java-1.7.0-openjdk-i386/include/linux -I/usr/include/python2.7 -c src/native/common/jp_platform_win32.cpp -o build/temp.linux-i686-2.7/src/native/common/jp_platform_win32.o
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -fPIC -Isrc/native/common/include -Isrc/native/python/include -I/usr/lib/jvm/java-1.7.0-openjdk-i386/include -I/usr/lib/jvm/java-1.7.0-openjdk-i386/include/linux -I/usr/include/python2.7 -c src/native/common/jp_referencequeue.cpp -o build/temp.linux-i686-2.7/src/native/common/jp_referencequeue.o
In file included from src/native/common/include/jpype.h:80:0,
from src/native/common/jp_referencequeue.cpp:1:
src/native/common/include/jp_utility.h:20:17: fatal error: jni.h: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1
Complete output from command /usr/bin/python -c "import setuptools;__file__='/tmp/pip_build_root/JPype1/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-mJBmAB-record/install-record.txt --single-version-externally-managed:
running install
running build
running build_py
creating build
creating build/lib.linux-i686-2.7
creating build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_gui.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_jexception.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_jpackage.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_pykeywords.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_darwin.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_core.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_jobject.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_jclass.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_jarray.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_jproxy.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_jcollection.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_linux.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/JClassUtil.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/nio.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_properties.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_jwrapper.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/findjvm.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_windows.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/reflect.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/__init__.py -> build/lib.linux-i686-2.7/jpype
copying src/python/jpype/_refdaemon.py -> build/lib.linux-i686-2.7/jpype
creating build/lib.linux-i686-2.7/jpype/awt
copying src/python/jpype/awt/__init__.py -> build/lib.linux-i686-2.7/jpype/awt
creating build/lib.linux-i686-2.7/jpype/awt/event
copying src/python/jpype/awt/event/WindowAdapter.py -> build/lib.linux-i686-2.7/jpype/awt/event
copying src/python/jpype/awt/event/__init__.py -> build/lib.linux-i686-2.7/jpype/awt/event
creating build/lib.linux-i686-2.7/jpypex
copying src/python/jpypex/__init__.py -> build/lib.linux-i686-2.7/jpypex
creating build/lib.linux-i686-2.7/jpypex/swing
copying src/python/jpypex/swing/AbstractAction.py -> build/lib.linux-i686-2.7/jpypex/swing
copying src/python/jpypex/swing/pyutils.py -> build/lib.linux-i686-2.7/jpypex/swing
copying src/python/jpypex/swing/__init__.py -> build/lib.linux-i686-2.7/jpypex/swing
running build_ext
building '_jpype' extension
creating build/temp.linux-i686-2.7
creating build/temp.linux-i686-2.7/src
creating build/temp.linux-i686-2.7/src/native
creating build/temp.linux-i686-2.7/src/native/common
creating build/temp.linux-i686-2.7/src/native/python
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -fPIC -Isrc/native/common/include -Isrc/native/python/include -I/usr/lib/jvm/java-1.7.0-openjdk-i386/include -I/usr/lib/jvm/java-1.7.0-openjdk-i386/include/linux -I/usr/include/python2.7 -c src/native/common/jp_platform_win32.cpp -o build/temp.linux-i686-2.7/src/native/common/jp_platform_win32.o
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -fPIC -Isrc/native/common/include -Isrc/native/python/include -I/usr/lib/jvm/java-1.7.0-openjdk-i386/include -I/usr/lib/jvm/java-1.7.0-openjdk-i386/include/linux -I/usr/include/python2.7 -c src/native/common/jp_referencequeue.cpp -o build/temp.linux-i686-2.7/src/native/common/jp_referencequeue.o
In file included from src/native/common/include/jpype.h:80:0,
             from src/native/common/jp_referencequeue.cpp:1:
src/native/common/include/jp_utility.h:20:17: fatal error: jni.h: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1

Cleaning up...
Command /usr/bin/python -c "import setuptools;__file__='/tmp/pip_build_root/JPype1/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-mJBmAB-record/install-record.txt --single-version-externally-managed failed with error code 1 in /tmp/pip_build_root/JPype1
Storing complete log in /home/justin/.pip/pip.log

The problem seems to be that it is following this hard-coded path:

src/native/common/include

My JAVA_HOME is pointing to /usr/lib/jvm/java-1.7.0-openjdk-i386

Any help is appreciated.

Thank you
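
The compiler is already being pointed at the include directory under JAVA_HOME (see the -I flags in the log), so a useful first check is whether jni.h actually exists there. It usually ships with the JDK rather than the JRE; that this machine has only a JRE installed is an assumption. `find_jni_header` is a hypothetical helper using only the standard library.

```python
import os


def find_jni_header(java_home):
    """Return the path to include/jni.h under `java_home`, or None."""
    candidate = os.path.join(java_home, "include", "jni.h")
    return candidate if os.path.isfile(candidate) else None


if __name__ == "__main__":
    home = os.environ.get("JAVA_HOME")
    if not home:
        print("JAVA_HOME is not set")
    else:
        found = find_jni_header(home)
        print(found or "no jni.h under %s/include" % home)
```

If the header is missing, installing the JDK development package for the same Java version and re-running the install is the usual next step.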

ImportError: No module named boilerpipe

Hi, getting above error. Setup is Mac OS X 10.6.8, with virtualenv, Python 2.7...

I narrowed it down to the following line in __init__.py:

for top, dirs, files in os.walk(imp.find_module('boilerpipe')[1]+'/data'):

the imp module is having problems finding modules in zipped eggs... (not sure if anyone else has had this problem)

The solution is to set zip_safe=False in setup.py, so the package is installed as an unzipped directory:

setup(
    name = 'boilerpipe',
    version = version,
    packages = find_packages('src'),
    package_dir = {'':'src'},
    install_requires = ['jpype', 'chardet'],
    package_data = {
        'boilerpipe': package_data('boilerpipe')
    },
    author = "Misja Hoebe",
    author_email = "[email protected]",
    description = "Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages",
    zip_safe = False
)
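
As a side note, the data directory can also be located without the imp module (which was removed in Python 3.12). This is only a sketch of an alternative, not the code the package uses: `list_data_files` is a hypothetical helper, and it still needs an unzipped install, because os.walk cannot look inside a zipped egg.

```python
import os


def list_data_files(package_dir):
    """Return the sorted paths of all files under <package_dir>/data."""
    found = []
    for top, dirs, files in os.walk(os.path.join(package_dir, "data")):
        for name in files:
            found.append(os.path.join(top, name))
    return sorted(found)


# Inside a package's __init__.py the package directory could be taken from
# the module's own location instead of imp.find_module:
#   package_dir = os.path.dirname(os.path.abspath(__file__))
```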

Can't find Java runtime even though JPype installed

Running on OS X 10.10 Yosemite. JAVA_HOME is set:
$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home

JPype install completed:
...
Installed /usr/local/lib/python2.7/site-packages/JPype1-0.5.7-py2.7-macosx-10.10-x86_64.egg
Processing dependencies for JPype1==0.5.7
Finished processing dependencies for JPype1==0.5.7

When importing boilerpipe.extract, I got:

from boilerpipe.extract import Extractor
No Java runtime present, requesting install.

Any idea why this happened?

Thanks.

Boilerpipe fails to extract certain urls with 406 Error

I'm trying to extract content from a website, but boilerpipe fails with this error:

File "/local/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 36, in __init__
connection = urllib2.urlopen(request)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 406: Not Acceptable

At first I thought the error was coming from urllib2, so I tried urllib2 on its own, but it worked fine.
These are a few of the URLs where it fails:
http://mlmgod.com/onlinedeals/20-off-on-max-3000-ebay-rs-0-0-other-free-deals-coupons/
http://swapmyapp.com/how-to/use-microsoft-office-for-iphone/
http://www.techeyetech.com/micromax-canvas-tube-specifications-features-price-canvas-tube-palm-theater.html
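
A possible workaround, sketched under assumptions: some servers answer 406 when they dislike the request headers, so one thing to try is fetching the page yourself with explicit User-Agent and Accept headers and handing the HTML to the extractor through its html keyword argument (documented in the README). Whether these sites reject requests for that reason is not confirmed here. The header values and the names `build_request` and `fetch_html` are illustrative, and the sketch uses Python 3's urllib.request rather than urllib2.

```python
import urllib.request

# Illustrative value only; any browser-like string could be used instead.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-fetcher)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Build a request that carries explicit User-Agent and Accept headers."""
    headers = {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_html(url, timeout=10):
    """Download a page and decode it with the declared charset (UTF-8 fallback)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Usage with boilerpipe (requires the package and a working JVM):
#   from boilerpipe.extract import Extractor
#   extractor = Extractor(extractor='ArticleExtractor', html=fetch_html(url))
#   text = extractor.getText()
```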

Release on PyPI

Hi,

I hereby would like to request a release on PyPI.

Pleeease :]

Documentation request (not a bug)

Hello, thanks for this wrapper, this is wonderfully useful. Installing it took me a few minutes, maybe these notes will help someone else (and perhaps could be added to the README?)

JPype is a dependency; pip could not install it (as of Jan 2013), so get it directly from the download link:
http://jpype.sourceforge.net/

JAVA_HOME needs to be configured correctly. On Ubuntu 12.04 I use:
$ export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
and I got to this via:
$ whereis java # -> /usr/bin/java
$ ls -la /usr/bin/java # -> /usr/bin/java -> /etc/alternatives/java
$ ls -la /etc/alternatives/java # -> /etc/alternatives/java -> /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java
and then I chopped off the jre/bin/java part so that JAVA_HOME points to the directory containing ./include, ./lib, ./bin, etc.

I believe this JDK package is needed too: "openjdk-6-jdk (6b24-1.11.5-0ubuntu1~12.04.1)".

Now install JPype:
./JPype-0.5.4.2/ $ python setup.py install
and then install python-boilerpipe:
./python-boilerpipe/ $ python setup.py install
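
The symlink-chasing steps above can be sketched in Python. This is a rough helper, assuming the layout shown in these notes (a java binary under .../bin/java, optionally inside a jre directory); `java_home_from_binary` and `guess_java_home` are hypothetical names.

```python
import os


def java_home_from_binary(resolved_java_path):
    """Derive a JAVA_HOME candidate from the fully resolved java binary path."""
    # Drop the trailing "bin/java".
    home = os.path.dirname(os.path.dirname(resolved_java_path))
    # If the binary lived under a "jre" directory, step up to the JDK root.
    if os.path.basename(home) == "jre":
        home = os.path.dirname(home)
    return home


def guess_java_home(java="/usr/bin/java"):
    """Follow all symlinks from the java launcher, then derive JAVA_HOME."""
    return java_home_from_binary(os.path.realpath(java))
```

The result is only a guess; it is worth confirming that the directory actually contains ./include and ./lib before exporting it.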

boilerpipe hangs in multiprocessing program

When I try to run boilerpipe in a multiprocessing pool, the code hangs after processing some links.

I have tried running the code in normal loops, and that works fine.

unable to find module, is path affected by OS language?

I used your files to install boilerpipe with success:

I had previously created the JAVA_HOME variable under the (x86) Program Files.

When I try to run "from boilerpipe.extract import Extractor" in Canopy I get this error:

Should I change how I defined the JAVA_HOME variable?
Is the path in the code affected by my Windows installation being in Spanish?

Fix simple typo: argment -> argument

Issue Type

[x] Bug (Typo)

Steps to Replicate

Examine README.md.
Search for argment.

Expected Behaviour

Should read argument.

Semi-automated issue generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

To avoid wasting CI processing resources a branch with the fix has been
prepared but a pull request has not yet been created. A pull request fixing
the issue can be prepared from the link below, feel free to create it or
request @timgates42 create the PR. Alternatively if the fix is undesired please
close the issue with a small comment about the reasoning.

https://github.com/timgates42/python-boilerpipe/pull/new/bugfix_typo_argument

Thanks.

A fatal error has been detected by the Java Runtime Environment: SIGSEGV (0xb)

Hey,
python-boilerpipe works perfectly from the console and as a script, but when I try it in my Flask application it breaks. It breaks when I instantiate Extractor and pass the url. This is what I get:

http://pastebin.com/Rhzfh3hE

Initially I thought this problem was coming from jpype, so I raised a ticket there too. It didn't help much:
jpype-project/jpype#22

Environment details

Python - 2.7.3
java version "1.7.0_45"

Flask==0.10.1
JPype1==0.5.4.5
boilerpipe==1.2.0.0

I did see that a similar issue had been raised, but that didn't help much :-/ . Any help will be appreciated. Thanks

Use of socket.setdefaulttimeout in import-level code

It seems to me that changing the global timeout is not a good idea, since it can affect other code that uses networking.
My question is: is it possible to replace the use of socket.setdefaulttimeout with the timeout parameter of urlopen? The parameter has existed since Python 2.6 / 3.0, and JPype requires 2.6 anyway, so it should be possible to use it.
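
For illustration, a per-call timeout could look like the sketch below. It uses Python 3's urllib.request (urllib2.urlopen accepts the same timeout argument on Python 2.6+); `fetch` is a hypothetical helper, not the package's current code.

```python
import urllib.request


def fetch(url, timeout=10.0):
    """Fetch a URL with a timeout that applies to this call only.

    The process-wide default set by socket.setdefaulttimeout is left
    untouched, so other networking code is not affected.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Callers that need a different limit pass it explicitly, for example fetch(url, timeout=30).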
