python-readability's Introduction

This code is licensed under the Apache License 2.0: http://www.apache.org/licenses/LICENSE-2.0

This is a Python port of a Ruby port of arc90's Readability project:

http://lab.arc90.com/experiments/readability/

Given an HTML document, it pulls out the main body text and cleans it up.

Ruby port by starrhorne and iterationlabs
Python port by gfxmonk

This port uses BeautifulSoup for the HTML parsing. That means it can be
a little slow, but it will work on Google App Engine (unlike libxml-based
libraries).


**note**: I don't currently have any plans for using or improving this
library, and it's far from perfect (slow, and almost certainly buggy).
So if you do something cool with it or have a better tool that does
the same job, please let me know and I can link to it from here.

If you're looking for alternatives / forks, here's the list so far:
 - http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
 - https://github.com/buriy/python-readability

python-readability's People

Contributors

seanbrant, timbertson

python-readability's Issues

bug in scoring

I think I found a bug in the way scoring is done.
For example, for an article page from CNN:
DEBUG:root:Candidate p#cnnContentContainer.cnn_storyarea with score 163.5
DEBUG:root:Candidate p#.cnn_contentarea with score 138.0
DEBUG:root:Candidate p#cnnContainer. with score 118.5
DEBUG:root:Candidate body#. with score 113.5
DEBUG:root:Candidate p#.cnn_strycntntlft with score 111.0

All five of those candidates are nested inside one another (body#->p.*), so the result includes too much text that is not needed.

One idea would be to remove child nodes from the parent before calculating its score.
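
The suggestion above can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Node` class stands in for a parsed HTML element, and `toy_score` is a plain word count, not the library's real scoring. It only shows how a parent that absorbs its children's text can outrank the actual article container, and how scoring each node on its own text alone avoids that.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML element."""
    label: str
    text: str = ""  # text directly inside this node, excluding children
    children: List["Node"] = field(default_factory=list)


def toy_score(text: str) -> int:
    """Toy metric (word count); not the library's real scoring."""
    return len(text.split())


def inclusive_score(node: Node) -> int:
    """Score a node on its own text plus all descendant text."""
    return toy_score(node.text) + sum(inclusive_score(c) for c in node.children)


def exclusive_scores(root: Node) -> Dict[str, int]:
    """Score every node on its own text only, ignoring child nodes."""
    scores = {}
    stack = [root]
    while stack:
        node = stack.pop()
        scores[node.label] = toy_score(node.text)
        stack.extend(node.children)
    return scores


article = Node("div.story", "word " * 50)
sidebar = Node("div.sidebar", "link " * 10)
body = Node("body", "nav footer", [article, sidebar])

# Inclusive scoring: the outer container always beats the nodes it wraps.
print(inclusive_score(body), inclusive_score(article))  # 62 50

# Exclusive scoring: the node that actually holds the text wins.
best = max(exclusive_scores(body).items(), key=lambda kv: kv[1])
print(best)  # ('div.story', 50)
```

A real fix would likely need a middle ground, since an article's text is often spread across several child paragraphs; this sketch only demonstrates the nesting effect described in the report.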

issue when compiling dragnet

vagrant@vagrant-ubuntu-trusty-64:/vagrant/dragnet$ sudo make test
nosetests --exe --cover-package=dragnet --with-coverage --cover-branches -v --cover-erase
nose.plugins.cover: ERROR: Coverage not available: unable to import coverage module
Failure: ImportError (No module named readability) ... ERROR
Failure: ImportError (No module named readability) ... ERROR

lxml error

No handlers could be found for logger "readability.readability"
Traceback (most recent call last):
File "readability_parse.py", line 73, in <module>
page_content = readability_about(html_path)
File "readability_parse.py", line 26, in readability_about
page_content = Document(html_str).summary()
File "build/bdist.linux-x86_64/egg/readability/readability.py", line 195, in summary
File "build/bdist.linux-x86_64/egg/readability/readability.py", line 147, in summary
File "build/bdist.linux-x86_64/egg/readability/readability.py", line 105, in _html
File "build/bdist.linux-x86_64/egg/readability/readability.py", line 109, in _parse
File "build/bdist.linux-x86_64/egg/readability/htmls.py", line 21, in build_doc
File "/home/work/anaconda2/lib/python2.7/site-packages/lxml/html/__init__.py", line 614, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 3103, in lxml.etree.fromstring (src/lxml/lxml.etree.c:70569)
File "parser.pxi", line 1828, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:106403)
File "parser.pxi", line 1716, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:105194)
File "parser.pxi", line 1086, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:99876)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94350)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95786)
File "parser.pxi", line 631, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95065)
readability.readability.Unparseable: None
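
The first line of the output above ("No handlers could be found for logger") means the library logged a message that was never displayed. The sketch below will not fix the parse error, but configuring the standard `logging` module before calling the library should surface whatever was logged, which may help diagnose it. The logger name is taken from the message above; what the library writes to it is not known from this report.

```python
import logging

# Route log records from all loggers, including "readability.readability",
# to stderr so they are no longer silently discarded.
logging.basicConfig(level=logging.DEBUG)

log = logging.getLogger("readability.readability")
log.debug("logging is configured")
```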

Polish characters

If a site contains Polish (and probably any other non-ASCII) characters, the script removes them from the output. Input:

test 123 zażółć gęślą jaźń tęst


Output:

test 123 za gl ja tst
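
The output above is what you get when non-ASCII characters are dropped during an ASCII conversion. The sketch below reproduces that effect with plain Python strings. It is an assumption about the kind of step that could cause this, not a confirmed diagnosis of where the library loses the characters.

```python
original = "test 123 zażółć gęślą jaźń tęst"

# Lossy: encoding to ASCII with errors="ignore" silently drops every
# character outside the ASCII range.
lossy = original.encode("ascii", errors="ignore").decode("ascii")
print(lossy)  # test 123 za gl ja tst

# Lossless: a UTF-8 round trip keeps every character.
kept = original.encode("utf-8").decode("utf-8")
print(kept == original)  # True
```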
