timbertson / python-readability

[abandoned] python port of arc90's readability bookmarklet

Python 100.00%

python-readability's Introduction

This code is under the Apache License 2.0.  http://www.apache.org/licenses/LICENSE-2.0

This is a Python port of a Ruby port of arc90's Readability project

http://lab.arc90.com/experiments/readability/

Given an HTML document, it extracts the main body text and cleans it up.

Ruby port by starrhorne and iterationlabs
Python port by gfxmonk

This port uses BeautifulSoup for the HTML parsing. That means it can be
a little slow, but it will work on Google App Engine (unlike libxml-based
libraries).


**note**: I don't currently have any plans for using or improving this
library, and it's far from perfect (slow, and almost certainly buggy).
So if you do something cool with it or have a better tool that does
the same job, please let me know and I can link to it from here.

If you're looking for alternatives / forks, here's the list so far:
 - http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
 - https://github.com/buriy/python-readability


python-readability's Issues

bug in scoring

I think I found a bug in the way the scoring is done.
For example, for an article page from CNN:
DEBUG:root:Candidate p#cnnContentContainer.cnn_storyarea with score 163.5
DEBUG:root:Candidate p#.cnn_contentarea with score 138.0
DEBUG:root:Candidate p#cnnContainer. with score 118.5
DEBUG:root:Candidate body#. with score 113.5
DEBUG:root:Candidate p#.cnn_strycntntlft with score 111.0

All five of those candidates are nested inside one another (body#->p.*), so the result shows too much text that is not needed.

An idea would be to remove child nodes from the parent before calculating the score.
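
The idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Node` class is a hypothetical stand-in for a parsed element, and the score is a plain character count, which is not the scoring this library actually uses. It only shows how counting descendant text inflates ancestors, and how scoring each node on its own text avoids that.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML element."""
    name: str
    text: str = ""  # text held directly by this node, excluding children
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def inclusive_score(node: Node) -> int:
    """Toy score: own text plus all descendant text, so ancestors overlap."""
    return len(node.text) + sum(inclusive_score(c) for c in node.children)


def exclusive_score(node: Node) -> int:
    """Toy score: own text only, i.e. child nodes removed before scoring."""
    return len(node.text)


# A nested chain like body -> container -> story from the debug output above.
story = Node("story", "x" * 100)            # the article text
container = Node("container", "y" * 10, [story])  # a little surrounding cruft
body = Node("body", "z" * 5, [container])

best_inclusive = max(walk(body), key=inclusive_score)
best_exclusive = max(walk(body), key=exclusive_score)
print(best_inclusive.name, best_exclusive.name)
```

With the inclusive score the outermost node wins because it absorbs everything beneath it; with the exclusive score the node that directly holds the article text wins.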

investigate

http://www.carrefour.com/cdc/group/current-news/colombia---opening-of-the-65-amp--66th-carrefour.html

readability gives the content; python-readability gives a hidden table.

issue when compiling dragnet

vagrant@vagrant-ubuntu-trusty-64:/vagrant/dragnet$ sudo make test
nosetests --exe --cover-package=dragnet --with-coverage --cover-branches -v --cover-erase
nose.plugins.cover: ERROR: Coverage not available: unable to import coverage module
Failure: ImportError (No module named readability) ... ERROR
Failure: ImportError (No module named readability) ... ERROR

lxml error

No handlers could be found for logger "readability.readability"
Traceback (most recent call last):
File "readability_parse.py", line 73, in <module>
page_content = readability_about(html_path)
File "readability_parse.py", line 26, in readability_about
page_content = Document(html_str).summary()
File "build/bdist.linux-x86_64/egg/readability/readability.py", line 195, in summary
File "build/bdist.linux-x86_64/egg/readability/readability.py", line 147, in summary
File "build/bdist.linux-x86_64/egg/readability/readability.py", line 105, in _html
File "build/bdist.linux-x86_64/egg/readability/readability.py", line 109, in _parse
File "build/bdist.linux-x86_64/egg/readability/htmls.py", line 21, in build_doc
File "/home/work/anaconda2/lib/python2.7/site-packages/lxml/html/__init__.py", line 614, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 3103, in lxml.etree.fromstring (src/lxml/lxml.etree.c:70569)
File "parser.pxi", line 1828, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:106403)
File "parser.pxi", line 1716, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:105194)
File "parser.pxi", line 1086, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:99876)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94350)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95786)
File "parser.pxi", line 631, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95065)
readability.readability.Unparseable: None

Polish characters

If a site contains Polish (and probably any non-standard) characters, the script removes them from the output. Input:

test 123 zażółć gęślą jaźń tęst

Output:

test 123 za gl ja tst
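
The output above is exactly what you get when non-ASCII characters are dropped during an ASCII conversion. The sketch below reproduces the symptom with the standard library only. Whether this is the actual cause inside the library is an assumption; the report does not say where the characters are lost.

```python
text = "test 123 zażółć gęślą jaźń tęst"

# Lossy path: encoding to ASCII with errors="ignore" silently drops
# every character outside the ASCII range.
lossy = text.encode("ascii", "ignore").decode("ascii")

# Lossless path: a UTF-8 round trip keeps the Polish characters intact.
safe = text.encode("utf-8").decode("utf-8")

print(lossy)
print(safe)
```

If the cause is similar, keeping the document as Unicode text from download to output (decoding the fetched bytes with the page's declared encoding, and never coercing to ASCII) would avoid the loss.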
