luin / readability

📚 Turn any web page into a clean view

JavaScript 21.97% HTML 77.74% Shell 0.29%

readability gbk instapaper jsdom

readability's Introduction

Readability

Turn any web page into a clean view. This module is based on arc90's readability project.

Features

Optimized for more websites.
Supporting HTML5 tags (article, section) and Microdata API.
Focusing on both accuracy and performance. 4x faster than arc90's version.
Supporting encodings such as GBK and GB2312.
Converting relative URLs to absolute for images and links automatically (thanks to Guillermo Baigorria & Tom Sutton).

Example

Before -> After

Install

$ npm install node-readability

Note that from v2.0.0, this module only works with Node.js >= 4.0. In the meantime you are still welcome to install a release in the 1.x series (by npm install node-readability@1) if you use an older Node.js version.

Usage

read(html [, options], callback)

Where

html is a URL or HTML code.
options is an optional options object.
callback is the callback to run: callback(error, article, meta).

Example

var read = require('node-readability');

read('http://howtonode.org/really-simple-file-uploads', function(err, article, meta) {
  // Main Article
  console.log(article.content);
  // Title
  console.log(article.title);

  // HTML Source Code
  console.log(article.html);
  // DOM
  console.log(article.document);

  // Response Object from Request Lib
  console.log(meta);

  // Close article to clean up jsdom and prevent leaks
  article.close();
});

NB: If the page is marked with a charset other than utf-8, it will be converted automatically. Charsets such as GBK and GB2312 are also supported.
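Because every successful read needs a matching article.close(), it can help to centralize that call. The following is a minimal sketch, not part of the library: readAndClose and its parameters are hypothetical names, and read is assumed to be the function exported by node-readability, passed in by the caller.

```javascript
// Hypothetical helper (not part of node-readability): runs `handler` on the
// parsed article and always calls article.close(), even if `handler` throws.
// `read` is assumed to be the function exported by node-readability.
function readAndClose(read, input, options, handler, done) {
  read(input, options, function (err, article, meta) {
    if (err) return done(err);
    var result;
    var failure = null;
    try {
      result = handler(article, meta);
    } catch (e) {
      failure = e;
    } finally {
      // Release the jsdom window whether or not the handler succeeded.
      article.close();
    }
    done(failure, result);
  });
}
```

With the real module this might be called as readAndClose(require('node-readability'), url, {}, function (article) { return article.title; }, callback).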

Options

node-readability passes the options to request directly. See the request library's documentation for all available options.

node-readability has two additional options:

cleanRulers, which lets you set your own validation rules for tags.

Each rule is a callback that receives the element and its tag name. If a rule returns true, the element is treated as valid; otherwise it is not. options.cleanRulers = [callback(obj, tagName)]

read(url, {
  cleanRulers: [
    function(obj, tag) {
      if(tag === 'object') {
        if(obj.getAttribute('class') === 'BrightcoveExperience') {
          return true;
        }
      }
    }
  ]}, function(err, article, response) {
    //...
  });
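To make the "returns true means valid" contract concrete, here is a small sketch of how a list of rules could be evaluated. It is an assumption about the behavior described above, not the library's actual code; isValidByRules and keepBrightcove are hypothetical names.

```javascript
// Hypothetical evaluation of a cleanRulers list: an element counts as valid
// as soon as one rule returns exactly `true`.
function isValidByRules(rules, element, tagName) {
  return rules.some(function (rule) {
    return rule(element, tagName) === true;
  });
}

// The rule from the example above, written as a named function.
function keepBrightcove(obj, tag) {
  return tag === 'object' &&
    obj.getAttribute('class') === 'BrightcoveExperience';
}
```

Naming the rules makes them easy to unit test with a stub element before passing them as cleanRulers: [keepBrightcove].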

preprocess, which should be a function that checks or modifies the downloaded source before it is passed to readability.

options.preprocess = callback(source, response, contentType, callback);

read(url, {
    preprocess: function(source, response, contentType, callback) {
      if (source.length > maxBodySize) {
        return callback(new Error('too big'));
      }
      callback(null, source);
    }
  }, function(err, article, response) {
    //...
  });
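The example above refers to a maxBodySize variable that is not defined in the snippet. One way to supply it, sketched here with the hypothetical name limitBodySize, is a small factory that returns a preprocess function with the limit bound in.

```javascript
// Hypothetical factory: builds a `preprocess` function that rejects any
// downloaded source longer than `maxBodySize`.
function limitBodySize(maxBodySize) {
  return function preprocess(source, response, contentType, callback) {
    if (source.length > maxBodySize) {
      return callback(new Error('too big'));
    }
    // Pass the source through unchanged.
    callback(null, source);
  };
}
```

It could then be used as read(url, { preprocess: limitBodySize(1024 * 1024) }, callback).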

article object

content

The article content of the web page. It is false if extraction failed.

title

The article title of the web page. It may not be the same as the text in the <title> tag.

textBody

A string containing all the text found on the page

html

The original html of the web page.

document

The document of the web page generated by jsdom. You can use it to access the DOM directly (for example, article.document.getElementById('main')).

meta object

The response object from the request library. It can be useful if you need the final URL after all redirects or some of the response headers.
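As a sketch of reading those values: request responses usually expose the final request as meta.request, with the parsed URL under uri.href, and Node lower-cases header names. The helpers finalUrl and header below are hypothetical, and the property layout is an assumption you should verify against the request version you have installed.

```javascript
// Hypothetical helpers for the meta (response) object.
// Assumes request's usual layout: meta.request.uri.href holds the URL that
// was finally fetched, and meta.headers uses lower-case header names.
function finalUrl(meta, fallbackUrl) {
  var req = meta && meta.request;
  if (req && req.uri && req.uri.href) {
    return req.uri.href;
  }
  return fallbackUrl;
}

function header(meta, name) {
  var headers = (meta && meta.headers) || {};
  return headers[name.toLowerCase()];
}
```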

Why not Cheerio

This library uses jsdom to parse HTML instead of cheerio because some data, such as image size and element visibility, cannot be obtained when using cheerio, which would significantly affect the result.

Contributors

https://github.com/luin/node-readability/graphs/contributors

License

This code is under the Apache License 2.0. http://www.apache.org/licenses/LICENSE-2.0

readability's People

Contributors

brianmaissy lt1946 tury aptgeek btmills oliverjash rumpl haochong alexgaspar f1nnix wfang2002 kingxsp honestqiao gbaygon engmsaleh indooorsman buley mcpoet hellmagic newstex yanni4night aiex leizongmin joeylin zhuangya usirin ralucas eiriklv binwangcn olragon dfleming juancarloscruzd jessepollak levi aronwoost buggyj murshen pdehaan vlad-x ph0bos shaohua mz2 yawetse shanelau rattrayalex coornail ybak vinci-xu ecauchy markg85 mozii royaltm ruanyl free-language midknight41 rskumar sbmaxx outboundexplorer zzz6519003 mtford90 kublaj carwestsam stamaimer rdy hcxiong hermesreader woodwardoge tecmus ikenbe podcctv mxr576 panozzaj csotiriou themucha abeet ournet neo-nie onlyone0001 type-of-read tildalabs chrisyip synle jupapios herbyme aroneiermann bigindian acardinale lilidd giastfader codingabe zyq001 lwdgit meilunzhi lehoaian lixiangnlp wzbg swenhu miaokuan myclry pieterscheffers

readability's Issues

Main article image

Any chance of providing an API to fetch the main image of the article soon?
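Until such an API exists, one workaround is to look for common social-media meta tags in article.document. This is only a sketch: findLeadImage is a hypothetical name, it assumes the jsdom document supports querySelector, and many pages will not carry these tags at all.

```javascript
// Hypothetical helper: looks for a lead image in well-known meta tags.
// `document` is assumed to be article.document (a jsdom Document).
function findLeadImage(document) {
  var selectors = [
    'meta[property="og:image"]',
    'meta[name="twitter:image"]'
  ];
  for (var i = 0; i < selectors.length; i++) {
    var node = document.querySelector(selectors[i]);
    var value = node && node.getAttribute('content');
    if (value) return value;
  }
  // No candidate found.
  return null;
}
```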

article

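I'm building a web bookmarking tool, and I only need the content of article.title. Is there a configuration option so that, once I have the title, it stops fetching the rest of the page content? In my tests, some pages take four or five seconds before the article content is fully fetched, which is a bit long for me.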
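If only a title is needed, one possible shortcut is to skip DOM parsing and read the <title> tag straight from the HTML string. This sketch (quickTitle is a hypothetical name) still requires the page to be downloaded, does not decode HTML entities, and may differ from article.title, which is not always the same as the <title> text.

```javascript
// Hypothetical fast path: pull the <title> text out of raw HTML with a
// regular expression instead of building a DOM.
function quickTitle(html) {
  var match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  if (!match) return null;
  // Collapse runs of whitespace and trim the ends.
  return match[1].replace(/\s+/g, ' ').trim();
}
```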

Screenshots

Hi. This might be an awesome project but I don't know yet because I can't see an example of what it produces.

It would be great if you'd start out the README.markdown with a big BEFORE/AFTER screenshot.

gr,

Tom

Joining multiple pages

Have there been any thoughts about joining multiple pages of a paginated article together? I notice the fork of Readability that Safari uses does this.

memory leak

In the jsdom callback, you should call window.close().

The jsdom documentation gives just a hint:

jsdom.env(html, function (errors, window) {
  // free memory associated with the window
  window.close();
});

But the change is spectacular: after parsing 20 pages (and forcing the GC), the heap took 60MB without window.close() and 30MB with it.

I put the window.close() just after the error/success callback, expecting it to break everything (window.document seems to be used lazily to get the content/title). Amazingly it didn't, but my naive patch still seems hazardous!

Modifying content as DOM

I'd like to change some of the content using things like getElementsByTagName. I know I can do this to the document object, but what about content? Currently I'm passing the content as HTML back into another instance of jsdom to do this.

Some relative images not made absolute

In this link: http://www.alternet.org/mccain-suggests-israel-go-rogue-blow-iran-negotiations-starting-war

The URL I get after scraping is file:///files/styles/story_image/public/images/AFP/photo_1327270719840-1-0.jpg
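As a stopgap on the caller's side, a file: URL like the one above can be re-resolved against the page URL. This is a sketch of a workaround, not a fix inside the library; rebaseFileUrl is a hypothetical name and it relies on the global URL class available in current Node.js releases.

```javascript
// Hypothetical workaround: if a link was resolved to file:///..., re-resolve
// its path against the URL of the page it came from.
function rebaseFileUrl(candidate, pageUrl) {
  var parsed;
  try {
    parsed = new URL(candidate);
  } catch (e) {
    return candidate; // not an absolute URL, leave untouched
  }
  if (parsed.protocol !== 'file:') return candidate;
  return new URL(parsed.pathname + parsed.search + parsed.hash, pageUrl).href;
}
```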

Doesn't get content: confitdent.com

This page does not seem to yield most of its content; only one line is extracted.

http://confitdent.com/lebron-james-workout-and-diet/

io.js v3 support

Since jsdom doesn't support node.js <= 0.12 anymore:

Note that as of our 4.0.0 release, jsdom no longer works with Node.js™ (why?), and instead requires io.js (which is planned to replace Node.js™).

And jsdom <= 3.x could not be installed on io.js:

gyp ERR! build error
gyp ERR! stack Error: `make` failed with exit code: 2
gyp ERR! stack     at ChildProcess.onExit (/Users/Chris/.nvm/versions/io.js/v3.2.0/lib/node_modules/npm/node_modules/node-gyp/lib/build.js:269:23)
gyp ERR! stack     at emitTwo (events.js:87:13)
gyp ERR! stack     at ChildProcess.emit (events.js:172:7)
gyp ERR! stack     at Process.ChildProcess._handle.onexit (internal/child_process.js:200:12)
gyp ERR! System Darwin 15.0.0
gyp ERR! command "/Users/Chris/.nvm/versions/io.js/v3.2.0/bin/iojs" "/Users/Chris/.nvm/versions/io.js/v3.2.0/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js" "rebuild"
gyp ERR! cwd /Users/Chris/Workspace/demo/node_modules/jsdom/node_modules/contextify

So I suggest:

Drop node.js <= 0.12 and io.js v1.x (because there's an installation issue with io.js v1.x), or
Switch to cheerio (note: we may need to implement iframe support).

Using both article.content and article.textBody does not work

If I do this:

var textBody = article.textBody;
var content = article.content;

The content turns out to be wrong, but if I swap the order, the textBody is wrong instead. Only one of the two can be read correctly.

Strange encoding when scraping Inc.com

When scraping this link, I get data like the string below: http://www.inc.com/gene-marks/the-one-way-to-tell-if-you-re-a-successful-entrepreneur.html?cid=sf01001

<div>���z�8�0��N��Yݲ�M|��ؽ|J�i���t�^�~�P$$1�H
I�Vw�y�m��ߵ|���dW&lt;�d�03�L'" �B�PU�*���������c֏���W���^�[�G�p�^����ݭ���Wonoo���L����p���wRO^�������{�������������YW廝T{]��߿�Y����h��w;�(v����s��?{�:�
��[q,����[�w�[|��_������n/U5��Ad�"6'
9�������0b��1�|v�]��tٯ��pf��ȲxvG.;�����Q��b'��C|U�����d�9��lZ�3��4q/ڭ���gc�Ģ&gt;a�    �F��H��кX0`��L�Ȝ���
g��+�{�
�uǵZM����Vau�[��

Any guess why or how to convert it to something readable?
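The thread does not confirm a cause, but output like this often appears when a compressed (gzip) response body is treated as text. Under that assumption, the sketch below detects the gzip magic bytes and decompresses the buffer with Node's zlib module; maybeGunzip is a hypothetical name. The request library also documents a gzip option, though whether it resolves this particular case is untested here.

```javascript
var zlib = require('zlib');

// Hypothetical helper: if the buffer starts with the gzip magic bytes
// (0x1f 0x8b), decompress it; otherwise return it unchanged.
function maybeGunzip(buffer) {
  var isGzip = buffer.length > 2 && buffer[0] === 0x1f && buffer[1] === 0x8b;
  return isGzip ? zlib.gunzipSync(buffer) : buffer;
}
```

A function like this could be applied to the downloaded source before parsing, for example from a preprocess callback, if the source is available there as a Buffer.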

body undefined

Nodejs 0.8.12

failed at /node_modules/node-readability/src/readability.js:89
if (typeof body !== 'string') body = body.toString();
^
TypeError: Cannot call method 'toString' of undefined

every single page has the same issue.

Any copyright issues using readability to render content to the client?

Hi,

I am making a news reader app and would like to implement a reader view using this module.

But I am wondering whether this would infringe copyright, since the content has to go through my server before being rendered to the client.

This is different from Apple's Safari reader view or Adblock Plus, since those run on the client side.

Empty "HTML" causes massive issues

When reading certain URLs, the body returns empty, which I believe is because of being blocked by the provider. When this happens, instead of an error being returned, an exception is raised by jsdom, because the empty HTML object is passed right into it.

Simple STR:

var readability = require('node-readability');
var url = 'http://dotearth.blogs.nytimes.com/2013/11/21/did-90-companies-cause-the-climate-crisis-of-the-21st-century/';
readability.read(url, { timeout: 5000 }, function(err, article) {
   // It will never reach this point
   console.log(err, article);
});

Adding this line to line 94 of readability.js solves the issue (although it doesn't fix not being able to read the URL).

    if (typeof body !== 'string') body = body.toString();
    if (!body) return callback('No Body Found');

I can make this into a pull request if needed, but I'm not sure what the deeper issue is where these URLs aren't readable.
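Note that in the two-line patch above, body.toString() still runs first, so an undefined body would throw before the emptiness check is reached. A sketch of a guard that checks in the safer order follows; normalizeBody and checkBody are hypothetical names, not functions from readability.js.

```javascript
// Hypothetical guard: turn a missing body into an empty string before any
// method is called on it.
function normalizeBody(body) {
  if (body === undefined || body === null) return '';
  return typeof body === 'string' ? body : body.toString();
}

// Reports an error through the callback instead of letting jsdom throw.
function checkBody(body, callback) {
  var text = normalizeBody(body);
  if (!text) return callback(new Error('No Body Found'));
  callback(null, text);
}
```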

Bump request version

request version 2.4.x seems to crash Node:

FATAL ERROR: v8::HandleScope::CreateHandle() Cannot create a handle without a HandleScope

The newest version of request works ok.

When body tag is not present innerHTML fails

Hi,

Thank you for the effort to make readability accessible through Node! We are using it to crawl pages, and there I found the following bug:

node_modules/node-readability/src/helpers.js:72
document.body.innerHTML = document.body.innerHTML.replace(regexps.replaceBrs
^
TypeError: Cannot read property 'innerHTML' of null
at Object.module.exports.prepDocument (/data/nodejs/node_spider/node_modules/node-readability/src/helpers.js:72:42)
at new Readability (/data/nodejs/node_spider/node_modules/node-readability/src/readability.js:20:11)
at jsdom.env.done (/data/nodejs/node_spider/node_modules/node-readability/src/readability.js:98:24)
at /data/nodejs/node_spider/node_modules/node-readability/node_modules/jsdom/lib/jsdom.js:205:39
at process._tickCallback (node.js:415:13)

It seems that when the body tag is not present in a document (really bad HTML coding), document.body is null, so reading innerHTML and calling replace fails.

Is it possible to fix this?

Thank you,
Max
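One way to avoid the TypeError in the trace is to check document.body before touching innerHTML. The sketch below shows that guard in isolation; replaceInBody is a hypothetical name, and the pattern is passed in because the contents of regexps.replaceBrs are not shown in this report.

```javascript
// Hypothetical guard: only rewrite innerHTML when the document has a body.
// Returns true if the replacement ran, false if there was nothing to do.
function replaceInBody(document, pattern, replacement) {
  if (!document || !document.body) return false;
  document.body.innerHTML = document.body.innerHTML.replace(pattern, replacement);
  return true;
}
```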

Running node-readability in electron (node.js 0.11.13)

Hi. Can't resolve this issue. Can you help?

Using node-readability in electron (https://github.com/atom/electron)

After a plain install I got "Uncaught Error: Could not locate the bindings file."
Then, after "node-gyp rebuild", I got this:

Uncaught Error: Module version mismatch. Expected 44, got 14.", source: 
/path_to_project/node_modules/node-
readability/node_modules/jsdom/node_modules/contextify/node_modules/bindings/bindings.js

Is it possible to run node-readability under node.js 0.11.x?

How can I run the code on my iPhone under iOS?

Does anyone have an idea how to run this code on an iOS device?

Failed to install on Mac 10.9.2

npm http 304 https://registry.npmjs.org/bindings

[email protected] install /Users/kennylee/Dropbox/Code/BitBucket/InternationalNewsBackend/node_modules/node-readability/node_modules/jsdom/node_modules/contextify
node-gyp rebuild

sh: node-gyp: command not found
npm ERR! Error: ENOENT, open '/Users/kennylee/Dropbox/Code/BitBucket/InternationalNewsBackend/node_modules/node-readability/node_modules/jsdom/node_modules/cssstyle/lib/properties/letterSpacing.js'
npm ERR! If you need help, you may report this entire log,
npm ERR! including the npm and node versions, at:
npm ERR! http://github.com/npm/npm/issues

npm ERR! System Darwin 13.1.0
npm ERR! command "node" "/usr/local/bin/npm" "install" "node-readability"
npm ERR! cwd /Users/kennylee/Dropbox/Code/BitBucket/InternationalNewsBackend
npm ERR! node -v v0.10.26
npm ERR! npm -v 1.4.3
npm ERR! path /Users/kennylee/Dropbox/Code/BitBucket/InternationalNewsBackend/node_modules/node-readability/node_modules/jsdom/node_modules/cssstyle/lib/properties/letterSpacing.js
npm ERR! code ENOENT
npm ERR! errno 34
npm http 304 https://registry.npmjs.org/domhandler
npm http 304 https://registry.npmjs.org/domutils
npm ERR! [email protected] install: node-gyp rebuild
npm ERR! Exit status 127
npm ERR!
npm ERR! Failed at the [email protected] install script.
npm ERR! This is most likely a problem with the contextify package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node-gyp rebuild
npm ERR! You can get their info via:
npm ERR! npm owner ls contextify
npm ERR! There is likely additional logging output above.

npm ERR! System Darwin 13.1.0
npm ERR! command "node" "/usr/local/bin/npm" "install" "node-readability"
npm ERR! cwd /Users/kennylee/Dropbox/Code/BitBucket/InternationalNewsBackend
npm ERR! node -v v0.10.26
npm ERR! npm -v 1.4.3
npm ERR! code ELIFECYCLE
npm http 304 https://registry.npmjs.org/domelementtype

Get text content

This library seems to be centered around HTML presentation, as the article seems to only return HTML-based properties.

I have a use case where I only care about the text in the article, and would love to get something like article.text that was just an array/string of the text of the article content.

Wondering if this is in the pipeline? Or if you would accept a PR for this?

Some sites that don't work well: Medium, Al Jazeera

Medium regularly retrieves no images, and sometimes the article is cut off near the end.

e.g.: https://medium.com/@erikdkennedy/7-rules-for-creating-gorgeous-ui-part-1-559d4e805cda

Al Jazeera doesn't get the title.

e.g.: http://www.aljazeera.com/news/2015/03/isil-fighters-bulldoze-ancient-assyrian-palace-iraq-150305195222805.html

Invalid HTML results in request error

When I pass in the following string as HTML:

    �����ZYS�Y�~vG���N���mkE���ծ*��,1[��D���J%H��de �'&amp;B���;�������X�e�����t��L=�_�s���I,���5��Gh��.��{�w�=�k�}8�K�I��H]����]+�3=]��ؒ ����G$)��p$�Bܞ�����Ot92�?g���������� q��R� �ٴ�q�Ϟ&gt;O�%!�Ͽ��hg����_B��g�7��Z�;&quot;ġ������[,�纄 ˥�H&quot;�2��
    ���ЛI��&quot;���6��2aA�SѤ�M�����RLh#�?(������ϑ\V.��ܒ���r�U^�ե���~2&gt;�@9xM�'��� C/��M������sus��A[�j�J�Og%&gt;����_.&amp;H�L�����N�W�Ȯ��Jn\=��H����|8&quot;�K�ђ2&gt;!��Sm�:��C����d�U��E���HJ����*��]�v�^_C���d�����p;]^���q�RoLptq�$����ӛ�bAV�,F�Ab�\�AV�z$���'��s�\7G�����M�2��ٿ��+:\v��c�+�m��Z��6���/j�V���h����s���u�sp�(J�
    a�� �Ѽj%���������\w����?U ��I|Zb�9Uj���F�R�h������gء�0v�|(�������oܨ�&quot;s��*1��f�� kX(�`��E�ޅ�pE���5��U�R]��v�)�ru�����L�K��l��h�ʕV�XhW�n�z{�c�N��n�A����&gt;k���2���r����ui����������s0���te��;ک.�s�r�mr�)�)�7�ꀑ��0/P/����M�Xjuv�^&quot;��V0�:#x��F��NAtD�.{2��2\Vν��G���B�������Pˎ(��������f��P���Y/���s[}W7
    Ib��|����d���쾮�ϒ�ESC0�g��wqQ@{�0�l8*&amp;c\oK(����2;���j�z��wWZ�F�d{BY��M�p&quot;��ԍ�6�Gv6���em��|�@����H�ee� �'�Α�&gt;mgC&gt;�&amp;�c$7O�%��oJ��ձ](o&gt;Q�����TW^l*c#��2 (Y���ȓ�q���|M ��ɬ\�Q�_�.@$�y�����E�Ѡ�ϣd�e%;��В�F�eeRYX�޵������#�?1�������� 'r�G�v�L畭�u�;ڑ��@{������K�0&quot;e�����Z��#m�-9�Q��A-h���J��k��\�:
    |&quot;�H�0�ΐ���^g����d����^�m|Gr�?��A���P���?��c �k�������C���5�rv����l���N�Ŝ�c��&amp;�j�ET�9\Quk�;�����n��BB�Ίz�q������Q#�����]�t|�N������ t�\׺��@����|��} ��D;.ʠ�Zg���q�A��X,��6����O B�6�TZ`��i.��za��vFX&amp;� K� ����2����]������T�튆�1����-��s����].��z�-�K&amp;cQ�C���sC��yd�n���#b?]�0n�).�����
    �SW���Pu����ԚR��O���.&amp;�$k��Z%7A���`9������ۢ�M��g��ֺ�Ѱ�i3�_m�g����UX�d~��4m�V�̻tj��r?��ˡ.������2|r��Ll讎�]=���x�ǶQ���-��jůǭZ%�&quot;���G O��E�8��`_����ce��:��������8��-Wf^RLS7�M��|�����aHߖ�� ��l�+/fԭ-��UfKZ���=�&quot;��2��2�DN�?���&amp;���Nh�:���ϸ���ީ)~����S��Q�����)�/Iy_��$�����Z!��\�&amp;C�ri�!kf\.��#�=((F�'���a�����3$�i�p�J�r)�����G��{������G���3U������=o��������z��虆qz��Ĝ���:6L޾��@�o��K�
    �}�5� �MO�����&gt;Cߧ�}�a��@_ ��^�W�����\���|����1���T��rq���4]t����g� z�aK�0�����cp�+������U����.������LJS�N�}�R �K-�C,���du����}g��r�P��ryˌ��WZ��M�CT�������H��r�Џ��� �B��� �����|j�D��'?AZ�� �F��)�%�G����������@/�MAP���| �����\�T�g�����H�;  �g�I�sP]Ɏ)Ko��dr��R�UC�����b��p`j�C��a��3���(:��crqH&gt;\��#�SB࿲���^c�hgN=y�A|�����9C�L�a�w�D�:y'��q|����i���R�#�
    Cm(T�������� W���4�ɿ���Ûzwfl�6������lL����L��9�#��p�����7�����M���4;�w��������9#�ҙ0��z!�s���E�a��&amp;K�n�Iq��=��!��@�0�Z���.����`���Ä��&amp;Hߦ6�D]\��#���;7'R�������&gt;,*#���)!d��ˡ7�Ќ��5s�0��Q?�Y�(���4��K������#|�I������u��c��L.Α���Iep��

It should return an empty string as readable content, but instead I get:

    [2015-06-30 18:57:57.560] [ERROR] [default] - { [Error: Invalid protocol: null] cause: [Error: Invalid protocol: null], isOperational: true }
    Error: Invalid protocol: null
        at Request.self._buildRequest (/home/tom/projects/project/project-pce/node_modules/node-readability/node_modules/request/request.js:355:53)
        at Request.init (/home/tom/projects/project/project-pce/node_modules/node-readability/node_modules/request/request.js:533:10)
        at new Request (/home/tom/projects/project/project-pce/node_modules/node-readability/node_modules/request/request.js:99:8)
        at request (/home/tom/projects/project/project-pce/node_modules/node-readability/node_modules/request/index.js:54:11)
        at Function.read (/home/tom/projects/project/project-pce/node_modules/node-readability/src/readability.js:170:5)
        at Function.tryCatcher (/home/tom/projects/project/project-pce/node_modules/bluebird/js/main/util.js:24:31)
        at ret (eval at <anonymous> (/home/tom/projects/project/project-pce/node_modules/bluebird/js/main/promisify.js:155:12), <anonymous>:14:23)
        at extractContent (/home/tom/projects/project/project-pce/app/index.js:34:10)
        at tryCatcher (/home/tom/projects/project/project-api/node_modules/bluebird/js/main/util.js:24:31)
        at Promise._settlePromiseFromHandler (/home/tom/projects/project/project-api/node_modules/bluebird/js/main/promise.js:452:31)
        at Promise._settlePromiseAt (/home/tom/projects/project/project-api/node_modules/bluebird/js/main/promise.js:530:18)
        at Promise._settlePromises (/home/tom/projects/project/project-api/node_modules/bluebird/js/main/promise.js:646:14)
        at Async._drainQueue (/home/tom/projects/project/project-api/node_modules/bluebird/js/main/async.js:182:16)
        at Async._drainQueues (/home/tom/projects/project/project-api/node_modules/bluebird/js/main/async.js:192:10)
        at Immediate.Async.drainQueues [as _onImmediate] (/home/tom/projects/project/project-api/node_modules/bluebird/js/main/async.js:15:14)
        at processImmediate [as _immediateCallback] (timers.js:358:17)

It seems like it is misinterpreting this string as a URL and passing it to request?
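A caller can sidestep the ambiguity by deciding up front whether the input is a URL. The check below is a conservative sketch, not the test the library itself performs; looksLikeUrl is a hypothetical name and it uses the global URL class from current Node.js releases.

```javascript
// Hypothetical check: treat the input as a URL only if it starts with
// http:// or https://, contains no whitespace, and parses as a URL.
function looksLikeUrl(input) {
  if (typeof input !== 'string') return false;
  var trimmed = input.trim();
  if (!/^https?:\/\//i.test(trimmed) || /\s/.test(trimmed)) return false;
  try {
    new URL(trimmed);
    return true;
  } catch (e) {
    return false;
  }
}
```

Input that fails this check could be rejected, or wrapped in a minimal HTML document, before being handed to read.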

Change jsdom module to cheerio. It should become up to 8x faster

Cheerio module works faster:

ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.

https://github.com/MatthewMueller/cheerio

About the returned article object.

How to Fetch Author details from Article

I'm not able to fetch author details from the article using node-readability.
With node-readability I can only fetch the title and content.

I need output like that of the Readability Parse API (https://www.readability.com/api/):

{
"domain": "www.ainonline.com"
"next_page_id": null
"url": "http://www.ainonline.com/aviation-news/aerospace/2016-01-14/boeing-and-unions-reconciled-sides-reach-tentative-deal"
"short_url": "http://rdd.me/xbgocfkn"
"author": "Gregory Polek"
"excerpt": "In a development portrayed as a product of efforts to mend years of bitterly contentious labor relations, Boeing management and leaders of the Society of Professional Engineering Employees in&hellip;"
"direction": "ltr"
"word_count": 515
"total_pages": 0
"content": "<div><div class="field field-name-body field-type-text-with-summary field-label-hidden field-wrapper body field"><p>In a development portrayed as a product of efforts to mend years of bitterly contentious labor relations, Boeing management and leaders of the Society of Professional Engineering Employees in Aerospace (<span class="caps">SPEEA</span>) have tentatively agreed on a six-year contract extension, the sides announced Wednesday evening. Scheduled for a vote by mail-in ballots between January 27 and February 10, the largely unanticipated agreement comes several months before the current contract&#x2019;s amendable&#xA0;date.</p> <p><span class="dquo">&#x201C;</span>These negotiations were possible because <span class="caps">SPEEA</span> and Boeing decided not to let our areas of disagreement prevent us from making progress on items where we do agree,&#x201D; said <span class="caps">SPEEA</span> executive director Ray Goforth. &#x201C;These contract extensions are the result of a lot of hard work and good will.&#xA0;Hopefully, this gives us a template for the&#xA0;future.&#x201D;</p> <p>Negotiations grew from discussions during regular meetings between the union and Boeing in recent months, said <span class="caps">SPEEA</span>. Calling both sides receptive to avoiding confrontation that characterized past negotiations, the <span class="caps">SPEEA</span> member-elected executive board began formal talks with Boeing after the holiday break. Several of the elected officials served on previous union contract negotiation&#xA0;teams.</p> <p>During the last round of negotiations involving <span class="caps">SPEEA</span>, 96 percent of the union members voted down Boeing&#x2019;s first contract offer in September 2012. 
Tensions only heightened after <span class="caps">SPEEA</span> filed charges with the National Labor Relations Board on October 5 accusing Boeing officials of videotaping union members engaged in &#x201C;solidarity&#x201D; marches, seizing employees&#x2019; cameras and deleting photos of their activities during lunchtime rallies in Portland, Oregon; and Everett,&#xA0;Washington. By February a strike appeared imminent. Not until Boeing extended most elements of the previous contracts, including 5 percent annual wage pools and no increases to employees for medical coverage, did the engineers narrowly agree to a deal on February 19. Following a second rejection, technical workers finally accepted the same offer in a third vote counted a month&#xA0;later.</p> <p>This time, both sides struck a more conciliatory tone. &#x201C;This tentative agreement recognizes the significant contributions of our engineering and technical workforce and reinforces Boeing's commitment to the Puget Sound region,&#x201D; said Boeing Commercial Airplanes president and <span class="caps">CEO</span> Ray&#xA0;Conner.</p> <p><span class="caps">SPEEA</span> called elements in the new offer that would help employees affected by the decisions by the company to move work &#x201C;a major improvement,&#x201D; while praising the company&#x2019;s commitment to use &#x201C;exhaustive&#x201D; efforts to place individuals affected by any such moves. If placement efforts fail, said the union, laid-off workers would receive a minimum of 26 to a maximum of 60 weeks of pay&#x2014;or two weeks per year of service&#x2014;and six months of medical and dental coverage. Those protections would come with a doubling of the existing voluntary layoff&#xA0;benefits.</p> <p>For Boeing&#x2019;s part, the deal would finally do away with the last vestiges of its traditional pension plan for <span class="caps">SPEEA</span>-represented employees hired before March 2013. 
A new retirement savings program would include a new special company retirement contribution and &#x201C;enhanced&#x201D; 401(k) transition contributions. All other employees represented by the union already participate in a new retirement savings&#xA0;program.</p> <p>While the majority of the union&#x2019;s members work at Boeing facilities in the Puget Sound region, the contract offers also cover workers in Oregon, Utah, California and&#xA0;Florida.</p> </div> <div class="show-for-print"><p>http://www.ainonline.com/aviation-news/aerospace/2016-01-14/boeing-and-unions-reconciled-sides-reach-tentative-deal</p></div> </div>"
"date_published": "2016-01-14 00:00:00"
"dek": null
"lead_image_url": "http://www.ainonline.com/apple-touch-icon-144x144.png"
"title": "Boeing and Unions Reconciled as Sides Reach Tentative Deal"
"rendered_pages": 1
}

Resource URLs become broken

Links to pagination and resources are broken if they are relative. If the links were made absolute during article extraction, usability would be improved.

Fails to extract content from BBC - missing body tag?

http://www.bbc.co.uk/sport/0/football/24841086

Console shows:

node-readability\src\helpers.js:308
textContent = e.textContent.trim();
^
TypeError: Cannot call method 'trim' of undefined

image / link in html

When I load raw html, I get an exception if I have an image or a link.

Example :

var readability = require('node-readability');

readability("<html><body><div class='article'>" +
 // the a / img will trigger the exception
 "<a href='#top'><img src='/img/never.png'></a> THE END IS NEVER THE END IS NEVER " + 
"</div></body></html>", function(err, doc){
    console.log("title : ", doc.title, "content : ", doc.content);
});

The error is :

TypeError: Parameter 'url' must be a string, not object
    at Url.parse (url.js:107:11)
    at urlParse (url.js:101:5)
    at Object.urlResolve [as resolve] (url.js:405:10)
    at fixLink (/tmp/node_modules/node-readability/src/helpers.js:498:25)
    at fixLinks (/tmp/node_modules/node-readability/src/helpers.js:507:43)
    at prepArticle (/tmp/node_modules/node-readability/src/helpers.js:601:3)
    at Object.module.exports.grabArticle (/tmp/node_modules/node-readability/src/helpers.js:261:3)
    at Readability.getContent (/tmp/node_modules/node-readability/src/readability.js:50:32)
    at Readability.content (/tmp/node_modules/node-readability/src/readability.js:29:17)

I load raw html so jsdomParse sets window.document.originalURL to null. When resolving the links in fixLink I still have null in e.ownerDocument.originalURL which triggers the exception.
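A link resolver that tolerates a missing base URL would avoid this exception. The sketch below illustrates the idea with a hypothetical resolveLink function; it is not the library's fixLink and uses the global URL class rather than url.resolve.

```javascript
// Hypothetical resolver: when no base URL is known (raw HTML input), leave
// the link as written instead of trying to resolve it.
function resolveLink(baseUrl, href) {
  if (typeof baseUrl !== 'string' || baseUrl === '') return href;
  try {
    return new URL(href, baseUrl).href;
  } catch (e) {
    return href; // unparsable combination, keep the original value
  }
}
```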

Memory is not freed after article.close()

Memory usage

Readability or jsdom uses a large amount of RAM (10+ MB) to parse a small web page (500 KB), and the memory is never freed.
This prevents us from using node-readability in our web scraper.

I'm not sure whether it is caused by jsdom. If so, is there an easy way to replace jsdom with cheerio?
It would be great if there were a config option for that.

My environment

CPU: 2.3 GHz Intel Core i7
OS: OSX Yosemite 10.10.1 (14B25)
RAM: 16 GB 1600 MHz DDR3
Node.js version: v0.11.14

How to reproduce problem

var read = require('node-readability');

function useNodeReadability() {
    read('http://farsnews.com/newstext.php?nn=13930926000105', function(error, article, meta) {
        if (error)
        {
            console.error('Fetch Error');
            process.exit();
        }

        console.log('Readability work done here');
        article.close();
    });
}

setInterval(function() {
    console.log(process.memoryUsage());
}, 1000);

setInterval(useNodeReadability, 5000);

Expected result

Memory should be freed after each execution.

Actual result

Heap and RSS memory keep increasing.

My results:

{ rss: 93995008, heapTotal: 74054656, heapUsed: 44526608 }
{ rss: 94306304, heapTotal: 74054656, heapUsed: 44886264 }
{ rss: 94310400, heapTotal: 74054656, heapUsed: 44932392 }
{ rss: 94310400, heapTotal: 74054656, heapUsed: 44940960 }
{ rss: 94576640, heapTotal: 75074560, heapUsed: 45400320 }
{ rss: 94986240, heapTotal: 76106496, heapUsed: 45895440 }
{ rss: 95064064, heapTotal: 76106496, heapUsed: 46026496 }
{ rss: 95072256, heapTotal: 76106496, heapUsed: 46059808 }
Readability work done here
{ rss: 107675648, heapTotal: 86389760, heapUsed: 61067240 }
{ rss: 107819008, heapTotal: 86389760, heapUsed: 61295024 }
{ rss: 107819008, heapTotal: 86389760, heapUsed: 61302864 }
{ rss: 107819008, heapTotal: 86389760, heapUsed: 61313048 }
{ rss: 107925504, heapTotal: 86389760, heapUsed: 61485680 }
{ rss: 107933696, heapTotal: 86389760, heapUsed: 61532520 }
{ rss: 108036096, heapTotal: 86389760, heapUsed: 61665536 }
Readability work done here
{ rss: 115023872, heapTotal: 90505472, heapUsed: 56241112 }
{ rss: 115023872, heapTotal: 90505472, heapUsed: 56259632 }
{ rss: 115052544, heapTotal: 90505472, heapUsed: 56660656 }
{ rss: 115130368, heapTotal: 90505472, heapUsed: 56715088 }
{ rss: 115376128, heapTotal: 90505472, heapUsed: 56893912 }
Readability work done here
{ rss: 115843072, heapTotal: 90505472, heapUsed: 63118632 }
{ rss: 115843072, heapTotal: 90505472, heapUsed: 63210512 }
{ rss: 115851264, heapTotal: 90505472, heapUsed: 63264552 }
Readability work done here
{ rss: 116563968, heapTotal: 92569344, heapUsed: 69165648 }
{ rss: 116572160, heapTotal: 92569344, heapUsed: 69255816 }
{ rss: 116936704, heapTotal: 92569344, heapUsed: 69453480 }
Readability work done here
{ rss: 120549376, heapTotal: 95653120, heapUsed: 74474080 }
...
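To put the log above into more readable units, the growth between two process.memoryUsage() samples can be converted to megabytes. This is a small illustrative sketch; toMB and heapGrowthMB are hypothetical names.

```javascript
// Convert a byte count to megabytes, rounded to two decimals.
function toMB(bytes) {
  return Math.round((bytes / (1024 * 1024)) * 100) / 100;
}

// Heap growth between two process.memoryUsage() samples, in MB.
function heapGrowthMB(first, last) {
  return toMB(last.heapUsed - first.heapUsed);
}
```

Applied to the first and last samples shown (heapUsed 44526608 and 74474080), the difference is 29947472 bytes, or about 28.56 MB across five runs.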

thanks

Anti scraping systems like Cloudflare?

Is there any way to get past anti-scraping systems like Cloudflare?
I ran the code against a URL protected by Cloudflare's anti-scraping system. It returns a captcha form instead of the real content. I think that changing the request so it looks less like a robot request would make the problem go away.
The script uses the "request" library, so perhaps some additional configuration would resolve the problem.
Thanks.

Memory leak

I'm using the library in my project: https://github.com/Coornail/rss-to-full-rss

The program is a long-running process.
It is not uncommon to process hundreds of articles. When this happens, memory usage shoots up by 200 megabytes and never gets freed.

I tried upgrading to jsdom 1.0.3, as it seems to suffer from several memory issues: https://github.com/tmpvar/jsdom/issues?q=memory . Unfortunately, upgrading made the issue worse.

Garbled text with the gb2312 charset

main.js

var read = require('node-readability');

read("http://xw.qq.com/news/20141208016835", {encodeing: 'gb2312'}, function(err, article, meta) {

      console.log(article.title);

      article.close();
});

Output:

"澶琛"ヤ26浜 缇借璧瓒涓戒骞村璐

System: Mac OS 10.10.1 (14B25)
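When debugging output like this, it can help to check which charset the page actually declares, both in the Content-Type header and in its meta tags. The sketch below is a simplified illustration, not the detection logic used by node-readability; findCharset is a hypothetical name.

```javascript
// Hypothetical charset sniffing: prefer the Content-Type header, then fall
// back to a <meta> declaration in the HTML. Returns null if none is found.
function findCharset(contentTypeHeader, html) {
  var fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentTypeHeader || '');
  if (fromHeader) return fromHeader[1].toLowerCase();
  var fromMeta = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(html || '');
  return fromMeta ? fromMeta[1].toLowerCase() : null;
}
```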

Neither 2.0 nor 1.* works with Node.js v4.0

Node.js and io.js have merged into one project as Node.js v4.0. After upgrading my Node.js, neither 2.0 nor 1.* works anymore. There are some dependency module version errors, but I'm not sure which modules are affected.

js-bson: Failed to load c++ bson extension, using pure JS version
[Error: Module did not self-register.]

Not able to parse github wiki page

Like this one https://github.com/pluginaweek/state_machine/wiki/Idempotent-Callback-Registration.

It would be better to get a standard error rather than this.

 <div>
 <a href="https://github.com/pluginaweek/state_machine/wiki/Idempotent-Callback-Registration#start-of-content" tabindex="1" class="accessibility-aid js-skip-to-content">Skip to content</a>
 <!-- /.wrapper -->

  <!-- /.container -->
 <div id="ajax-error-message" class="flash flash-error"><span>
   </span><span class="octicon octicon-alert"></span><span>
   </span><a href="https://github.com/pluginaweek/state_machine/wiki/Idempotent-Callback- Registration#" class="octicon octicon-x flash-close js-ajax-error-dismiss" aria-label="Dismiss error"> </a><span>
   Something went wrong with that request. Please try again.
 </span></div>
   <script crossorigin="anonymous" src="https://assets-cdn.github.com/assets/frameworks-2d727fed4d969b14b28165c75ad12d7dddd56c0198fa70cedc3fdad7ac395b2c.js" type="text/javascript"></script>
   <script async="async" crossorigin="anonymous" src="https://assets-cdn.github.com/assets/github-f82405eac9208116886d504ad90a85513ea8de114d676a6cf7f35aaa497cb974.js" type="text/javascript"></script>

   <script crossorigin="anonymous" src="https://assets-cdn.github.com/assets/wiki-c7729cd55f83879c219fca45441d1815803f3c5ac14e93765add63b81babc501.js" type="text/javascript"></script>

Excluding Image Tags

Hey there,

Is it possible to exclude image tags when generating the getArticle() content?
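
One possible workaround, sketched here as post-processing rather than a library option, is to strip the image elements from the extracted HTML afterwards. The function name stripImages is made up for this example, and a regex like this only handles simple markup.

```javascript
// Hypothetical post-processing step: drop <img> elements from extracted HTML.
// A regex is enough for simple markup; it is not a full HTML parser.
function stripImages(html) {
  if (typeof html !== 'string') return html; // leave non-string results untouched
  return html.replace(/<img\b[^>]*>/gi, '');
}

// Usage with the callback shown in the README:
//   read(url, function (err, article) {
//     if (!err) console.log(stripImages(article.content));
//     article.close();
//   });
```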

Upgrade jsdom

Is there any reason why jsdom is still on version 0.2.x here?

Pulls in hidden text and ignores article tags

When run against stories like this one: http://www.edelements.com/what-does-educational-success-look-like-depends-on-whos-doing-the-looking/

You can quickly see how the readability system is pulling in a lot of hidden text that is not part of the article. This is particularly surprising because this page is well structured and includes a scoped <article> tag:

<article itemscope="" itemtype="http://schema.org/BlogPosting" id="post-4348" class="post-4348 post type-post status-publish format-standard has-post-thumbnail category-blog g1-complete">

Browserify issue

Hi!

Is it possible to make it work with Browserify? I can't get it to work.

My dev.js file is simple:

var read = require('node-readability');

In the console I tried:

browserify dev.js -o main.js

And I got an error:

Error: Cannot find module 'iconv' from '/node_modules/node-readability/node_modules/encoding/lib'
    at /node_modules/browserify/node_modules/resolve/lib/async.js:50:17
    at process (/node_modules/browserify/node_modules/resolve/lib/async.js:119:43)
    at /node_modules/browserify/node_modules/resolve/lib/async.js:128:21
    at load (/node_modules/browserify/node_modules/resolve/lib/async.js:60:43)
    at /node_modules/browserify/node_modules/resolve/lib/async.js:66:22
    at /node_modules/browserify/node_modules/resolve/lib/async.js:21:47
    at Object.oncomplete (fs.js:107:15)

I tried installing iconv and got a new error:

Error: Cannot find module '../build/Debug/iconv.node' from '/node_modules/iconv/lib'
    at /usr/local/lib/node_modules/browserify/node_modules/resolve/lib/async.js:42:25
    at load (/usr/local/lib/node_modules/browserify/node_modules/resolve/lib/async.js:60:43)
    at /usr/local/lib/node_modules/browserify/node_modules/resolve/lib/async.js:66:22
    at /usr/local/lib/node_modules/browserify/node_modules/resolve/lib/async.js:21:47
    at Object.oncomplete (fs.js:107:15)

Is this a Browserify issue or a node-readability issue?
Thanks in advance!

Doesn't follow all redirects for links like Twitter's

Links like http://t.co/QY6AijTazc return empty content. It works well if I use the https://github.com/mikeal/request library.

Exceeded maxRedirects with nytimes.com links

(Just leaving this here, will investigate a bit later)

Given a New York Times URL such as this:

http://www.nytimes.com/2016/07/12/technology/pokemon-go-brings-augmented-reality-to-a-mass-audience.html

The request will fail with this error:

Error: Exceeded maxRedirects. Probably stuck in a redirect loop http://www.nytimes.com/2016/07/12/technology/pokemon-go-brings-augmented-reality-to-a-mass-audience.html?_r=4

Note that nytimes.com has a convoluted server configuration and returns an HTTP status code of 303.

...you'll get the same redirection behavior with cURL:

$ curl -IL http://www.nytimes.com/2016/07/12/technology/pokemon-go-brings-augmented-reality-to-a-mass-audience.html

HTTP/1.1 303 See Other
Server: Varnish
location: https://myaccount.nytimes.com/auth/login?URI=http%3A%2F%2Fwww.nytimes.com%2F2016%2F07%2F12%2Ftechnology%2Fpokemon-go-brings-augmented-reality-to-a-mass-audience.html%3F_r%3D5&REFUSE_COOKIE_ERROR=SHOW_ERROR
Accept-Ranges: bytes
Date: Tue, 12 Jul 2016 12:12:38 GMT
Age: 0
X-API-Version: 5-0
X-PageType: article
Connection: close
X-Frame-Options: DENY
Set-Cookie: RMID=007f010123545784deb60008;Path=/; Domain=.nytimes.com;Expires=Wed, 12 Jul 2017 12:12:38 UTC

HTTP/1.1 200 OK
Date: Tue, 12 Jul 2016 12:12:41 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Set-Cookie: __cfduid=dce29bea6d432f3d2e44a8bbe3e1220aa1468325561; expires=Wed, 12-Jul-17 12:12:41 GMT; path=/; domain=.nytimes.com; HttpOnly
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache
Cneonction: close
Server: cloudflare-nginx
CF-RAY: 2c1467a827722507-ORD

Heavier HTTP clients, such as whatever wget uses by default, can deal with this, as can libraries such as Python's Requests. I'm new to Node, so I'm not sure what the best-practice route is.
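
The REFUSE_COOKIE_ERROR parameter in the location header above suggests the server wants cookies kept across the redirect chain. A possible thing to try, as a sketch only, is enabling request's cookie jar through the options that node-readability passes along. The helper name cookieFriendlyOptions is hypothetical, and this has not been verified against nytimes.com.

```javascript
// Hypothetical helper: options that ask the request library to keep cookies
// across redirects. `jar` and `maxRedirects` are request options.
function cookieFriendlyOptions(extra) {
  return Object.assign({
    jar: true,        // remember cookies set during the redirect chain
    maxRedirects: 10  // request's documented default, stated explicitly
  }, extra);
}

// Usage (needs node-readability and network access):
//   var read = require('node-readability');
//   read(url, cookieFriendlyOptions(), function (err, article, meta) {
//     if (err) return console.error(err.message);
//     console.log(article.title);
//     article.close();
//   });
```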

Status of node.js support

The readme points out that io.js is slated to replace node.js, but everything I'm reading says the opposite. Am I missing something, or will this repo switch gears back towards Node once all is resolved?

Crashes on shallow DOMs

The following test crashes with:

  1) readability handles shallow DOMs:
     Uncaught TypeError: Cannot read property 'content' of undefined
      at /Users/tlb/yc/superduper/test_readability.js:9:26
      at jsdom.env.done (/Users/tlb/node_modules/node-readability/src/readability.js:201:43)
      at /Users/tlb/node_modules/node-readability/node_modules/jsdom/lib/jsdom.js:255:9
      at process._tickCallback (node.js:419:13)

because it's looking for grandparents of <p> nodes that don't exist in a shallow DOM tree.

  var readability = require('node-readability');
  describe('readability', function() {
    it('handles shallow DOMs', function(onDone) {
      var body = '<html><p>fooksdjfls jflksjdflksj dlfkjsdlfkjsd lfkjsdfs</p></html>';
      readability(body, {}, function(readErr, article, meta) {
        console.log('content=', article.content);
        onDone();
      });
    });
  });
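
Until the library handles this case itself, calling code can protect itself by checking the callback arguments before touching article. A minimal sketch follows. The wrapper name guarded and its error message are invented for illustration.

```javascript
// Hypothetical wrapper: normalises the callback arguments so callers never
// dereference a missing article.
function guarded(callback) {
  return function (err, article, meta) {
    if (err) return callback(err);
    if (!article) return callback(new Error('No article was extracted'));
    callback(null, article, meta);
  };
}

// Usage with the test above:
//   readability(body, {}, guarded(function (readErr, article) {
//     if (readErr) return onDone(readErr);
//     console.log('content=', article.content);
//     onDone();
//   }));
```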

How easy would a client-side port of this be?

I want to implement this in a client-side app. I know the original readability script is plain browser-friendly JavaScript, but it doesn't handle titles nearly as well as this version.

What would it take to port this over to use jQuery instead of jsdom?

Uncaught TypeError with <frame>

If the page contains <frame> tags, I get the following exception:

Uncaught TypeError: Object [ jsdom NodeList ]: contains 1 items has no method 'forEach'
  at Object.module.exports.prepDocument (/home/dduponchel/projects/node-readability/src/helpers.js:41:12)
  at new Readability (/home/dduponchel/projects/node-readability/src/readability.js:23:11)
  at jsdom.env.done (/home/dduponchel/projects/node-readability/src/readability.js:203:24)
  at /home/dduponchel/projects/node-readability/node_modules/jsdom/lib/jsdom.js:255:9
  at process._tickCallback (node.js:415:13)

Reproduced with the following unit test:

it('should handle frames', function(done) { 
  read('<html><body><frame />Hello world!</body></html>', function(err, read){
    read.document.body.innerHTML.should.include('Hello world!');
    done();                                                 
  });                                                       
});
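
The error says the NodeList has no forEach method. A common way around that for any array-like collection is to borrow Array.prototype.forEach. The sketch below assumes that is what is needed at helpers.js:41, which has not been checked against the source, and the helper name forEachNode is made up.

```javascript
// Iterate over any array-like collection (for example an older jsdom
// NodeList) that has a length and indexed items but no forEach method.
function forEachNode(list, fn) {
  Array.prototype.forEach.call(list, fn);
}

// Usage, assuming `document` is a jsdom document:
//   forEachNode(document.getElementsByTagName('frame'), function (frame) {
//     console.log(frame.tagName);
//   });
```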
