quickscrape's People

Contributors

blahah, chreman, larsgw, mcs07, mec-is, noamross, petermr, tarrow


quickscrape's Issues

Error: malformed URL: -r

Very reproducible bug:

$ quickscrape -url http://ijs.sgmjournals.org/content/64/Pt_5/1802.full --loglevel verbose --scraper jscrapers/scrapers/ijsem.json  --output ./new --outformat bibjson
info: quickscrape launched with...
info: - URL: -r
info: - Scraper: jscrapers/scrapers/ijsem.json
info: - Rate limit: 3 per minute
info: - Log level: verbose
info: urls to scrape: 1
info: processing URL: -r

/home/jing/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:29
    throw e;
          ^
Error: malformed URL: -r; protocol missing (must include http(s):// or ftp(s)://), domain missing
    at Object.url.checkUrl (/home/jing/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:28:13)
    at Thresher.scrape (/home/jing/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:54:7)
    at processUrl (/home/jing/.nvm/v0.10.38/lib/node_modules/quickscrape/bin/quickscrape.js:183:7)
    at null._onTimeout (/home/jing/.nvm/v0.10.38/lib/node_modules/quickscrape/bin/quickscrape.js:206:5)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)
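A plausible explanation, sketched below under assumption (this is not quickscrape's actual parser): a getopt-style parser expands the single-dash -url into the bundled short flags -u, -r, -l, and -u (which takes a value) then consumes the next token in the expanded list, which is "-r" — hence "URL: -r".

```javascript
// Toy illustration of bundled short-flag expansion (hypothetical helper,
// not quickscrape's code): "-url" becomes "-u", "-r", "-l".
function expandBundledFlags(argv) {
  const out = [];
  for (const arg of argv) {
    if (/^-[a-z]{2,}/.test(arg) && !arg.startsWith('--')) {
      for (const ch of arg.slice(1)) out.push('-' + ch); // "-url" -> -u -r -l
    } else {
      out.push(arg);
    }
  }
  return out;
}

function parseUrlOption(argv) {
  const expanded = expandBundledFlags(argv);
  const i = expanded.indexOf('-u');
  // -u takes a value: whatever token follows it, even another flag
  return i >= 0 ? expanded[i + 1] : undefined;
}
```

With this model, parseUrlOption(['-url', 'http://example.org/']) yields '-r', matching the log above; using --url (double dash) avoids the expansion.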

Follow links

In writing a similar tool, I found that the information for a single document was sometimes spread across multiple pages, and that I needed to follow links within the page to collect all the metadata. In the scraper definition, this might look like:

  "url": "\\w+\\.\\w+",
  "follow-links":  {
      "article_info":  {
         "selector":  "//meta[@name='article_info'_url]
         "attribute": "content"
          }
       }
  "elements": {
    "funder": {
      "selector": "//span[@name='funding_source']",
      "page": "article_info"
     }
  }

The scraper would then also open the pages whose URLs are collected in "follow-links", and elements with a "page" attribute would be scraped from those pages rather than from the original URL.

The use case I've seen is some journals have metadata on a different page than the abstract/DOI landing page.
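The proposed semantics can be sketched as follows (hypothetical helper and field names; this is not quickscrape's implementation — just a model of the behaviour described above):

```javascript
// Given a scraper definition and a map of already-fetched pages
// (pages.main is the landing page, other keys come from "follow-links"),
// decide which page each element should be scraped from.
function resolveElementPages(definition, pages) {
  const results = {};
  for (const [name, spec] of Object.entries(definition.elements)) {
    // Elements with a "page" key use the followed page; others use main.
    const page = spec.page ? pages[spec.page] : pages.main;
    results[name] = { scrapeFrom: page, selector: spec.selector };
  }
  return results;
}
```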

can't scrape plos one paper

Tried to scrape via urls.txt:

workshop@crunchbang:~/workshop$ quickscrape --urllist test/urls.txt --scraperdir test/
info: quickscrape launched with...
info: - URLs from file: undefined
info: - Scraperdir: test/
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 4
info: processing URL: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0030039

TypeError: Cannot read property 'actions' of null
    at /home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:100:16
    at Request._callback (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:60:5)
    at Request.self.callback (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:360:22)
    at Request.EventEmitter.emit (events.js:98:17)
    at Request.<anonymous> (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:1202:14)
    at Request.EventEmitter.emit (events.js:117:20)
    at IncomingMessage.<anonymous> (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:1150:12)
    at IncomingMessage.EventEmitter.emit (events.js:117:20)
    at _stream_readable.js:920:16
    at process._tickCallback (node.js:415:13)

urls.txt:

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0030039
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003731
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000339
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002791

TypeError: Cannot call method 'trim' of null

Cause unknown. Possibly a bad scraper (it attempts to download the page itself for HTML, and may be following a closed HTML button; note also that the fulltext_pdf element below selects an anchor but asks for its content attribute rather than href, which could yield null).

localhost:jmir pm286$ quickscrape -u http://www.jmir.org/2015/5/e108/ -s jmir.json 
info: quickscrape launched with...
info: - URL: http://www.jmir.org/2015/5/e108/
info: - Scraper: jmir.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: http://www.jmir.org/2015/5/e108/
info: [scraper]. URL rendered. http://www.jmir.org/2015/5/e108/.

TypeError: Cannot call method 'trim' of null
    at Scraper.scrapeElement (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/scraper.js:301:19)
    at null.<anonymous> (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/scraper.js:260:15)
    at emit (events.js:98:17)
    at Request._callback (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/renderer/basic.js:16:16)
    at Request.self.callback (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:368:22)
    at Request.emit (events.js:98:17)
    at Request.<anonymous> (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:1219:14)
    at Request.emit (events.js:117:20)
    at IncomingMessage.<anonymous> (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:1167:12)
    at IncomingMessage.emit (events.js:117:20)

scraper:

{
  "url": "www\\.jmir\\.org",
  "elements": {
    "publisher": {
      "selector": "//meta[@name='citation_publisher']",
      "attribute": "content"
    },
    "journal": {
      "selector": "//meta[@name='citation_journal_title']",
      "attribute": "content"
    },
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    },
    "authors": {
      "selector": "//meta[@name='citation_author']",
      "attribute": "content"
    },
    "date": {
      "selector": "//meta[@name='citation_date']",
      "attribute": "content"
    },
    "doi": {
      "selector": "//meta[@name='citation_doi']",
      "attribute": "content"
    },
    "volume": {
      "selector": "//meta[@name='citation_volume']",
      "attribute": "content"
    },
    "issue": {
      "selector": "//meta[@name='citation_issue']",
      "attribute": "content"
    },
    "firstpage": {
      "selector": "//meta[@name='citation_firstpage']",
      "attribute": "content"
    },
    "description": {
      "selector": "//meta[@name='description']",
      "attribute": "content"
    },
    "abstract": {
      "selector": "//meta[@name='description']",
      "attribute": "content"
    },
    "fulltext_html": {
      "selector": "/",
      "download": {
        "rename": "fulltext.html"
      }
    },
    "fulltext_pdf": {
      "selector": "//a[@class='icon-pdf article-pdf']",
      "attribute": "content",
      "download": {
        "rename": "fulltext.pdf"
      }
    },
    "fulltext_xml": {
      "selector": "//a[@class='icon-xml article-xml']",
      "attribute": "href",
      "download": {
        "rename": "fulltext.xml"
      }
    },
    "supplementary_material": {
      "selector": "//link[starts-with(@title,'Additional file')]",
      "attribute": "href",
      "download": true
    },
    "figure": {
      "selector": "//div[@class='fig']/p/a/img",
      "attribute": "src",
      "download": true
    },
    "figure_caption": {
      "selector": "//div[@class='fig']//strong"
    },
    "license": {
      "selector": "//p[a/@href='http://creativecommons.org/licenses/by/4.0']"
    },
    "copyright": {
      "selector": "//p[contains(.,'licensee')]"
    }
  }
}

Failed to npm install

Linux 3.2.0-4-amd64 Debian

npm ERR! fstream_class FileWriter
npm ERR! code ENOENT
npm ERR! errno -2
npm ERR! fstream_stack /usr/local/lib/node_modules/npm/node_modules/fstream/lib/writer.js:284:26
npm ERR! fstream_stack Object.oncomplete (fs.js:97:15)
npm http GET https://registry.npmjs.org/delayed-stream/0.0.5
npm http 304 https://registry.npmjs.org/delayed-stream/0.0.5
npm http 304 https://registry.npmjs.org/duplexer
npm http 304 https://registry.npmjs.org/carrier
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /home/adam/npm-debug.log
npm ERR! not ok code 0

Relative vs. Absolute href on journal pages

I'm trying to write scrapers for ACS and Nature, and they use relative links in their pages. The scraper doesn't appear to follow these relative links.

e.g. for http://pubs.acs.org/doi/abstract/10.1021/ja409271s the full-text PDF link is <a title="Download the PDF Full Text" href="/doi/pdf/10.1021/ja409271s">; this is not followed by the scraper definition:

...
    "fulltext_pdf": {
      "selector": "//a[@title='Download the PDF Full Text']",
      "attribute": "href",
      "download": true
    },
...

Error Scraping BioMedCentral

quickscrape --url http://www.biomedcentral.com/1471-2148/14/128/abstract --scraper journal-scrapers/peerj.json --output ./dinoout
info: all dependencies installed :)
info: quickscrape launched with...
info: - URL: http://www.biomedcentral.com/1471-2148/14/128/abstract
info: - Scraper: journal-scrapers/peerj.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: http://www.biomedcentral.com/1471-2148/14/128/abstract
data: fulltext_pdf: http://www.biomedcentral.com/content/pdf/1471-2148-14-128.pdf
data: title: New clade of enigmatic early archosaurs yields insights into early pseudosuchian phylogeny and the biogeography of the archosaur radiation
data: author: Richard J Butler
data: author: Corwin Sullivan
data: author: Martín D Ezcurra
data: author: Jun Liu
data: author: Agustina Lecuona
data: author: Roland B Sookias
data: date: 2014-06-10
data: doi: 10.1186/1471-2148-14-128
data: volume: 14
data: issue: 1
data: firstpage: 128
data: description: The origin and early radiation of archosaurs and closely related taxa (Archosauriformes) during the Triassic was a critical event in the evolutionary history of tetrapods. This radiation led to the dinosaur-dominated ecosystems of the Jurassic and Cretaceous, and the high present-day archosaur diversity that includes around 10,000 bird and crocodylian species. The timing and dynamics of this evolutionary radiation are currently obscured by the poorly constrained phylogenetic positions of several key early archosauriform taxa, including several species from the Middle Triassic of Argentina (Gracilisuchus stipanicicorum) and China (Turfanosuchus dabanensis, Yonghesuchus sangbiensis). These species act as unstable ‘wildcards’ in morphological phylogenetic analyses, reducing phylogenetic resolution.
info: waiting for 1 downloads to complete in background

/usr/local/lib/node_modules/quickscrape/node_modules/jsdom/lib/jsdom/browser/utils.js:9
raise.call(this, "error", "NOT IMPLEMENTED" + (nameForErrorMessage ? ":
^
TypeError: Cannot call method 'call' of undefined
at new (/usr/local/lib/node_modules/quickscrape/node_modules/jsdom/lib/jsdom/browser/utils.js:9:13)
at Object.j [as log]
at ra (file://connect.facebook.net/en_GB/all.js#xfbml=1:78:2933)
at file://connect.facebook.net/en_GB/all.js#xfbml=1:78:3645
at file://connect.facebook.net/en_GB/all.js#xfbml=1:66:908
at Array.forEach (native)
at w (file://connect.facebook.net/en_GB/all.js#xfbml=1:28:757)
at Object.g.fire (file://connect.facebook.net/en_GB/all.js#xfbml=1:66:868)
at s (file://connect.facebook.net/en_GB/all.js#xfbml=1:124:1269)
at file://connect.facebook.net/en_GB/all.js#xfbml=1:124:1580

Process stuck on a URL

The previous issue arose from trying to diagnose this problem (the --url / -url thing wasn't the problem here). Here is my bash:

while read i ; do quickscrape  --url $i --ratelimit 20 --scraper jscrapers/scrapers/ijsem.json  --output ./ijsem --outformat bibjson | tee log.log ; done <ijsemarticles.txt

It basically hung, without a crash or exit, at around line 1,000 of the ijsemarticles URL list. I'll upload the list and link to it here in a bit.

$ tail log.log 
info: quickscrape launched with...
info: - URL: http://ijs.sgmjournals.org/content/64/Pt_5/1775.full
info: - Scraper: jscrapers/scrapers/ijsem.json
info: - Rate limit: 20 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: http://ijs.sgmjournals.org/content/64/Pt_5/1775.full
info: [scraper]. URL rendered. http://ijs.sgmjournals.org/content/64/Pt_5/1775.full.
info: [scraper]. URL rendered. http://ijs.sgmjournals.org/content/64/Pt_5/1775/suppl/DC1.

I used Ctrl-C to stop the process. It would be good to print the current time at one of the log levels so I know when it hung; I have no idea when it stopped chugging away.
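The timestamp suggestion could be as simple as prefixing each log line with an ISO-8601 time (hypothetical helper, not quickscrape's logger):

```javascript
// Prefix a log line with the current ISO-8601 timestamp, so a hung run
// shows when it stopped producing output.
function logWithTime(level, message) {
  return new Date().toISOString() + ' ' + level + ': ' + message;
}
```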

"TypeError: Cannot read property 'actions' of null" if wrong scraper used

If the wrong scraper is used quickscrape should fail gracefully:

localhost:elsevier pm286$ quickscrape -u http://www.sciencedirect.com/science/article/pii/S0031942215000965 -s ../../journal-scrapers/scrapers/nature.json 
info: quickscrape launched with...
info: - URL: http://www.sciencedirect.com/science/article/pii/S0031942215000965
info: - Scraper: ../../journal-scrapers/scrapers/nature.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: http://www.sciencedirect.com/science/article/pii/S0031942215000965

TypeError: Cannot read property 'actions' of null
    at /Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:104:16
    at Request._callback (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:60:5)
    at Request.self.callback (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:368:22)
    at Request.emit (events.js:98:17)
    at Request.<anonymous> (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:1219:14)
    at Request.emit (events.js:117:20)
    at IncomingMessage.<anonymous> (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:1167:12)
    at IncomingMessage.emit (events.js:117:20)
    at _stream_readable.js:944:16
    at process._tickCallback (node.js:448:13)
localhost:elsevier pm286$ 
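Graceful failure here presumably means checking whether any scraper matched before dereferencing the definition. A sketch with hypothetical names (not thresher's actual code):

```javascript
// Pick the first scraper whose "url" regex matches the target; return an
// error object instead of null so callers never hit ".actions" of null.
function selectScraper(scrapers, targetUrl) {
  const match = scrapers.find(s => new RegExp(s.url).test(targetUrl)) || null;
  if (!match) {
    return { error: 'no scraper definition matched URL: ' + targetUrl };
  }
  return { scraper: match };
}
```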

Multiple URLs: accumulating event listeners

With multiple URLs, the further through the list the scrape gets, the more event listeners accumulate, printing multiple messages:

fulltext.pdf
info: [scraper]. download started. fulltext.pdf.
info: [scraper]. download started. fulltext.pdf.
info: [scraper]. download started. fulltext.pdf.
info: [scraper]. download started. fulltext.pdf.
fulltext.pdf
info: [scraper]. download started. fulltext.pdf.
info: [scraper]. download started. fulltext.pdf.
info: [scraper]. download started. fulltext.pdf.
info: [scraper]. download started. fulltext.pdf.
fig-2-full.png
info: [scraper]. download started. fig-2-full.png.
info: [scraper]. download started. fig-2-full.png.
info: [scraper]. download started. fig-2-full.png.
info: [scraper]. download started. fig-2-full.png.
fulltext.html
info: [scraper]. download started. fulltext.html.
info: [scraper]. download started. fulltext.html.
info: [scraper]. download started. fulltext.html.
info: [scraper]. download started. fulltext.html.
NMR_spectra.zip
info: [scraper]. download started. NMR_spectra.zip.
info: [scraper]. download started. NMR_spectra.zip.
info: [scraper]. download started. NMR_spectra.zip.
info: [scraper]. download started. NMR_spectra.zip.
fulltext.xml
info: [scraper]. download started. fulltext.xml.
info: [scraper]. download started. fulltext.xml.
info: [scraper]. download started. fulltext.xml.
info: [scraper]. download started. fulltext.xml.
fulltext.html
info: [scraper]. download started. fulltext.html.
info: [scraper]. download started. fulltext.html.
info: [scraper]. download started. fulltext.html.

Lubuntu 15.04 install failed (?)

$ sudo npm install --global quickscrape
-
> [email protected] install /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs
> node install.js

PhantomJS detected, but wrong version 1.9.0 @ /usr/bin/phantomjs.
Downloading https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.8-linux-x86_64.tar.bz2
Saving to /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs/phantomjs/phantomjs-1.9.8-linux-x86_64.tar.bz2
Receiving...
  [========================================] 100% 0.0s
Received 12854K total.
Extracting tar contents (via spawned process)
Removing /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs/lib/phantom
Copying extracted folder /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs/phantomjs/phantomjs-1.9.8-linux-x86_64.tar.bz2-extract-1430748442293/phantomjs-1.9.8-linux-x86_64 -> /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs/lib/phantom
Writing location.js file
Done. Phantomjs binary available at /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs/lib/phantom/bin/phantomjs
/usr/bin/quickscrape -> /usr/lib/node_modules/quickscrape/bin/quickscrape.js
[email protected] /usr/lib/node_modules/quickscrape
├── [email protected]
├── [email protected] ([email protected])
├── [email protected] ([email protected], [email protected], [email protected], [email protected], [email protected])
├── [email protected] ([email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected])
├── [email protected]
└── [email protected] ([email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected])

$ quickscrape # tab auto-complete works
$ quickscrape --help
$ [no output/response]
$ phantomjs --version
1.9.0
$ nodejs --version
v0.10.33

Also tried installing after installing NVM

curl https://raw.githubusercontent.com/creationix/nvm/v0.24.1/install.sh | bash

but no dice with that either

running quickscrape on ubuntu 14.04 results in "unknown error"

vagrant@vagrant-ubuntu-trusty-32:~$ quickscrape --url https://peerj.com/articles/384 --scraper /vagrant_data/step1_quickscrape-configure/peerj.json --output peerj-384
info:    quickscrape launched with...
info:    - URL: https://peerj.com/articles/384
info:    - Scraper: /vagrant_data/step1_quickscrape-configure/peerj.json
info:    - Rate limit: 3 per minute
info:    - Log level: info
info:    urls to scrape: 1
info:    processing URL: https://peerj.com/articles/384

events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: Child terminated with non-zero exit code 127
    at Spooky.<anonymous> (/usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/spooky/lib/spooky.js:180:17)
    at ChildProcess.emit (events.js:98:17)
    at Process.ChildProcess._handle.onexit (child_process.js:809:12)

In terms of the rest of the setup - the scraper file is readable, the current directory is writeable (it's ~) and there should be plenty of space:

vagrant@vagrant-ubuntu-trusty-32:~$ ls -l /vagrant_data/step1_quickscrape-configure/peerj.json
-rwxrwx--- 1 vagrant vagrant 2218 Jul 10 08:49 /vagrant_data/step1_quickscrape-configure/peerj.json
vagrant@vagrant-ubuntu-trusty-32:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        40G  1.7G   37G   5% /
vagrant@vagrant-ubuntu-trusty-32:~$ ls -al
total 40
drwxr-xr-x 5 vagrant vagrant 4096 Jul 10 08:53 .

JSON format

Currently, output is formatted like so, a list of unnamed key-value pairs:

  {
    "publisher": "PeerJ Inc."
  },
  {
    "journal": "PeerJ"
  },
  {
    "title": "Mutation analysis of the SLC26A4, FOXI1 and KCNJ10 genes in individuals with congenital hearing loss"
  },
  {
    "authors": "Lynn M. Pique"
  },
  {
    "authors": "Marie-Luise Brennan"
  },
  {
    "authors": "Colin J. Davidson"
  }

Would it make more sense to format output like this?:

 "publisher": {"PeerJ Inc."}
 "journal": "PeerJ"
  "title": "Mutation analysis of the SLC26A4, FOXI1 and KCNJ10 genes in individuals with congenital hearing loss"
  "authors": {"Lynn M. Pique", "Marie-Luise Brennan", "Colin J. Davidson"}

I'm not referring to compressed spacing so much as structure. In the latter format we have a single mapping of keys to values; the former adds an unnecessary layer of wrapper objects that makes values harder to get out. Does output.title refer to anything in the former structure? It seems you would have to index into the list first. In the latter, output.authors should return an array of all the authors, which can be indexed directly.
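Converting the current list-of-wrappers into the proposed single object is straightforward; a sketch (hypothetical helper, merging repeated keys like "authors" into arrays):

```javascript
// Collapse [{k1: v}, {k2: v}, ...] into a single {k1: v, k2: [v, v], ...}
// object; keys that occur more than once become arrays.
function collapseResults(pairs) {
  const out = {};
  for (const pair of pairs) {
    for (const [key, value] of Object.entries(pair)) {
      if (key in out) {
        if (!Array.isArray(out[key])) out[key] = [out[key]];
        out[key].push(value);
      } else {
        out[key] = value;
      }
    }
  }
  return out;
}
```

One design wrinkle: a key seen only once stays a scalar here, so known multi-value fields (authors, figures) might better be forced to arrays always, for predictable downstream access.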

New installation on Ubuntu 14.04 gives TypeError

I've just tried to install quickscrape on a VM (Ubuntu 14.04) using the Ubuntu instructions in README.md.

Although there were warnings, I believed the installation completed successfully.
However, when I execute the example (peerj-384) I get an immediate error message:

/usr/lib/node_modules/quickscrape/bin/quickscrape.js:97
var scrapers = new ScraperBox(program.scraperdir);
^
TypeError: undefined is not a function
at Object.<anonymous> (/usr/lib/node_modules/quickscrape/bin/quickscrape.js:97:16)

which I interpret as a failure to find 'thresher', but I confess I have never installed Node.js before and have little JavaScript expertise (though I am well used to installing, configuring, and programming other stuff).

I suspect that I made some trivial mistake.
I may have done sudo bash at one point and run a step as root (with root's home rather than mine), which seems to matter to npm.

Can you spot anything from the installation, execution, npm ls below?

regards

Tim Parkinson
University of Southampton

==============Cut and Pasted from Installation Window =====================

tsp@bft:~/scraping$ sudo -H npm install --global quickscrape
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.29","npm":"1.4.14"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.29","npm":"1.4.14"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.29","npm":"1.4.14"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.29","npm":"1.4.14"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.29","npm":"1.4.14"})

[email protected] install /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/jsdom/node_modules/contextify
node-gyp rebuild

gyp WARN EACCES user "root" does not have permission to access the dev dir "/root/.node-gyp/0.10.29"
gyp WARN EACCES attempting to reinstall using temporary dev dir "/usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/jsdom/node_modules/contextify/.node-gyp"
make: Entering directory `/usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/jsdom/node_modules/contextify/build'
  CXX(target) Release/obj.target/contextify/src/contextify.o
  SOLINK_MODULE(target) Release/obj.target/contextify.node
  SOLINK_MODULE(target) Release/obj.target/contextify.node: Finished
  COPY Release/contextify.node
make: Leaving directory `/usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/jsdom/node_modules/contextify/build'

[email protected] install /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs
node install.js

Downloading https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2
Saving to /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs/phantomjs/phantomjs-1.9.7-linux-x86_64.tar.bz2
Receiving...

/usr/lib/node_modules/quickscrape/bin/quickscrape.js:97
var scrapers = new ScraperBox(program.scraperdir);
^
TypeError: undefined is not a function
at Object.<anonymous> (/usr/lib/node_modules/quickscrape/bin/quickscrape.js:97:16)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Function.Module.runMain (module.js:497:10)
at startup (node.js:119:16)
at node.js:906:3
tsp@bft:~/scraping$ sudo -H npm ls
/home/tsp/scraping
└── (empty)

tsp@bft:~/scraping$ sudo npm ls
/home/tsp/scraping
└── (empty)

tsp@bft:~/scraping$ cd
tsp@bft:~$ sudo npm ls
/home/tsp
└── (empty)

tsp@bft:~$ sudo npm ls --global
/usr/lib

same short arg for urllist and ratelimit -r

quickscrape -r fulltext_html_urls.txt -o output/ -d ../journal-scrapers/scrapers/ -r 5
(it works with -r urls.txt --ratelimit 5, but not with --urllist urls.txt -r 5)

fs.js:427
return binding.open(pathModule._makeLong(path), stringToFlags(flags), mode);
^
Error: ENOENT, no such file or directory '10'
at Object.fs.openSync (fs.js:427:18)
at Object.fs.readFileSync (fs.js:284:15)
at loadUrls (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/bin/quickscrape.js:105:17)
at Object.<anonymous> (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/bin/quickscrape.js:115:41)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Function.Module.runMain (module.js:497:10)
at startup (node.js:119:16)
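The clash could be caught mechanically before the arguments ever reach the parser. A minimal sketch (the option table below is illustrative, not quickscrape's actual one) that flags two options sharing a short flag:

```javascript
// Hypothetical sketch: scan an option table for clashing short flags so
// "-r" can't silently mean both --urllist and --ratelimit.
const options = [
  { short: '-r', long: '--urllist' },
  { short: '-r', long: '--ratelimit' },
  { short: '-o', long: '--output' },
];

function findShortFlagClashes(opts) {
  const seen = new Map();
  const clashes = [];
  for (const opt of opts) {
    if (seen.has(opt.short)) {
      clashes.push({ flag: opt.short, between: [seen.get(opt.short), opt.long] });
    } else {
      seen.set(opt.short, opt.long);
    }
  }
  return clashes;
}

console.log(findShortFlagClashes(options));
// → [ { flag: '-r', between: [ '--urllist', '--ratelimit' ] } ]
```

Running a check like this at startup would make the collision an explicit error instead of the confusing ENOENT above.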

Streamline installation process

Install should ideally be one-line on any operating system.

At the moment we're close to that but there seem to be bugs on multiple platforms. These are associated with:

  1. getting Node itself installed
  2. npm --global requiring root

I think optimally we would not require sudo at all, so that probably means:

  1. using nvm to install Node
  2. setting npm to do global installs into a bin dir in the user's directory
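Under those two constraints, the per-user setup might look like this (a sketch; the .npm-global directory name is an arbitrary choice, and the nvm install step itself is elided):

```shell
# Sketch: give npm a user-writable global prefix so `npm install --global`
# never needs sudo (the .npm-global location is illustrative).
NPM_PREFIX="$HOME/.npm-global"
mkdir -p "$NPM_PREFIX/bin"
export PATH="$NPM_PREFIX/bin:$PATH"
# once Node is installed via nvm:
#   npm config set prefix "$NPM_PREFIX"
#   npm install --global quickscrape   # installs into $NPM_PREFIX, no sudo
echo "$PATH" | cut -d: -f1   # prints the new first PATH entry
```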

Generate scraping addresses from URLs or other identifiers

Some publishers have "hidden" or "deeply nested" URLs that do not occur on the landing page. For example, Hindawi advertises:

<a href="http://downloads.hindawi.com/journals/ija/2015/426387.epub" class="full_text_epub">
                            Full-Text ePUB</a>

to download the ePUB, but there is no explicit XML link. However, the analogous:

<a href="http://downloads.hindawi.com/journals/ija/2015/426387.xml" class="full_text_xml">Full-Text XML</a> 

works.
This issue is to create a syntax for generating such addresses from information scraped from the landing page.
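One possible shape for such a rule (the source/match/replace syntax below is invented for illustration, not an existing quickscrape feature): take an element already scraped from the landing page and rewrite its extension.

```javascript
// Hypothetical derivation rule: build an unlinked download URL from one
// that is advertised on the landing page.
const rule = {
  source: 'fulltext_epub',   // scraped element to derive from
  match: /\.epub$/,          // pattern the scraped URL must end with
  replace: '.xml',           // substitution producing the hidden URL
};

function deriveUrl(scraped, rule) {
  const src = scraped[rule.source];
  if (!src || !rule.match.test(src)) return null;
  return src.replace(rule.match, rule.replace);
}

const scraped = {
  fulltext_epub: 'http://downloads.hindawi.com/journals/ija/2015/426387.epub',
};
console.log(deriveUrl(scraped, rule));
// → http://downloads.hindawi.com/journals/ija/2015/426387.xml
```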

Smart domain-based rate limiting

When provided a list of URLs, quickscrape could be smart about the order in which they are scraped in order to speed up the process: each domain would still be hit no more frequently than the allowed rate, but time spent waiting on one domain could be used to scrape another. A couple of approaches:

  • Prior to scraping, re-organize the URLs to rotate amongst domains.
  • Keep track of domains as you go. Some pseudocode:
IF (last_url.domain == current_URL.domain AND current_time < last_time + min_wait)
  skipped_URLs.add(current_URL)
  current_URL = next_URL
IF (current_URL == last_URL)
  URLs = skipped_URLs

Or something like that.

installation on ubuntu 14.04 server

I'm sure you're extremely keen to hear all about yet more installation woes :).

In Vagrant a lot of things are a bit weird (the commands don't run in a tty, for example, but that shouldn't affect anything). Also, node can't seem to access some directories, but it tries to compensate by reverting to a temporary build dir.

We'll come back to those potential problems if what's below isn't enough to resolve it.

[some permission problems and tty warnings]
[...]
make: g++: Command not found
make: *** [Release/obj.target/contextify/src/contextify.o] Error 127

I think build-essential might be missing from your install instructions - I'll confirm this and reply to this issue.

Urllist providing n+1 urls

The urllist sometimes appears to give n+1 urls to quickscrape, with the +1 url being a null value (resulting in null URL scrapes). This crashes quickscrape, with the info and error messages:

info:    processing URL: 
error:   
error:   TypeError: Cannot read property 'elements' of null
    at processUrl (/usr/local/lib/node_modules/quickscrape/bin/quickscrape.js:137:29)
    at null._onTimeout (/usr/local/lib/node_modules/quickscrape/bin/quickscrape.js:169:5)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)

I haven't figured out if it's newline encoding or something else. I'm on OS X; editing a urllist.txt in vim causes problems, but in TextEdit (after removing the trailing blank line not seen in vim) it runs OK.

FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

A recurrence of #9?

I fed it a list of ~3000 PNAS full text URLs last night and it choked after just 255.
No crash file was generated in /var/crash/.

(quickscrape 0.4.2)

info: processing URL: http://www.pnas.org/content/100/19/10866.full
info: [scraper]. URL rendered. http://www.pnas.org/content/100/19/10866.full.
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
Aborted (core dumped)
$

input URLs:
http://www.pnas.org/content/100/19/10860.full (262nd)
http://www.pnas.org/content/100/19/10866.full (263rd)
http://www.pnas.org/content/100/19/10872.full (264th)

The output folder for 10860 is created and full of files as expected; the one for 10866 is created but empty.

The other slightly worrying thing is that it progressed to the 263rd URL, but there are only 255 output folders in the output directory. Is there a built-in option to keep a logfile of the scrape?

input commands:

$ quickscrape --urllist 3000PNAS.txt --scraper journal-scrapers/scrapers/pnas.json --output cm --outformat bibjson --ratelimit 20

It's nothing to do with that particular URL: it works fine when quickscraping that one individually (I've tried).

Quickscrape halts with no error messages after 'processing URL: ...'

daniel@teaspoon:~/Dropbox/projects/content_mine$ quickscrape --url http://pubs.acs.org/doi/abs/10.1021/ci030304f --scraper /Users/daniel/Dropbox/projects/content_mine/journal-scrapers/scrapers/acs.json --output first --outformat bibjson
info: quickscrape 0.4.6 launched with...
info: - URL: http://pubs.acs.org/doi/abs/10.1021/ci030304f
info: - Scraper: /Users/daniel/Dropbox/projects/content_mine/journal-scrapers/scrapers/acs.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: http://pubs.acs.org/doi/abs/10.1021/ci030304f

After this it just stalls and I have to break out using Ctrl-C. There are no debug messages.

I'm using OS X 10.7.

Ubuntu 12.04 LTS install fail

Ubuntu doesn't come with npm pre-installed, so you should add this to the readme:

apt-get install npm

After installing the prerequisites, the install fails:

ross@ross:~/workspace/quickscrape$ sudo npm install --global casperjs quickscrape
npm http GET https://registry.npmjs.org/casperjs
npm http GET https://registry.npmjs.org/quickscrape

npm ERR! Error: failed to fetch from registry: quickscrape
npm ERR! at /usr/share/npm/lib/utils/npm-registry-client/get.js:139:12
npm ERR! at cb (/usr/share/npm/lib/utils/npm-registry-client/request.js:31:9)
npm ERR! at Request._callback (/usr/share/npm/lib/utils/npm-registry-client/request.js:136:18)
npm ERR! at Request.callback (/usr/lib/nodejs/request/main.js:119:22)
npm ERR! at Request.<anonymous> (/usr/lib/nodejs/request/main.js:212:58)
npm ERR! at Request.emit (events.js:88:20)
npm ERR! at ClientRequest.<anonymous> (/usr/lib/nodejs/request/main.js:412:12)
npm ERR! at ClientRequest.emit (events.js:67:17)
npm ERR! at HTTPParser.onIncoming (http.js:1261:11)
npm ERR! at HTTPParser.onHeadersComplete (http.js:102:31)
npm ERR! You may report this log at:
npm ERR! http://bugs.debian.org/npm
npm ERR! or use
npm ERR! reportbug --attach /home/ross/workspace/quickscrape/npm-debug.log npm
npm ERR!
npm ERR! System Linux 3.8.0-39-generic
npm ERR! command "node" "/usr/bin/npm" "install" "--global" "casperjs" "quickscrape"
npm ERR! cwd /home/ross/workspace/quickscrape
npm ERR! node -v v0.6.12
npm ERR! npm -v 1.1.4
npm ERR! message failed to fetch from registry: quickscrape
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /home/ross/workspace/quickscrape/npm-debug.log
npm not ok
ross@ross:~/workspace/quickscrape$

npm-debug.log file contains:

info it worked if it ends with ok
verbose cli [ 'node',
verbose cli '/usr/bin/npm',
verbose cli 'install',
verbose cli '--global',
verbose cli 'casperjs',
verbose cli 'quickscrape' ]
info using [email protected]
info using [email protected]
verbose config file /home/ross/.npmrc
verbose config file /usr/etc/npmrc
verbose config file /usr/share/npm/npmrc
silly exec /usr/bin/node "/usr/share/npm/bin/npm-get-uid-gid.js" "nobody" 1000
silly spawning [ '/usr/bin/node',
silly spawning [ '/usr/share/npm/bin/npm-get-uid-gid.js', 'nobody', 1000 ],
silly spawning null ]
silly output from getuid/gid {"uid":65534,"gid":1000}
silly output from getuid/gid
verbose cache add [ 'casperjs', null ]
silly cache add: name, spec, args [ undefined, 'casperjs', [ 'casperjs', null ] ]
verbose parsed url { pathname: 'casperjs', path: 'casperjs', href: 'casperjs' }
verbose cache add [ 'quickscrape', null ]
silly cache add: name, spec, args [ undefined, 'quickscrape', [ 'quickscrape', null ] ]
verbose parsed url { pathname: 'quickscrape',
verbose parsed url path: 'quickscrape',
verbose parsed url href: 'quickscrape' }
verbose addNamed [ 'casperjs', '' ]
verbose addNamed [ null, '' ]
silly name, range, hasData [ 'casperjs', '', false ]
verbose addNamed [ 'quickscrape', '' ]
verbose addNamed [ null, '' ]
silly name, range, hasData [ 'quickscrape', '', false ]
verbose raw, before any munging casperjs
verbose url resolving [ 'https://registry.npmjs.org/', './casperjs' ]
verbose url resolved https://registry.npmjs.org/casperjs
http GET https://registry.npmjs.org/casperjs
verbose raw, before any munging quickscrape
verbose url resolving [ 'https://registry.npmjs.org/', './quickscrape' ]
verbose url resolved https://registry.npmjs.org/quickscrape
http GET https://registry.npmjs.org/quickscrape
ERR! Error: failed to fetch from registry: quickscrape
ERR! at /usr/share/npm/lib/utils/npm-registry-client/get.js:139:12
ERR! at cb (/usr/share/npm/lib/utils/npm-registry-client/request.js:31:9)
ERR! at Request._callback (/usr/share/npm/lib/utils/npm-registry-client/request.js:136:18)
ERR! at Request.callback (/usr/lib/nodejs/request/main.js:119:22)
ERR! at Request.<anonymous> (/usr/lib/nodejs/request/main.js:212:58)
ERR! at Request.emit (events.js:88:20)
ERR! at ClientRequest.<anonymous> (/usr/lib/nodejs/request/main.js:412:12)
ERR! at ClientRequest.emit (events.js:67:17)
ERR! at HTTPParser.onIncoming (http.js:1261:11)
ERR! at HTTPParser.onHeadersComplete (http.js:102:31)
ERR! You may report this log at:
ERR! http://bugs.debian.org/npm
ERR! or use
ERR! reportbug --attach /home/ross/workspace/quickscrape/npm-debug.log npm
ERR!
ERR! System Linux 3.8.0-39-generic
ERR! command "node" "/usr/bin/npm" "install" "--global" "casperjs" "quickscrape"
ERR! cwd /home/ross/workspace/quickscrape
ERR! node -v v0.6.12
ERR! npm -v 1.1.4
ERR! message failed to fetch from registry: quickscrape
verbose exit [ 1, true ]

my system details:

lsb_release -a
LSB Version: core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:cxx-3.0-amd64:cxx-3.0-noarch:cxx-3.1-amd64:cxx-3.1-noarch:cxx-3.2-amd64:cxx-3.2-noarch:cxx-4.0-amd64:cxx-4.0-noarch:desktop-3.1-amd64:desktop-3.1-noarch:desktop-3.2-amd64:desktop-3.2-noarch:desktop-4.0-amd64:desktop-4.0-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.0-amd64:graphics-3.0-noarch:graphics-3.1-amd64:graphics-3.1-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-3.2-amd64:printing-3.2-noarch:printing-4.0-amd64:printing-4.0-noarch:qt4-3.1-amd64:qt4-3.1-noarch
Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise

$ node --version
v0.6.12

$ nodejs
No command 'nodejs' found, did you mean:
Command 'nodefs' from package 'noweb' (main)
nodejs: command not found

warning: possible EventEmitter memory leak detected. 11 listeners added.

I was scraping a short list of URLs (18); 10 went through, and at the 11th came this warning. quickscrape then continued to the end of the list:

(node) warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.
Trace
at growListenerTree (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/node_modules/eventemitter2/lib/eventemitter2.js:206:23)
at EventEmitter.on (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/node_modules/eventemitter2/lib/eventemitter2.js:366:24)
at Thresher.scrape (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:68:11)
at processUrl (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/bin/quickscrape.js:181:7)
at null._onTimeout (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/bin/quickscrape.js:204:5)
at Timer.listOnTimeout [as ontimeout]
(node) warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.
Trace
at growListenerTree (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/node_modules/eventemitter2/lib/eventemitter2.js:206:23)
at EventEmitter.on (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/node_modules/eventemitter2/lib/eventemitter2.js:366:24)
at Thresher.scrape (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:72:11)
at processUrl (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/bin/quickscrape.js:181:7)
at null._onTimeout (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/bin/quickscrape.js:204:5)
at Timer.listOnTimeout [as ontimeout]

Timeout hangs scraping

With Richard's example, and on a slow wifi (it may even drop), I get timeouts. The process seems to hang; here's a typical output:

okapi:quickscrape pm286$ quickscrape   --urllist urls.txt   --scraper ../journal-scrapers/molecules_figures.json
info:    quickscrape launched with...
info:    - URLs from file: undefined
info:    - Scraper: ../journal-scrapers/molecules_figures.json
info:    - Rate limit: 3 per minute
info:    - Log level: info
info:    urls to scrape: 6
info:    processing URL: http://www.mdpi.com/1420-3049/19/2/2042/htm
data:    dc.source: Molecules 2014, Vol. 19, Pages 2042-2048
data:    figure_img: file:///molecules/molecules-19-02042/article_deploy/html/images/molecules-19-02042-g001-1024.png
data:    figure_img: file:///molecules/molecules-19-02042/article_deploy/html/images/molecules-19-02042-g002-1024.png
data:    figure_caption: Figure 1. Chemical structures of compounds 1–6. Click here to enlarge figure
data:    figure_caption: Figure 2. Key HMBC and 1H-1H COSY correlations of 1 and 1a. Click here to enlarge figure
data:    fulltext_pdf: http://www.mdpi.com/1420-3049/19/2/2042/pdf
data:    fulltext_html: http://www.mdpi.com/1420-3049/19/2/2042/htm
data:    title: Coumarins from Edgeworthia chrysantha
data:    date: 2014-02-13
data:    doi: 10.3390/molecules19022042
data:    volume: 19
data:    issue: 2
data:    firstpage: 2042
data:    description: A new coumarin, edgeworic acid (1), was isolated from the flower buds of Edgeworthia chrysantha, together with the five known coumarins umbelliferone (2), 5,7-dimethoxycoumarin (3), daphnoretin (4), edgeworoside C (5), and edgeworoside A (6). Their structures were established on the basis of spectral data, particularly by the use of 1D NMR and several 2D shift-correlated NMR pulse sequences (1H-1H COSY, HSQC and HMBC), in combination with acetylation reactions.
info:    waiting for 4 downloads to complete in background
error:   file download failed: Error: read ECONNRESET
error:   file download failed: Error: read ECONNRESET
info:    waiting 20 seconds before next scrape
info:    processing URL: http://www.mdpi.com/1420-3049/19/2/2049/htm
data:    dc.source: Molecules 2014, Vol. 19, Pages 2049-2060
data:    figure_img: file:///molecules/molecules-19-02049/article_deploy/html/images/molecules-19-02049-g001-1024.png
data:    figure_img: file:///molecules/molecules-19-02049/article_deploy/html/images/molecules-19-02049-g002-1024.png
data:    figure_img: file:///molecules/molecules-19-02049/article_deploy/html/images/molecules-19-02049-g003-1024.png
data:    figure_img: file:///molecules/molecules-19-02049/article_deploy/html/images/molecules-19-02049-g004-1024.png
data:    figure_img: file:///molecules/molecules-19-02049/article_deploy/html/images/molecules-19-02049-g005-1024.png
data:    figure_caption: Figure 1. Compounds 1–8 isolated from the Indian Mast Tree Polyalthia longifolia var. pendula. Click here to enlarge figure
data:    figure_caption: Figure 2. Selected HMBC ( ) and COSY ( ) correlations of compounds 1–2. Click here to enlarge figure
data:    figure_caption: Figure 3. Selected NOESY correlations of compounds 1–2. Click here to enlarge figure
data:    figure_caption: Figure 4. Effect of 6 and 7 isolated from P. longifolia var. pedula on the expression of RAW 264.7 NO. RAW 264.7 macrophages (5 × 105/mL) were pre-treated with compounds 6 and 7, and DMSO (control) for 30 min, followed by stimulation with LPS (1 µg/mL) for 24 h. NO concentration in the culture medium was assayed by the Griess reaction. The data were expressed as the means ± S.E. from three separate experiments. Click here to enlarge figure
data:    figure_caption: Figure 5. Effect of 6 and 7 isolated from P. longifolia var. pedula on cell viability. RAW 264.7 macrophages (5 × 103/well) were treated with compounds 6 and 7, DMSO (control) in the presence or absence of LPS (1 µg/mL) for 24 h, followed by incubating with MTT reagent. After 30 min of incubation, the absorbance (A550 − A690) was measured by spectrophotometry [26]. The data were expressed as the means ± S.E. from three separate experiments. Click here to enlarge figure
data:    fulltext_pdf: http://www.mdpi.com/1420-3049/19/2/2049/pdf
data:    fulltext_html: http://www.mdpi.com/1420-3049/19/2/2049/htm
data:    title: Three New Clerodane Diterpenes from Polyalthia longifolia var. pendula
data:    date: 2014-02-13
data:    doi: 10.3390/molecules19022049
data:    volume: 19
data:    issue: 2
data:    firstpage: 2049
data:    description: Three new clerodane diterpenes, (4→2)-abeo-cleroda-2,13E-dien-2,14-dioic acid (1), (4→2)-abeo-2,13-diformyl-cleroda-2,13E-dien-14-oic acid (2), and 16(R&amp;S)- methoxycleroda-4(18),13-dien-15,16-olide (3), were isolated from the unripe fruit of Polyalthia longifolia var. pendula (Annonaceae) together with five known compounds (4–8). The structures of all isolates were determined by spectroscopic analysis. The anti-inflammatory activity of the isolates was evaluated by testing their inhibitory effect on NO production in LPS-stimulated RAW 264.7 macrophages. Among the isolated compounds, 16-hydroxycleroda-3,13-dien-15,16-olide (6) and 16-oxocleroda-3,13-dien-15-oic acid (7) showed promising NO inhibitory activity at 10 µg/mL, with 81.1% and 86.3%, inhibition, respectively.
info:    waiting for 7 downloads to complete in background
error:   file download failed: Error: read ECONNRESET
info:    waiting 20 seconds before next scrape
info:    processing URL: http://www.mdpi.com/1420-3049/19/2/2061/htm
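A hypothetical retry wrapper (not existing quickscrape behaviour) around each background download would turn a transient ECONNRESET on flaky wifi into a delayed retry instead of a lost file:

```javascript
// Sketch: retry a promise-returning download a few times on ECONNRESET,
// waiting between attempts; any other error still fails immediately.
function withRetries(task, attempts = 3, waitMs = 2000) {
  return new Promise((resolve, reject) => {
    const run = (left) => {
      task().then(resolve, (err) => {
        if (left > 1 && err.code === 'ECONNRESET') {
          setTimeout(() => run(left - 1), waitMs); // back off, then retry
        } else {
          reject(err); // out of attempts, or a non-transient error
        }
      });
    };
    run(attempts);
  });
}
```

Usage would be something like withRetries(() => download(url, outDir)), where download is whatever promise-returning fetch the downloader uses.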

Running test on OpenSuse 13.1 64bit

I get this error while running a test on a single URL:

quickscrape \
  --url https://peerj.com/articles/384 \
  --scraper journal-scrapers/scrapers/peerj.json \
  --output peerj-384 -l debug

info:    quickscrape launched with...
info:    - URL: https://peerj.com/articles/384
info:    - Scraper: journal-scrapers/scrapers/peerj.json   
info:    - Rate limit: 3 per minute
info:    - Log level: debug
info:    urls to scrape: 1


/usr/lib/node_modules/quickscrape/bin/quickscrape.js:97 
var scrapers = new ScraperBox(program.scraperdir);
             ^
TypeError: undefined is not a function
    at Object.<anonymous> (/usr/lib/node_modules/quickscrape/bin/quickscrape.js:97:16)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:901:3

The node version is v0.10.5
commander, thresher, which and winston are installed in node_modules in the global environment (/usr/lib/node_modules/quickscrape/).
I installed everything correctly via npm; am I missing some system library?
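"undefined is not a function" at new ScraperBox(...) means the installed thresher doesn't export ScraperBox, which points at a stale or mismatched thresher version rather than a missing system library. A hypothetical guard for the symptom (hasScraperBox is illustrative, not quickscrape code):

```javascript
// Sketch: check the dependency actually exports the constructor before
// using it, so the failure message points at the real problem.
function hasScraperBox(mod) {
  return typeof mod.ScraperBox === 'function';
}

console.log(hasScraperBox({}));                        // → false (stale thresher)
console.log(hasScraperBox({ ScraperBox: class {} }));  // → true
```

In practice the likely fix is reinstalling/updating so the quickscrape and thresher versions match again.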


INSTALLATION LOGS

npm http GET https://registry.npmjs.org/quickscrape
npm http 304 https://registry.npmjs.org/quickscrape
npm http GET https://registry.npmjs.org/commander
npm http GET https://registry.npmjs.org/which
npm http GET https://registry.npmjs.org/winston
npm http GET https://registry.npmjs.org/thresher
npm http 304 https://registry.npmjs.org/which
npm http 304 https://registry.npmjs.org/commander
npm http 304 https://registry.npmjs.org/thresher
npm http 304 https://registry.npmjs.org/winston
npm http GET https://registry.npmjs.org/casperjs
npm http GET https://registry.npmjs.org/download
npm http GET https://registry.npmjs.org/jsdom
npm http GET https://registry.npmjs.org/request
npm http GET https://registry.npmjs.org/phantomjs
npm http GET https://registry.npmjs.org/shelljs
npm http GET https://registry.npmjs.org/spooky
npm http GET https://registry.npmjs.org/xpath/0.0.6
npm http GET https://registry.npmjs.org/jsdom-little
npm http GET https://registry.npmjs.org/async
npm http GET https://registry.npmjs.org/colors
npm http GET https://registry.npmjs.org/cycle
npm http GET https://registry.npmjs.org/eyes
npm http GET https://registry.npmjs.org/request
npm http GET https://registry.npmjs.org/pkginfo
npm http GET https://registry.npmjs.org/stack-trace
npm http 304 https://registry.npmjs.org/request
npm http 304 https://registry.npmjs.org/phantomjs
npm http 304 https://registry.npmjs.org/jsdom
npm http 304 https://registry.npmjs.org/shelljs
npm http 304 https://registry.npmjs.org/jsdom-little
npm http 304 https://registry.npmjs.org/async
npm http 304 https://registry.npmjs.org/download
npm http 304 https://registry.npmjs.org/colors
npm http 304 https://registry.npmjs.org/spooky
npm http 304 https://registry.npmjs.org/eyes
npm http 304 https://registry.npmjs.org/xpath/0.0.6
npm http 304 https://registry.npmjs.org/request
npm http 304 https://registry.npmjs.org/casperjs
npm http 304 https://registry.npmjs.org/stack-trace
npm http GET https://registry.npmjs.org/decompress
npm http GET https://registry.npmjs.org/each-async
npm http GET https://registry.npmjs.org/get-stdin
npm http GET https://registry.npmjs.org/get-urls
npm http 304 https://registry.npmjs.org/pkginfo
npm http GET https://registry.npmjs.org/mkdirp
npm http GET https://registry.npmjs.org/nopt
npm http GET https://registry.npmjs.org/through2
npm http GET https://registry.npmjs.org/adm-zip/0.2.1
npm http GET https://registry.npmjs.org/kew
npm http GET https://registry.npmjs.org/ncp/0.4.2
npm http GET https://registry.npmjs.org/npmconf/0.0.24
npm http GET https://registry.npmjs.org/mkdirp/0.3.5
npm http GET https://registry.npmjs.org/progress
npm http GET https://registry.npmjs.org/request/2.36.0
npm http GET https://registry.npmjs.org/request-progress
npm http GET https://registry.npmjs.org/rimraf
npm http GET https://registry.npmjs.org/json-stringify-safe
npm http GET https://registry.npmjs.org/mime-types
npm http GET https://registry.npmjs.org/qs
npm http GET https://registry.npmjs.org/forever-agent
npm http GET https://registry.npmjs.org/node-uuid
npm http GET https://registry.npmjs.org/tough-cookie
npm http GET https://registry.npmjs.org/form-data
npm http GET https://registry.npmjs.org/tunnel-agent
npm http GET https://registry.npmjs.org/http-signature
npm http GET https://registry.npmjs.org/oauth-sign
npm http GET https://registry.npmjs.org/hawk/1.1.1
npm http GET https://registry.npmjs.org/aws-sign2
npm http GET https://registry.npmjs.org/underscore
npm http GET https://registry.npmjs.org/tiny-jsonrpc
npm http GET https://registry.npmjs.org/carrier
npm http GET https://registry.npmjs.org/duplexer
npm http GET https://registry.npmjs.org/readable-stream
npm http 304 https://registry.npmjs.org/cycle
npm http GET https://registry.npmjs.org/htmlparser2
npm http GET https://registry.npmjs.org/nwmatcher
npm http GET https://registry.npmjs.org/cssom
npm http GET https://registry.npmjs.org/cssstyle
npm http GET https://registry.npmjs.org/xmlhttprequest
npm http GET https://registry.npmjs.org/contextify
npm http 304 https://registry.npmjs.org/mkdirp
npm http GET https://registry.npmjs.org/form-data
npm http GET https://registry.npmjs.org/mime
npm http GET https://registry.npmjs.org/hawk
npm http GET https://registry.npmjs.org/cookie-jar
npm http GET https://registry.npmjs.org/oauth-sign
npm http GET https://registry.npmjs.org/aws-sign
npm http GET https://registry.npmjs.org/forever-agent
npm http GET https://registry.npmjs.org/tunnel-agent
npm http GET https://registry.npmjs.org/json-stringify-safe
npm http GET https://registry.npmjs.org/qs
npm http 304 https://registry.npmjs.org/nopt
npm http 304 https://registry.npmjs.org/get-urls
npm http 304 https://registry.npmjs.org/through2
npm http 304 https://registry.npmjs.org/adm-zip/0.2.1
npm http 304 https://registry.npmjs.org/kew
npm http 304 https://registry.npmjs.org/get-stdin
npm http 304 https://registry.npmjs.org/decompress
npm http 304 https://registry.npmjs.org/ncp/0.4.2
npm http 304 https://registry.npmjs.org/npmconf/0.0.24
npm http 304 https://registry.npmjs.org/mkdirp/0.3.5
npm http 304 https://registry.npmjs.org/progress
npm http 304 https://registry.npmjs.org/request/2.36.0
npm http 304 https://registry.npmjs.org/json-stringify-safe
npm http 304 https://registry.npmjs.org/request-progress
npm http 304 https://registry.npmjs.org/rimraf
npm http GET https://registry.npmjs.org/throttleit
npm http GET https://registry.npmjs.org/config-chain
npm http GET https://registry.npmjs.org/inherits
npm http GET https://registry.npmjs.org/osenv/0.0.3
npm http GET https://registry.npmjs.org/once
npm http GET https://registry.npmjs.org/semver
npm http GET https://registry.npmjs.org/ini
npm http 304 https://registry.npmjs.org/forever-agent
npm http 304 https://registry.npmjs.org/qs
npm http 304 https://registry.npmjs.org/mime-types
npm http 304 https://registry.npmjs.org/node-uuid
npm http GET https://registry.npmjs.org/mime
npm http GET https://registry.npmjs.org/hawk
npm http 304 https://registry.npmjs.org/http-signature
npm http 304 https://registry.npmjs.org/tough-cookie
npm http 304 https://registry.npmjs.org/form-data
npm http 304 https://registry.npmjs.org/tunnel-agent
npm http 304 https://registry.npmjs.org/each-async
npm http GET https://registry.npmjs.org/adm-zip
npm http GET https://registry.npmjs.org/extname
npm http GET https://registry.npmjs.org/map-key
npm http GET https://registry.npmjs.org/stream-combiner
npm http GET https://registry.npmjs.org/tar
npm http GET https://registry.npmjs.org/tempfile
npm http GET https://registry.npmjs.org/readable-stream
npm http GET https://registry.npmjs.org/xtend
npm http GET https://registry.npmjs.org/abbrev
npm http 304 https://registry.npmjs.org/oauth-sign
npm http 200 https://registry.npmjs.org/hawk/1.1.1
npm http 304 https://registry.npmjs.org/aws-sign2
npm http GET https://registry.npmjs.org/hawk/-/hawk-1.1.1.tgz
npm http 200 https://registry.npmjs.org/underscore
npm http 304 https://registry.npmjs.org/duplexer
npm http 304 https://registry.npmjs.org/carrier
npm http 304 https://registry.npmjs.org/htmlparser2
npm http 304 https://registry.npmjs.org/tiny-jsonrpc
npm http 304 https://registry.npmjs.org/cssom
npm http 304 https://registry.npmjs.org/cssstyle
npm http 304 https://registry.npmjs.org/readable-stream
npm http 200 https://registry.npmjs.org/hawk/-/hawk-1.1.1.tgz
npm http 304 https://registry.npmjs.org/form-data
npm http 304 https://registry.npmjs.org/contextify
npm http 304 https://registry.npmjs.org/xmlhttprequest
npm http GET https://registry.npmjs.org/inherits
npm http GET https://registry.npmjs.org/string_decoder
npm http GET https://registry.npmjs.org/core-util-is
npm http GET https://registry.npmjs.org/isarray/0.0.1
npm http 304 https://registry.npmjs.org/mime
npm http 304 https://registry.npmjs.org/nwmatcher
npm http GET https://registry.npmjs.org/bindings
npm http GET https://registry.npmjs.org/nan
npm http 304 https://registry.npmjs.org/cookie-jar
npm http 304 https://registry.npmjs.org/oauth-sign
npm http 304 https://registry.npmjs.org/forever-agent
npm http 304 https://registry.npmjs.org/aws-sign
npm http GET https://registry.npmjs.org/domhandler
npm http GET https://registry.npmjs.org/domutils
npm http GET https://registry.npmjs.org/domelementtype
npm http 304 https://registry.npmjs.org/tunnel-agent
npm http GET https://registry.npmjs.org/entities
npm http 304 https://registry.npmjs.org/json-stringify-safe
npm http 304 https://registry.npmjs.org/qs
npm http 304 https://registry.npmjs.org/throttleit
npm http 200 https://registry.npmjs.org/hawk
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.5","npm":"1.4.6"})
npm http 304 https://registry.npmjs.org/config-chain
npm http 304 https://registry.npmjs.org/once
npm http 304 https://registry.npmjs.org/osenv/0.0.3
npm http 304 https://registry.npmjs.org/inherits
npm http GET https://registry.npmjs.org/combined-stream
npm http 304 https://registry.npmjs.org/semver
npm http GET https://registry.npmjs.org/hoek
npm http GET https://registry.npmjs.org/boom
npm http GET https://registry.npmjs.org/cryptiles
npm http GET https://registry.npmjs.org/sntp
npm http 304 https://registry.npmjs.org/mime
npm http 304 https://registry.npmjs.org/ini
npm http 304 https://registry.npmjs.org/adm-zip
npm http GET https://registry.npmjs.org/proto-list
npm http 304 https://registry.npmjs.org/stream-combiner
npm http 304 https://registry.npmjs.org/tar
npm http 304 https://registry.npmjs.org/extname
npm http GET https://registry.npmjs.org/asn1/0.1.11
npm http GET https://registry.npmjs.org/assert-plus/0.1.2
npm http GET https://registry.npmjs.org/ctype/0.5.2
npm http 200 https://registry.npmjs.org/hawk
npm http 304 https://registry.npmjs.org/readable-stream
npm http 304 https://registry.npmjs.org/xtend
npm http 304 https://registry.npmjs.org/tempfile
npm http 304 https://registry.npmjs.org/abbrev
npm http 304 https://registry.npmjs.org/inherits
npm http 200 https://registry.npmjs.org/string_decoder
npm http GET https://registry.npmjs.org/object-keys
npm http GET https://registry.npmjs.org/punycode
npm http 304 https://registry.npmjs.org/core-util-is
npm http GET https://registry.npmjs.org/hoek
npm http GET https://registry.npmjs.org/boom
npm http GET https://registry.npmjs.org/cryptiles
npm http 304 https://registry.npmjs.org/isarray/0.0.1
npm http GET https://registry.npmjs.org/sntp
npm http 304 https://registry.npmjs.org/nan
npm http 304 https://registry.npmjs.org/bindings

[email protected] install /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/jsdom/node_modules/contextify
node-gyp rebuild

gyp WARN EACCES user "root" does not have permission to access the dev dir "/root/.node-gyp/0.10.5"
gyp WARN EACCES attempting to reinstall using temporary dev dir "/usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/jsdom/node_modules/contextify/.node-gyp"
gyp http GET http://nodejs.org/dist/v0.10.5/node-v0.10.5.tar.gz
gyp http 200 http://nodejs.org/dist/v0.10.5/node-v0.10.5.tar.gz
gyp http GET http://nodejs.org/dist/v0.10.5/SHASUMS.txt
gyp http GET http://nodejs.org/dist/v0.10.5/SHASUMS.txt
gyp http 200 http://nodejs.org/dist/v0.10.5/SHASUMS.txt
gyp http 200 http://nodejs.org/dist/v0.10.5/SHASUMS.txt
make: Entering directory `/usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/jsdom/node_modules/contextify/build'
  CXX(target) Release/obj.target/contextify/src/contextify.o
  SOLINK_MODULE(target) Release/obj.target/contextify.node
  SOLINK_MODULE(target) Release/obj.target/contextify.node: Finished
  COPY Release/contextify.node
make: Leaving directory `/usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/jsdom/node_modules/contextify/build'
npm http 304 https://registry.npmjs.org/entities
npm http 304 https://registry.npmjs.org/domhandler
npm http 304 https://registry.npmjs.org/domutils
npm http 304 https://registry.npmjs.org/combined-stream
npm http 304 https://registry.npmjs.org/map-key
npm http 304 https://registry.npmjs.org/domelementtype
npm http GET https://registry.npmjs.org/uuid
npm http GET https://registry.npmjs.org/lodash
npm http GET https://registry.npmjs.org/underscore.string
npm http GET https://registry.npmjs.org/underscore.string
npm http GET https://registry.npmjs.org/ext-list
npm http GET https://registry.npmjs.org/delayed-stream/0.0.5
npm http 200 https://registry.npmjs.org/boom
npm http 304 https://registry.npmjs.org/proto-list
npm http 200 https://registry.npmjs.org/cryptiles
npm http 200 https://registry.npmjs.org/sntp
npm http 304 https://registry.npmjs.org/asn1/0.1.11
npm http 304 https://registry.npmjs.org/assert-plus/0.1.2
npm http 200 https://registry.npmjs.org/hoek
npm http 304 https://registry.npmjs.org/ctype/0.5.2
npm http 304 https://registry.npmjs.org/object-keys
npm http 200 https://registry.npmjs.org/punycode
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.5","npm":"1.4.6"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.5","npm":"1.4.6"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.5","npm":"1.4.6"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.5","npm":"1.4.6"})
npm http GET https://registry.npmjs.org/block-stream
npm http GET https://registry.npmjs.org/fstream
npm http 304 https://registry.npmjs.org/uuid
npm http 200 https://registry.npmjs.org/hoek
npm http 200 https://registry.npmjs.org/boom
npm http 200 https://registry.npmjs.org/cryptiles
npm http 200 https://registry.npmjs.org/sntp
npm http 304 https://registry.npmjs.org/lodash
npm http 304 https://registry.npmjs.org/underscore.string
npm http 304 https://registry.npmjs.org/underscore.string
npm http 304 https://registry.npmjs.org/delayed-stream/0.0.5
npm http 304 https://registry.npmjs.org/block-stream
npm http 200 https://registry.npmjs.org/fstream
npm http GET https://registry.npmjs.org/graceful-fs
npm http 304 https://registry.npmjs.org/ext-list
npm http 304 https://registry.npmjs.org/graceful-fs
npm http GET https://registry.npmjs.org/minimist/0.0.8
npm http 304 https://registry.npmjs.org/minimist/0.0.8

[email protected] install /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs
node install.js

Downloading https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2
Saving to /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs/phantomjs/phantomjs-1.9.7-linux-x86_64.tar.bz2
Receiving...
[======================================-] 98% 0.0s
Received 12852K total.
Extracting tar contents (via spawned process)
Copying extracted folder /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs/phantomjs/phantomjs-1.9.7-linux-x86_64.tar.bz2-extract-1406977951406/phantomjs-1.9.7-linux-x86_64 -> /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs/lib/phantom
Writing location.js file
Done. Phantomjs binary available at /usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/phantomjs/lib/phantom/bin/phantomjs
/usr/bin/quickscrape -> /usr/lib/node_modules/quickscrape/bin/quickscrape.js
[email protected] /usr/lib/node_modules/quickscrape
├── [email protected]
├── [email protected]
├── [email protected] ([email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected])
└── [email protected] ([email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected])

FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

Tried scraping 90 URLs from Frontiers, but the process core dumped after ~69 of them!
Not sure how reproducible this bug is, but I'll file it anyway...

quickscrape --urllist 90.txt --scraper journal-scrapers/generic_open.json

got stuck at:
http://journal.frontiersin.org/Journal/10.3389/fphys.2013.00164/full

I can see the 'rendered.html' and 'results.json' files, but no 'full' or 'pdf', so I guess it choked when attempting to download those?

tail of terminal:

info:    waiting 20 seconds before next scrape
info:    processing URL: http://journal.frontiersin.org/Journal/10.3389/fphys.2013.00164/full
data:    fulltext_pdf: http://journal.frontiersin.org/Journal/10.3389/fphys.2013.00164/pdf
data:    fulltext_html: http://journal.frontiersin.org/Journal/10.3389/fphys.2013.00164/full
data:    title: Bat guilds, a concept to classify the highly diverse foraging and echolocation behaviors of microchiropteran bats
data:    author: Denzinger, Annette
data:    author: Schnitzler, Hans-Ulrich
data:    date: 2013
data:    doi: 10.3389/fphys.2013.00164
data:    volume: 4
data:    description: Throughout evolution the foraging and echolocation behaviors as well as the motor systems of bats have been adapted to the tasks they have to perform while searching and acquiring food. When bats exploit the same class of environmental resources in a similar way, they perform comparable tasks and thus share similar adaptations independent of their phylogeny. Species with similar adaptations are assigned to guilds or functional groups. Habitat type and foraging mode mainly determine the foraging tasks and thus the adaptations of bats. Therefore we use habitat type and foraging mode to define seven guilds. The habitat types open, edge and narrow space are defined according to the bats’ echolocation behavior in relation to the distance between bat and background or food item and background. Bats foraging in the aerial, trawling, flutter detecting, or active gleaning mode use only echolocation to acquire their food. When foraging in the passive gleaning mode bats do not use echolocation but rely on sensory cues from the food item to find it. Bat communities often comprise large numbers of species with a high diversity in foraging areas, foraging modes, and diets. The assignment of species living under similar constraints into guilds identifies pattern of community structure and helps to understand the factors that underlie the organization of highly diverse bat communities. Bat species from different guilds do not compete for food as they differ in their foraging behavior and in the environmental resources they use. However, sympatric living species belonging to the same guild often exploit the same class of resources. To avoid competition they should differ in their niche dimensions. The fine grain structure of bat communities below the rather coarse classification into guilds is determined by mechanisms that result in niche partitioning.
info:    waiting for 2 downloads to complete in background
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
Aborted (core dumped)

The 90 URLs are here (order preserved, articles that contain the word 'phylogeny'):

http://journal.frontiersin.org/Journal/10.3389/fcimb.2012.00057/full
http://journal.frontiersin.org/Journal/10.3389/fcimb.2012.00098/full
http://journal.frontiersin.org/Journal/10.3389/fcimb.2012.00133/full
http://journal.frontiersin.org/Journal/10.3389/fendo.2012.00131/full
http://journal.frontiersin.org/Journal/10.3389/fendo.2012.00173/full
http://journal.frontiersin.org/Journal/10.3389/fendo.2014.00072/full
http://journal.frontiersin.org/Journal/10.3389/fevo.2013.00001/full
http://journal.frontiersin.org/Journal/10.3389/fevo.2014.00011/full
http://journal.frontiersin.org/Journal/10.3389/fevo.2014.00012/full
http://journal.frontiersin.org/Journal/10.3389/fevo.2014.00016/full
http://journal.frontiersin.org/Journal/10.3389/fevo.2014.00026/full
http://journal.frontiersin.org/Journal/10.3389/fevo.2014.00027/full
http://journal.frontiersin.org/Journal/10.3389/fgene.2011.00053/full
http://journal.frontiersin.org/Journal/10.3389/fgene.2011.00069/full
http://journal.frontiersin.org/Journal/10.3389/fgene.2011.00072/full
http://journal.frontiersin.org/Journal/10.3389/fgene.2012.00301/full
http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00004/full
http://journal.frontiersin.org/Journal/10.3389/fimmu.2012.00024/full
http://journal.frontiersin.org/Journal/10.3389/fimmu.2012.00136/full
http://journal.frontiersin.org/Journal/10.3389/fimmu.2013.00122/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2011.00053/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2011.00063/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2011.00090/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2011.00116/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00132/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00168/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00213/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00266/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00278/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00305/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00405/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00444/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00084/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00095/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00151/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00190/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00192/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00217/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00291/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00322/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00330/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00366/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00381/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00413/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00414/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00013/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00037/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00076/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00112/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00173/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00223/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00256/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00298/full
http://journal.frontiersin.org/Journal/10.3389/fnana.2011.00007/full
http://journal.frontiersin.org/Journal/10.3389/fnana.2012.00017/full
http://journal.frontiersin.org/Journal/10.3389/fnana.2012.00050/full
http://journal.frontiersin.org/Journal/10.3389/fncel.2013.00247/full
http://journal.frontiersin.org/Journal/10.3389/fncir.2013.00178/full
http://journal.frontiersin.org/Journal/10.3389/fnevo.2011.00002/full
http://journal.frontiersin.org/Journal/10.3389/fnhum.2011.00053/full
http://journal.frontiersin.org/Journal/10.3389/fnhum.2013.00245/full
http://journal.frontiersin.org/Journal/10.3389/fnhum.2014.00345/full
http://journal.frontiersin.org/Journal/10.3389/fnins.2011.00138/full
http://journal.frontiersin.org/Journal/10.3389/fnins.2012.00118/full
http://journal.frontiersin.org/Journal/10.3389/fnmol.2011.00052/full
http://journal.frontiersin.org/Journal/10.3389/fnmol.2014.00048/full
http://journal.frontiersin.org/Journal/10.3389/fnsys.2011.00073/full
http://journal.frontiersin.org/Journal/10.3389/fphar.2012.00115/full
http://journal.frontiersin.org/Journal/10.3389/fphys.2013.00164/full
http://journal.frontiersin.org/Journal/10.3389/fphys.2013.00342/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2011.00005/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2011.00011/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2011.00110/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2012.00001/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2012.00022/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2012.00159/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2012.00227/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00250/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00261/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00327/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00367/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00377/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00386/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00547/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2014.00179/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2014.00296/full
http://journal.frontiersin.org/Journal/10.3389/fpsyg.2014.00163/full
http://journal.frontiersin.org/Journal/10.3389/fpsyg.2014.00282/full
http://journal.frontiersin.org/Journal/10.3389/fpubh.2014.00043/full
http://journal.frontiersin.org/Journal/10.3389/neuro.02.026.2009/full
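
A workaround sketch for the out-of-memory crash above (an assumption, not a confirmed fix): split the URL list into small batches and run one quickscrape process per batch, so each run starts with a fresh heap. The batch size of 10 and the stand-in urls.txt are illustrative; substitute the real 90.txt and scraper path from the report.

```shell
# Stand-in URL list (substitute the real 90.txt from the report).
printf 'http://journal.frontiersin.org/Journal/10.3389/fevo.2014.%04d/full\n' \
  $(seq 1 12) > urls.txt

# Split into batches of 10 URLs each: batch_aa, batch_ab, ...
split -l 10 urls.txt batch_

# One quickscrape run per batch; echoed here as a dry run -- drop the
# 'echo' to execute for real.
for f in batch_*; do
  echo "quickscrape --urllist $f --scraper journal-scrapers/generic_open.json"
done
```

Because each batch is a separate process, memory leaked during one batch is reclaimed by the OS before the next starts.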

OSX install failed.

Followed instructions for OSX from README but npm install threw permission errors.

Lubuntu 14.04 LTS, post-install no-response

I think it installs ok(?)

but now, after that, how do I get it to work? I tried the instructions in the README but couldn't get any response from quickscrape; even:

quickscrape --version

gives no output.
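
When `quickscrape --version` prints nothing at all, a first check (generic shell diagnostics; nothing here is quickscrape-specific) is whether the shell resolves the executable at all:

```shell
# Does the shell find a quickscrape executable anywhere on PATH?
qs=$(command -v quickscrape || true)
if [ -n "$qs" ]; then
  msg="quickscrape resolves to: $qs"
else
  msg="quickscrape is not on PATH"
fi
echo "$msg"

# If it resolves but stays silent, running the shim through node directly
# often surfaces errors the wrapper swallows (path matches the install log):
#   node /usr/lib/node_modules/quickscrape/bin/quickscrape.js --version
```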

Full install log below:

:~/workspace/quickscrape$ sudo npm install --global --unsafe-perm quickscrape
[sudo] password for ross:
npm http GET https://registry.npmjs.org/quickscrape
npm http 304 https://registry.npmjs.org/quickscrape
npm http GET https://registry.npmjs.org/jsdom
npm http GET https://registry.npmjs.org/xpath/0.0.6
npm http GET https://registry.npmjs.org/download
npm http GET https://registry.npmjs.org/winston
npm http GET https://registry.npmjs.org/which
npm http GET https://registry.npmjs.org/commander
npm http GET https://registry.npmjs.org/spooky
npm http 304 https://registry.npmjs.org/jsdom
npm http 304 https://registry.npmjs.org/xpath/0.0.6
npm http 304 https://registry.npmjs.org/which
npm http 304 https://registry.npmjs.org/download
npm http 304 https://registry.npmjs.org/winston
npm http 304 https://registry.npmjs.org/commander
npm http 304 https://registry.npmjs.org/spooky
npm http GET https://registry.npmjs.org/get-stdin
npm http GET https://registry.npmjs.org/get-urls
npm http GET https://registry.npmjs.org/nopt
npm http GET https://registry.npmjs.org/mkdirp
npm http GET https://registry.npmjs.org/through2
npm http GET https://registry.npmjs.org/request
npm http GET https://registry.npmjs.org/decompress
npm http GET https://registry.npmjs.org/each-async
npm http 304 https://registry.npmjs.org/nopt
npm http 304 https://registry.npmjs.org/request
npm http 304 https://registry.npmjs.org/get-stdin
npm http 304 https://registry.npmjs.org/mkdirp
npm http 304 https://registry.npmjs.org/decompress
npm http 304 https://registry.npmjs.org/through2
npm http 304 https://registry.npmjs.org/get-urls
npm http 304 https://registry.npmjs.org/each-async
npm http GET https://registry.npmjs.org/underscore
npm http GET https://registry.npmjs.org/async
npm http GET https://registry.npmjs.org/tiny-jsonrpc
npm http GET https://registry.npmjs.org/carrier
npm http GET https://registry.npmjs.org/duplexer
npm http GET https://registry.npmjs.org/readable-stream
npm http 304 https://registry.npmjs.org/underscore
npm http 304 https://registry.npmjs.org/async
npm http 304 https://registry.npmjs.org/duplexer
npm http 304 https://registry.npmjs.org/carrier
npm http GET https://registry.npmjs.org/readable-stream
npm http GET https://registry.npmjs.org/map-key
npm http 304 https://registry.npmjs.org/tiny-jsonrpc
npm http GET https://registry.npmjs.org/xtend
npm http GET https://registry.npmjs.org/rimraf
npm http GET https://registry.npmjs.org/tar
npm http GET https://registry.npmjs.org/tempfile
npm http GET https://registry.npmjs.org/extname
npm http GET https://registry.npmjs.org/stream-combiner
npm http 304 https://registry.npmjs.org/readable-stream
npm http GET https://registry.npmjs.org/adm-zip
npm http 304 https://registry.npmjs.org/readable-stream
npm http 304 https://registry.npmjs.org/tar
npm http GET https://registry.npmjs.org/colors
npm http GET https://registry.npmjs.org/cycle
npm http GET https://registry.npmjs.org/eyes
npm http GET https://registry.npmjs.org/pkginfo
npm http 304 https://registry.npmjs.org/tempfile
npm http GET https://registry.npmjs.org/stack-trace
npm http GET https://registry.npmjs.org/abbrev
npm http 304 https://registry.npmjs.org/map-key
npm http 304 https://registry.npmjs.org/xtend
npm http 304 https://registry.npmjs.org/rimraf
npm http GET https://registry.npmjs.org/xmlhttprequest
npm http 304 https://registry.npmjs.org/extname
npm http GET https://registry.npmjs.org/cssom
npm http GET https://registry.npmjs.org/cssstyle
npm http GET https://registry.npmjs.org/contextify
npm http GET https://registry.npmjs.org/htmlparser2
npm http GET https://registry.npmjs.org/nwmatcher
npm http GET https://registry.npmjs.org/mime
npm http GET https://registry.npmjs.org/forever-agent
npm http GET https://registry.npmjs.org/node-uuid
npm http GET https://registry.npmjs.org/tough-cookie
npm http GET https://registry.npmjs.org/form-data
npm http GET https://registry.npmjs.org/http-signature
npm http GET https://registry.npmjs.org/tunnel-agent
npm http GET https://registry.npmjs.org/oauth-sign
npm http GET https://registry.npmjs.org/hawk
npm http GET https://registry.npmjs.org/qs
npm http GET https://registry.npmjs.org/aws-sign2
npm http GET https://registry.npmjs.org/json-stringify-safe
npm http GET https://registry.npmjs.org/core-util-is
npm http GET https://registry.npmjs.org/isarray/0.0.1
npm http GET https://registry.npmjs.org/string_decoder
npm http GET https://registry.npmjs.org/inherits
npm http GET https://registry.npmjs.org/object-keys
npm http 304 https://registry.npmjs.org/adm-zip
npm http 304 https://registry.npmjs.org/colors
npm http 304 https://registry.npmjs.org/stream-combiner
npm http 304 https://registry.npmjs.org/cycle
npm http 304 https://registry.npmjs.org/eyes
npm http 304 https://registry.npmjs.org/abbrev
npm http 304 https://registry.npmjs.org/xmlhttprequest
npm http 304 https://registry.npmjs.org/stack-trace
npm http 304 https://registry.npmjs.org/cssom
npm http 304 https://registry.npmjs.org/cssstyle
npm http 304 https://registry.npmjs.org/pkginfo
npm http GET https://registry.npmjs.org/uuid
npm http GET https://registry.npmjs.org/lodash
npm http GET https://registry.npmjs.org/underscore.string
npm http GET https://registry.npmjs.org/underscore.string
npm http GET https://registry.npmjs.org/ext-list
npm http 304 https://registry.npmjs.org/htmlparser2
npm http 304 https://registry.npmjs.org/forever-agent
npm http 304 https://registry.npmjs.org/contextify
npm http 304 https://registry.npmjs.org/nwmatcher
npm http 304 https://registry.npmjs.org/mime
npm http 304 https://registry.npmjs.org/node-uuid
npm http 304 https://registry.npmjs.org/http-signature
npm http 304 https://registry.npmjs.org/tunnel-agent
npm http 304 https://registry.npmjs.org/tough-cookie
npm http 304 https://registry.npmjs.org/form-data
npm http 304 https://registry.npmjs.org/oauth-sign
npm http 304 https://registry.npmjs.org/hawk
npm http 304 https://registry.npmjs.org/qs
npm http 304 https://registry.npmjs.org/json-stringify-safe
npm http 304 https://registry.npmjs.org/string_decoder
npm http 304 https://registry.npmjs.org/aws-sign2
npm http 304 https://registry.npmjs.org/core-util-is
npm http 304 https://registry.npmjs.org/inherits
npm http 304 https://registry.npmjs.org/isarray/0.0.1
npm http 304 https://registry.npmjs.org/object-keys
npm http 304 https://registry.npmjs.org/uuid
npm http 304 https://registry.npmjs.org/underscore.string
npm http 304 https://registry.npmjs.org/underscore.string
npm http 304 https://registry.npmjs.org/lodash
npm http 304 https://registry.npmjs.org/ext-list
npm http GET https://registry.npmjs.org/combined-stream
npm http GET https://registry.npmjs.org/bindings
npm http GET https://registry.npmjs.org/nan
npm http 304 https://registry.npmjs.org/combined-stream
npm http GET https://registry.npmjs.org/ctype/0.5.2
npm http GET https://registry.npmjs.org/assert-plus/0.1.2
npm http GET https://registry.npmjs.org/asn1/0.1.11
npm http 304 https://registry.npmjs.org/bindings
npm http 304 https://registry.npmjs.org/nan
npm http 304 https://registry.npmjs.org/assert-plus/0.1.2
npm http GET https://registry.npmjs.org/fstream
npm http 304 https://registry.npmjs.org/ctype/0.5.2
npm http GET https://registry.npmjs.org/block-stream
npm http 304 https://registry.npmjs.org/asn1/0.1.11
npm http 304 https://registry.npmjs.org/block-stream
npm http 304 https://registry.npmjs.org/fstream
npm http GET https://registry.npmjs.org/delayed-stream/0.0.5

[email protected] install /usr/local/lib/node_modules/quickscrape/node_modules/jsdom/node_modules/contextify
node-gyp rebuild

npm http 304 https://registry.npmjs.org/delayed-stream/0.0.5
npm http GET https://registry.npmjs.org/cookie-jar
npm http GET https://registry.npmjs.org/aws-sign
npm http GET https://registry.npmjs.org/punycode
npm http GET https://registry.npmjs.org/entities
npm http GET https://registry.npmjs.org/domhandler
npm http GET https://registry.npmjs.org/domutils
npm http GET https://registry.npmjs.org/domelementtype
npm http 304 https://registry.npmjs.org/cookie-jar
npm http 304 https://registry.npmjs.org/aws-sign
npm http GET https://registry.npmjs.org/boom
npm http GET https://registry.npmjs.org/hoek
npm http GET https://registry.npmjs.org/sntp
npm http GET https://registry.npmjs.org/cryptiles
npm http 304 https://registry.npmjs.org/punycode
npm http 304 https://registry.npmjs.org/domutils
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.25","npm":"1.3.10"})
npm http 304 https://registry.npmjs.org/boom
npm http 304 https://registry.npmjs.org/domelementtype
npm http 304 https://registry.npmjs.org/domhandler
npm http 304 https://registry.npmjs.org/entities
npm http 304 https://registry.npmjs.org/hoek
npm http GET https://registry.npmjs.org/graceful-fs
npm http 304 https://registry.npmjs.org/sntp
npm http 304 https://registry.npmjs.org/cryptiles
npm http 304 https://registry.npmjs.org/graceful-fs
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.25","npm":"1.3.10"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.25","npm":"1.3.10"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.25","npm":"1.3.10"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.25","npm":"1.3.10"})
make: Entering directory `/usr/local/lib/node_modules/quickscrape/node_modules/jsdom/node_modules/contextify/build'
  CXX(target) Release/obj.target/contextify/src/contextify.o
  SOLINK_MODULE(target) Release/obj.target/contextify.node
  SOLINK_MODULE(target) Release/obj.target/contextify.node: Finished
  COPY Release/contextify.node
make: Leaving directory `/usr/local/lib/node_modules/quickscrape/node_modules/jsdom/node_modules/contextify/build'
/usr/local/bin/quickscrape -> /usr/local/lib/node_modules/quickscrape/bin/quickscrape.js
[email protected] /usr/local/lib/node_modules/quickscrape
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected] ([email protected], [email protected], [email protected], [email protected], [email protected], [email protected])
├── [email protected] ([email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected])
├── [email protected] ([email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected])
└── [email protected] ([email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected])

Multiple URL error

quickscrape   --urllist urls.txt   --scraperdir ~/code/journal-scrapers/scrapers/
info: quickscrape launched with...
info: - URLs from file: undefined
info: - Scraperdir: /Users/rds45/code/journal-scrapers/scrapers/
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 6
info: processing URL: https://peerj.com/articles/704/
info: [scraper]. URL rendered. https://peerj.com/articles/704/.
info: waiting 20 seconds before next scrape

TypeError: Cannot call method 'concat' of undefined
    at Object.module.exports.compose (/Users/rds45/.nvm/v0.10.24/lib/node_modules/quickscrape/lib/eventparse.js:63:15)
    at null.<anonymous> (/Users/rds45/.nvm/v0.10.24/lib/node_modules/quickscrape/bin/quickscrape.js:153:18)
    at EventEmitter.emit (/Users/rds45/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/node_modules/eventemitter2/lib/eventemitter2.js:339:22)
    at null.<anonymous> (/Users/rds45/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:69:14)
    at EventEmitter.emit (/Users/rds45/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/node_modules/eventemitter2/lib/eventemitter2.js:339:22)
    at null.cb (/Users/rds45/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/lib/scraper.js:309:15)
    at Ticker.tick (/Users/rds45/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/lib/ticker.js:32:10)
    at null.<anonymous> (/Users/rds45/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/lib/scraper.js:258:20)
    at EventEmitter.emit (events.js:98:17)
    at Request._callback (/Users/rds45/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/lib/renderer/basic.js:16:16)
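
A workaround sketch, assuming the crash is specific to the multi-URL results-composition path in eventparse.js: feed the URLs one at a time so that path is never exercised. The two-line urls.txt stand-in and the echo dry run are illustrative.

```shell
# Stand-in for the urls.txt from the report above.
printf '%s\n' 'https://peerj.com/articles/704/' \
              'https://peerj.com/articles/384/' > urls.txt

# Invoke quickscrape once per URL (echoed as a dry run; remove 'echo' to run).
out=$(while IFS= read -r url; do
  echo "quickscrape --url $url --scraperdir ~/code/journal-scrapers/scrapers/"
done < urls.txt)
echo "$out"
```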

No casperjs installation found

Not sure if the installation process is actually fully working yet...

This is on a Bath machine, Lubuntu 12.04 LTS 64-bit

ran the curl | sudo bash installation from the README,

then cloned the journal-scrapers, but the PeerJ single-article example did not work:

-SNIP-

node-gyp rebuild

gyp WARN EACCES user "root" does not have permission to access the dev dir "/home/ross/.node-gyp/0.10.28"
gyp WARN EACCES attempting to reinstall using temporary dev dir "/usr/lib/node_modules/quickscrape/node_modules/jsdom/node_modules/contextify/.node-gyp"
gyp http GET http://nodejs.org/dist/v0.10.28/node-v0.10.28.tar.gz
gyp http 200 http://nodejs.org/dist/v0.10.28/node-v0.10.28.tar.gz
gyp http GET http://nodejs.org/dist/v0.10.28/SHASUMS.txt
gyp http GET http://nodejs.org/dist/v0.10.28/SHASUMS.txt
gyp http 200 http://nodejs.org/dist/v0.10.28/SHASUMS.txt
gyp http 200 http://nodejs.org/dist/v0.10.28/SHASUMS.txt
make: Entering directory `/usr/lib/node_modules/quickscrape/node_modules/jsdom/node_modules/contextify/build'
  CXX(target) Release/obj.target/contextify/src/contextify.o
  SOLINK_MODULE(target) Release/obj.target/contextify.node
  SOLINK_MODULE(target) Release/obj.target/contextify.node: Finished
  COPY Release/contextify.node
make: Leaving directory `/usr/lib/node_modules/quickscrape/node_modules/jsdom/node_modules/contextify/build'
npm ERR! [email protected] install: node install.js
npm ERR! Exit status 8
npm ERR!
npm ERR! Failed at the [email protected] install script.
npm ERR! This is most likely a problem with the phantomjs package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node install.js
npm ERR! You can get their info via:
npm ERR! npm owner ls phantomjs
npm ERR! There is likely additional logging output above.
npm ERR! System Linux 3.2.0-63-generic
npm ERR! command "/usr/bin/node" "/usr/bin/npm" "install" "--global" "--unsafe-perms" "casperjs" "quickscrape"
npm ERR! cwd /home/ross/Documents/corpuses/quickscrape
npm ERR! node -v v0.10.28
npm ERR! npm -v 1.4.9
npm ERR! code ELIFECYCLE
/usr/bin/quickscrape -> /usr/lib/node_modules/quickscrape/bin/quickscrape.js
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /home/ross/Documents/corpuses/quickscrape/npm-debug.log
npm ERR! not ok code 0

Usage: quickscrape [options]

Options:

-h, --help              output usage information
-V, --version           output the version number
-u, --url <url>         URL to scrape
-r, --urllist <path>    path to file with list of URLs to scrape (one per line)
-s, --scraper <path>    path to scraper definition (in JSON format)
-o, --output <path>     where to output results (directory will be created if it doesn't exist
-r, --ratelimit <int>   maximum number of scrapes per minute (default 3)
-l, --loglevel <level>  amount of information to log (silent, verbose, info*, data, warn, error, or debug)
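
For reference, the options above compose into an invocation like this (the URL, scraper path, and output directory are illustrative; the command is echoed rather than executed):

```shell
# Illustrative values; substitute your own URL, scraper, and output directory.
cmd="quickscrape --url https://peerj.com/articles/384 --scraper journal-scrapers/peerj.json --output peerj-384 --loglevel debug"
echo "$cmd"
```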

quickscrape successfully installed!
ross@ross-x3:~/Documents/corpuses/quickscrape$ git clone https://github.com/ContentMine/journal-scrapers.git
Cloning into 'journal-scrapers'...
WARNING: gnome-keyring:: couldn't connect to: /tmp/keyring-SndeC9/pkcs11: No such file or directory
remote: Reusing existing pack: 35, done.
remote: Total 35 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (35/35), done.
ross@ross-x3:~/Documents/corpuses/quickscrape$ quickscrape \
--url https://peerj.com/articles/384 \
--scraper journal-scrapers/peerj.json \
--output peerj-384

/usr/lib/node_modules/quickscrape/bin/quickscrape.js:64
throw new Error(msg);
^
Error: No casperjs installation found. See installation instructions at https://github.com/ContentMine/quickscrape
at fs.readFileSync.encoding (/usr/lib/node_modules/quickscrape/bin/quickscrape.js:64:11)
at Array.forEach (native)
at Object.<anonymous> (/usr/lib/node_modules/quickscrape/bin/quickscrape.js:57:27)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Function.Module.runMain (module.js:497:10)
at startup (node.js:119:16)
at node.js:906:3
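
The install log above shows the phantomjs install script exiting with status 8 before quickscrape's dependency check runs, so the casperjs/phantomjs pair it looks for never landed. A recovery sketch (an assumption based on that log, not a verified fix) is to install them explicitly and confirm both resolve before re-running quickscrape:

```shell
# Reinstall the headless-browser pair explicitly (commented out here because
# it needs network access and root):
#   sudo npm install --global --unsafe-perm phantomjs casperjs

# Then confirm both binaries resolve before re-running quickscrape:
report=$(for tool in phantomjs casperjs; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done)
echo "$report"
```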

quickscrape fails on reinstall on MAC OSX; 'xcodebuild' error and

An older version of quickscrape failed when run against a new bmc.json file, so I decided to reinstall quickscrape using the instructions in README.md. The install went fine until the following. I assumed this was not fatal and tried to run anyway, but it crashed as below.

INSTALL>> (clipped until error)
...
npm http GET https://registry.npmjs.org/rimraf
npm http 304 https://registry.npmjs.org/rimraf
npm http 304 https://registry.npmjs.org/mkdirp

[email protected] install /usr/local/lib/node_modules/quickscrape/node_modules/thresher/node_modules/jsdom/node_modules/contextify
node-gyp rebuild

xcode-select: error: tool 'xcodebuild' requires Xcode, but active developer directory '/Library/Developer/CommandLineTools' is a command line tools instance


CXX(target) Release/obj.target/contextify/src/contextify.o
SOLINK_MODULE(target) Release/contextify.node
SOLINK_MODULE(target) Release/contextify.node: Finished
/usr/local/bin/quickscrape -> /usr/local/lib/node_modules/quickscrape/bin/quickscrape.js
[email protected] /usr/local/lib/node_modules/quickscrape
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected] ([email protected], [email protected], [email protected], [email protected], [email protected])
├── [email protected] ([email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected])
└── [email protected] ([email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected])
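
The xcode-select error earlier in the log is node-gyp complaining that only the Command Line Tools are active; the contextify build still finished here, so in this case it was indeed non-fatal. If it ever does block a build, the usual remedy (assuming a full Xcode.app at the default location) is to repoint the developer directory; printed here rather than executed so it is safe to run anywhere:

```shell
# macOS-only remedy; assumes Xcode.app is installed at the default path.
fix="sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer"
echo "$fix"
```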

RUNNING:
(bmc.json is actually a copy of mdpi.json which I intended to edit later) ...

localhost:scrapers pm286$ quickscrape -u http://www.trialsjournal.com/content/15/1/481 -s ./bmc.json
info: quickscrape launched with...
info: - URL: http://www.trialsjournal.com/content/15/1/481
info: - Scraper: ./bmc.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: http://www.trialsjournal.com/content/15/1/481

TypeError: Cannot read property 'actions' of null
at /usr/local/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:100:16
at Request._callback (/usr/local/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:89:5)
at Request.self.callback (/usr/local/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:236:22)
at Request.EventEmitter.emit (events.js:98:17)
at Request. (/usr/local/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:1142:14)
at Request.EventEmitter.emit (events.js:117:20)
at IncomingMessage. (/usr/local/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:1096:12)
at IncomingMessage.EventEmitter.emit (events.js:117:20)
at _stream_readable.js:919:16
at process._tickCallback (node.js:419:13)
localhost:scrapers pm286$

OSX install fail

from @petermr, moved from http://pads.cottagelabs.com/p/contentminescraping

okapi:quickscrape pm286$ sudo npm install --global quickscrape
Password:
npm http GET https://registry.npmjs.org/quickscrape
npm http 304 https://registry.npmjs.org/quickscrape
npm http GET https://registry.npmjs.org/xpath
npm http GET https://registry.npmjs.org/winston
npm http GET https://registry.npmjs.org/which
npm http GET https://registry.npmjs.org/sleep
npm http GET https://registry.npmjs.org/commander
npm http GET https://registry.npmjs.org/spooky
npm http GET https://registry.npmjs.org/jsdom
npm http GET https://registry.npmjs.org/download
npm http 200 https://registry.npmjs.org/which
npm http GET https://registry.npmjs.org/which/-/which-1.0.5.tgz
npm http 304 https://registry.npmjs.org/sleep
npm http 200 https://registry.npmjs.org/xpath
npm http GET https://registry.npmjs.org/xpath/-/xpath-0.0.6.tgz
npm http 200 https://registry.npmjs.org/commander
npm http 304 https://registry.npmjs.org/spooky
npm http 200 https://registry.npmjs.org/winston
npm http 304 https://registry.npmjs.org/download
npm http GET https://registry.npmjs.org/winston/-/winston-0.7.3.tgz
npm http 200 https://registry.npmjs.org/which/-/which-1.0.5.tgz
npm http 200 https://registry.npmjs.org/xpath/-/xpath-0.0.6.tgz
npm http 200 https://registry.npmjs.org/winston/-/winston-0.7.3.tgz
npm http 200 https://registry.npmjs.org/jsdom
npm http GET https://registry.npmjs.org/jsdom/-/jsdom-0.10.6.tgz
npm http 200 https://registry.npmjs.org/jsdom/-/jsdom-0.10.6.tgz
npm http GET https://registry.npmjs.org/get-stdin
npm http GET https://registry.npmjs.org/each-async
npm http GET https://registry.npmjs.org/get-urls
npm http GET https://registry.npmjs.org/decompress
npm http GET https://registry.npmjs.org/nopt
npm http GET https://registry.npmjs.org/mkdirp
npm http GET https://registry.npmjs.org/request
npm http GET https://registry.npmjs.org/through2
npm http GET https://registry.npmjs.org/mkdirp
npm http 304 https://registry.npmjs.org/nopt
npm http 304 https://registry.npmjs.org/get-urls
npm http 304 https://registry.npmjs.org/get-stdin
npm http 304 https://registry.npmjs.org/decompress
npm http 304 https://registry.npmjs.org/each-async
npm http GET https://registry.npmjs.org/request
npm http GET https://registry.npmjs.org/htmlparser2
npm http GET https://registry.npmjs.org/nwmatcher
npm http GET https://registry.npmjs.org/xmlhttprequest
npm http GET https://registry.npmjs.org/cssom
npm http GET https://registry.npmjs.org/cssstyle
npm http GET https://registry.npmjs.org/contextify
npm http GET https://registry.npmjs.org/tiny-jsonrpc
npm http GET https://registry.npmjs.org/underscore
npm http GET https://registry.npmjs.org/async
npm http GET https://registry.npmjs.org/carrier
npm http GET https://registry.npmjs.org/readable-stream
npm http GET https://registry.npmjs.org/duplexer
npm http 200 https://registry.npmjs.org/mkdirp
npm http 200 https://registry.npmjs.org/mkdirp
npm http 304 https://registry.npmjs.org/through2
npm http GET https://registry.npmjs.org/cycle
npm http GET https://registry.npmjs.org/colors
npm http GET https://registry.npmjs.org/eyes
npm http GET https://registry.npmjs.org/pkginfo
npm http GET https://registry.npmjs.org/stack-trace
npm http GET https://registry.npmjs.org/async
npm http GET https://registry.npmjs.org/request
npm http GET https://registry.npmjs.org/nwmatcher
npm http 200 https://registry.npmjs.org/request
npm http 304 https://registry.npmjs.org/htmlparser2
npm http 304 https://registry.npmjs.org/nwmatcher
npm http 304 https://registry.npmjs.org/xmlhttprequest
npm http 304 https://registry.npmjs.org/cssom
npm http 304 https://registry.npmjs.org/cssstyle
npm http 304 https://registry.npmjs.org/contextify
npm http 304 https://registry.npmjs.org/tiny-jsonrpc
npm http 304 https://registry.npmjs.org/carrier

> [email protected] install /usr/local/lib/node_modules/quickscrape/node_modules/sleep
> node build.js || nodejs build.js

gyp WARN EACCES user "root" does not have permission to access the dev dir "/Users/pm286/.node-gyp/0.10.28"
gyp WARN EACCES attempting to reinstall using temporary dev dir "/usr/local/lib/node_modules/quickscrape/node_modules/sleep/.node-gyp"
gyp http GET http://nodejs.org/dist/v0.10.28/node-v0.10.28.tar.gz
gyp http 200 http://nodejs.org/dist/v0.10.28/node-v0.10.28.tar.gz
gyp WARN install got an error, rolling back install
gyp http GET http://nodejs.org/dist/v0.10.28/SHASUMS.txt
gyp http GET http://nodejs.org/dist/v0.10.28/SHASUMS.txt
gyp WARN install got an error, rolling back install
gyp ERR! configure error 
gyp ERR! stack Error: unexpected eof
gyp ERR! stack     at decorate (/usr/local/lib/node_modules/npm/node_modules/fstream/lib/abstract.js:67:36)
gyp ERR! stack     at Extract.Abstract.error (/usr/local/lib/node_modules/npm/node_modules/fstream/lib/abstract.js:61:12)
gyp ERR! stack     at Extract._streamEnd (/usr/local/lib/node_modules/npm/node_modules/tar/lib/extract.js:75:22)
gyp ERR! stack     at BlockStream.<anonymous> (/usr/local/lib/node_modules/npm/node_modules/tar/lib/parse.js:50:8)
gyp ERR! stack     at BlockStream.EventEmitter.emit (events.js:92:17)
gyp ERR! stack     at BlockStream._emitChunk (/usr/local/lib/node_modules/npm/node_modules/block-stream/block-stream.js:203:10)
gyp ERR! stack     at BlockStream.resume (/usr/local/lib/node_modules/npm/node_modules/block-stream/block-stream.js:58:15)
gyp ERR! stack     at Extract.Reader.resume (/usr/local/lib/node_modules/npm/node_modules/fstream/lib/reader.js:253:34)
gyp ERR! stack     at DirWriter.ondrain (stream.js:61:14)
gyp ERR! stack     at DirWriter.EventEmitter.emit (events.js:92:17)
gyp ERR! System Darwin 13.2.0
gyp ERR! command "node" "/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js" "rebuild" "--global" "quickscrape"
gyp ERR! cwd /usr/local/lib/node_modules/quickscrape/node_modules/sleep
gyp ERR! node -v v0.10.28
gyp ERR! node-gyp -v v0.13.0
gyp ERR! not ok 
[sleep]: Error: Failed to execute 'node-gyp rebuild --global quickscrape' (1)
sh: nodejs: command not found
npm http 200 https://registry.npmjs.org/underscore
npm http 304 https://registry.npmjs.org/readable-stream
npm http 304 https://registry.npmjs.org/duplexer
npm http GET https://registry.npmjs.org/adm-zip
npm http GET https://registry.npmjs.org/extname
npm http GET https://registry.npmjs.org/map-key
npm http GET https://registry.npmjs.org/rimraf
npm http GET https://registry.npmjs.org/stream-combiner
npm http GET https://registry.npmjs.org/tempfile
npm http GET https://registry.npmjs.org/tar
npm http GET https://registry.npmjs.org/xtend
npm http GET https://registry.npmjs.org/abbrev
npm http GET https://registry.npmjs.org/mime
npm http GET https://registry.npmjs.org/node-uuid
npm http GET https://registry.npmjs.org/forever-agent
npm http GET https://registry.npmjs.org/tunnel-agent
npm http GET https://registry.npmjs.org/tough-cookie
npm http GET https://registry.npmjs.org/form-data
npm http GET https://registry.npmjs.org/oauth-sign
npm http GET https://registry.npmjs.org/http-signature
npm http GET https://registry.npmjs.org/hawk
npm http GET https://registry.npmjs.org/aws-sign2
npm http GET https://registry.npmjs.org/json-stringify-safe
npm http GET https://registry.npmjs.org/qs
npm http 200 https://registry.npmjs.org/colors
npm http GET https://registry.npmjs.org/colors/-/colors-0.6.2.tgz
npm http 200 https://registry.npmjs.org/cycle
npm http GET https://registry.npmjs.org/cycle/-/cycle-1.0.3.tgz
npm http 200 https://registry.npmjs.org/eyes
npm http GET https://registry.npmjs.org/eyes/-/eyes-0.1.8.tgz
npm http 200 https://registry.npmjs.org/async
npm http 200 https://registry.npmjs.org/request
npm http 200 https://registry.npmjs.org/pkginfo
npm http GET https://registry.npmjs.org/pkginfo/-/pkginfo-0.3.0.tgz
npm http 200 https://registry.npmjs.org/stack-trace
npm http GET https://registry.npmjs.org/stack-trace/-/stack-trace-0.0.9.tgz
npm http 200 https://registry.npmjs.org/cycle/-/cycle-1.0.3.tgz
npm http GET https://registry.npmjs.org/bindings
npm http GET https://registry.npmjs.org/nan
npm http 200 https://registry.npmjs.org/eyes/-/eyes-0.1.8.tgz
npm http GET https://registry.npmjs.org/string_decoder
npm http GET https://registry.npmjs.org/core-util-is
npm http GET https://registry.npmjs.org/isarray
npm http GET https://registry.npmjs.org/inherits
npm http 200 https://registry.npmjs.org/colors/-/colors-0.6.2.tgz
npm http 304 https://registry.npmjs.org/extname
npm http 304 https://registry.npmjs.org/adm-zip
npm http 304 https://registry.npmjs.org/nwmatcher
npm http 304 https://registry.npmjs.org/rimraf
npm http 200 https://registry.npmjs.org/stream-combiner
npm http 304 https://registry.npmjs.org/map-key
npm http 200 https://registry.npmjs.org/pkginfo/-/pkginfo-0.3.0.tgz
npm http GET https://registry.npmjs.org/domutils
npm http GET https://registry.npmjs.org/domelementtype
npm http GET https://registry.npmjs.org/domhandler
npm http GET https://registry.npmjs.org/entities
npm http 304 https://registry.npmjs.org/tempfile
npm http 304 https://registry.npmjs.org/xtend
npm http 200 https://registry.npmjs.org/request
npm http GET https://registry.npmjs.org/request/-/request-2.16.6.tgz
npm http 200 https://registry.npmjs.org/stack-trace/-/stack-trace-0.0.9.tgz
npm http GET https://registry.npmjs.org/object-keys
npm http 304 https://registry.npmjs.org/abbrev
npm http 200 https://registry.npmjs.org/async
npm http 200 https://registry.npmjs.org/tar
npm http 200 https://registry.npmjs.org/mime
npm http 200 https://registry.npmjs.org/node-uuid
npm http 304 https://registry.npmjs.org/forever-agent
npm http 304 https://registry.npmjs.org/tunnel-agent
npm http 304 https://registry.npmjs.org/tough-cookie
npm http 304 https://registry.npmjs.org/form-data
npm http GET https://registry.npmjs.org/uuid
npm http 200 https://registry.npmjs.org/request/-/request-2.16.6.tgz
npm http GET https://registry.npmjs.org/lodash
npm http GET https://registry.npmjs.org/underscore.string
npm http 304 https://registry.npmjs.org/oauth-sign
npm http 304 https://registry.npmjs.org/http-signature
npm http 304 https://registry.npmjs.org/aws-sign2
npm http 304 https://registry.npmjs.org/hawk
npm http 304 https://registry.npmjs.org/json-stringify-safe
npm http GET https://registry.npmjs.org/ext-list
npm http GET https://registry.npmjs.org/underscore.string
npm http 304 https://registry.npmjs.org/qs
npm http 304 https://registry.npmjs.org/core-util-is
npm http 304 https://registry.npmjs.org/string_decoder
npm http 304 https://registry.npmjs.org/bindings
npm http 200 https://registry.npmjs.org/inherits
npm http 304 https://registry.npmjs.org/domutils
npm http 304 https://registry.npmjs.org/domelementtype
npm http 304 https://registry.npmjs.org/isarray
npm http 200 https://registry.npmjs.org/nan
npm http 304 https://registry.npmjs.org/entities
npm http 304 https://registry.npmjs.org/domhandler
npm http 304 https://registry.npmjs.org/underscore.string
npm http 304 https://registry.npmjs.org/underscore.string
npm http 304 https://registry.npmjs.org/object-keys
npm http 200 https://registry.npmjs.org/uuid
npm http 304 https://registry.npmjs.org/ext-list
npm http 200 https://registry.npmjs.org/lodash
npm http GET https://registry.npmjs.org/combined-stream
npm http GET https://registry.npmjs.org/assert-plus
npm http GET https://registry.npmjs.org/asn1
npm http GET https://registry.npmjs.org/ctype
npm http GET https://registry.npmjs.org/block-stream
npm http GET https://registry.npmjs.org/fstream
npm http GET https://registry.npmjs.org/punycode
npm http GET https://registry.npmjs.org/cryptiles
npm http GET https://registry.npmjs.org/hoek
npm http GET https://registry.npmjs.org/boom
npm http GET https://registry.npmjs.org/sntp
npm http GET https://registry.npmjs.org/cookie-jar
npm http GET https://registry.npmjs.org/aws-sign
npm http GET https://registry.npmjs.org/form-data/-/form-data-0.0.10.tgz
npm http GET https://registry.npmjs.org/hawk/-/hawk-0.10.2.tgz
npm http GET https://registry.npmjs.org/oauth-sign/-/oauth-sign-0.2.0.tgz
npm http GET https://registry.npmjs.org/forever-agent/-/forever-agent-0.2.0.tgz
npm http GET https://registry.npmjs.org/tunnel-agent/-/tunnel-agent-0.2.0.tgz
npm http GET https://registry.npmjs.org/json-stringify-safe/-/json-stringify-safe-3.0.0.tgz
npm http GET https://registry.npmjs.org/qs/-/qs-0.5.6.tgz
npm http 304 https://registry.npmjs.org/assert-plus
npm http 200 https://registry.npmjs.org/hawk/-/hawk-0.10.2.tgz
npm http 200 https://registry.npmjs.org/forever-agent/-/forever-agent-0.2.0.tgz
npm http 200 https://registry.npmjs.org/form-data/-/form-data-0.0.10.tgz
npm http 304 https://registry.npmjs.org/fstream
npm http 304 https://registry.npmjs.org/block-stream
npm http 200 https://registry.npmjs.org/oauth-sign/-/oauth-sign-0.2.0.tgz
npm http 200 https://registry.npmjs.org/json-stringify-safe/-/json-stringify-safe-3.0.0.tgz
npm http 200 https://registry.npmjs.org/qs/-/qs-0.5.6.tgz
npm http GET https://registry.npmjs.org/graceful-fs
npm http 200 https://registry.npmjs.org/tunnel-agent/-/tunnel-agent-0.2.0.tgz
npm http 304 https://registry.npmjs.org/asn1
npm http 304 https://registry.npmjs.org/punycode
npm http 304 https://registry.npmjs.org/ctype
npm http 304 https://registry.npmjs.org/cryptiles
npm http 200 https://registry.npmjs.org/cookie-jar
npm http GET https://registry.npmjs.org/cookie-jar/-/cookie-jar-0.2.0.tgz
npm http 304 https://registry.npmjs.org/sntp
npm http 304 https://registry.npmjs.org/boom
npm http 304 https://registry.npmjs.org/graceful-fs
npm http 200 https://registry.npmjs.org/aws-sign
npm http GET https://registry.npmjs.org/aws-sign/-/aws-sign-0.2.0.tgz
npm http 200 https://registry.npmjs.org/cookie-jar/-/cookie-jar-0.2.0.tgz
npm http 200 https://registry.npmjs.org/aws-sign/-/aws-sign-0.2.0.tgz
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.28","npm":"1.4.9"})
npm http 200 https://registry.npmjs.org/hoek
npm http GET https://registry.npmjs.org/hoek/-/hoek-0.7.6.tgz
npm http GET https://registry.npmjs.org/boom/-/boom-0.3.8.tgz
npm http GET https://registry.npmjs.org/cryptiles/-/cryptiles-0.1.3.tgz
npm http GET https://registry.npmjs.org/sntp/-/sntp-0.1.4.tgz
npm http 200 https://registry.npmjs.org/sntp/-/sntp-0.1.4.tgz
npm http 200 https://registry.npmjs.org/hoek/-/hoek-0.7.6.tgz
npm http 200 https://registry.npmjs.org/boom/-/boom-0.3.8.tgz
npm http 304 https://registry.npmjs.org/combined-stream
npm http 200 https://registry.npmjs.org/cryptiles/-/cryptiles-0.1.3.tgz
npm http GET https://registry.npmjs.org/delayed-stream
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.28","npm":"1.4.9"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.28","npm":"1.4.9"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.28","npm":"1.4.9"})
npm WARN engine [email protected]: wanted: {"node":"0.8.x"} (current: {"node":"v0.10.28","npm":"1.4.9"})
npm http 304 https://registry.npmjs.org/delayed-stream

> [email protected] install /usr/local/lib/node_modules/quickscrape/node_modules/jsdom/node_modules/contextify
> node-gyp rebuild

gyp ERR! UNCAUGHT EXCEPTION 
gyp ERR! stack Error: ENOENT, no such file or directory
gyp ERR! stack     at process.cwd (/usr/local/lib/node_modules/npm/node_modules/graceful-fs/polyfills.js:8:19)
gyp ERR! stack     at Object.exports.resolve (path.js:309:52)
gyp ERR! stack     at configure (/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/configure.js:26:23)
gyp ERR! stack     at Object.self.commands.(anonymous function) [as configure] (/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/node-gyp.js:66:37)
gyp ERR! stack     at run (/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js:72:30)
gyp ERR! stack     at process._tickCallback (node.js:419:13)
gyp ERR! System Darwin 13.2.0
gyp ERR! command "node" "/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js" "rebuild"

/usr/local/lib/node_modules/npm/node_modules/graceful-fs/polyfills.js:8
    cwd = origCwd.call(process)
                  ^
Error: ENOENT, no such file or directory
    at process.cwd (/usr/local/lib/node_modules/npm/node_modules/graceful-fs/polyfills.js:8:19)
    at errorMessage (/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js:119:28)
    at issueMessage (/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js:125:3)
    at process.<anonymous> (/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js:109:3)
    at process.EventEmitter.emit (events.js:95:17)
    at process._fatalException (node.js:272:26)

> [email protected] install /usr/local/lib/node_modules/quickscrape/node_modules/jsdom-xpath/node_modules/jsdom/node_modules/contextify
> node-gyp rebuild



node.js:815
    var cwd = process.cwd();
                      ^
Error: ENOENT, no such file or directory
    at Function.startup.resolveArgv0 (node.js:815:23)
    at startup (node.js:58:13)
    at node.js:906:3
npm ERR! [email protected] install: `node build.js || nodejs build.js`
npm ERR! Exit status 127
npm ERR! 
npm ERR! Failed at the [email protected] install script.
npm ERR! This is most likely a problem with the sleep package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR!     node build.js || nodejs build.js
npm ERR! You can get their info via:
npm ERR!     npm owner ls sleep
npm ERR! There is likely additional logging output above.
npm ERR! System Darwin 13.2.0
npm ERR! command "node" "/usr/local/bin/npm" "install" "--global" "quickscrape"
npm ERR! cwd /Users/pm286/workspace/quickscrape
npm ERR! node -v v0.10.28
npm ERR! npm -v 1.4.9
npm ERR! code ELIFECYCLE
npm ERR! 
npm ERR! Additional logging details can be found in:
npm ERR!     /Users/pm286/workspace/quickscrape/npm-debug.log
npm ERR! not ok code 0
okapi:quickscrape pm286$ 

Is there a scraper for Elsevier

Did you create a scraper for Elsevier (I know the lazy-load JavaScript needs addressing)? I seem to remember you created at least a prototype. We shall need one.

Generalized post-processing

Some fields may require post-processing beyond what a regex can do, which would be defined in the scraper. Date fields are a good example, since they may come in various formats depending on the journal. One way to generalize this would be like so:

"date": {
      "selector": "//meta[@name='citation_date']",
      "attribute": "content"
      "post-processor": {
           "name:"  "dateconvert"
           "arguments:" {"format": "%B %d, %Y"}
       }
}

dateconvert.js could be placed in the working directory, the scrapers directory, or a dedicated processors directory, and would be called as dateconvert(element, arguments).
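To make the idea concrete, here is a minimal sketch of what such a dateconvert.js might look like. Everything in it is hypothetical: it assumes quickscrape would load the module by the "name" field and call it as dateconvert(element, arguments), it only handles the %B/%d/%Y tokens from the example above (in that order), and a real implementation would more likely delegate to a date library such as moment.js.

```javascript
// Hypothetical post-processor module: converts a human-readable date string
// into ISO 8601 (YYYY-MM-DD), driven by a strftime-style format from the
// scraper's "arguments" object. Only %B (full month name), %d (day) and
// %Y (4-digit year) are supported, and capture groups are assumed to appear
// in that order in the format string.
var MONTHS = ['January', 'February', 'March', 'April', 'May', 'June',
              'July', 'August', 'September', 'October', 'November', 'December'];

function dateconvert(element, args) {
  // Turn the format string into a capturing regex, one group per token.
  var pattern = args.format
    .replace(/%B/, '(' + MONTHS.join('|') + ')')
    .replace(/%d/, '(\\d{1,2})')
    .replace(/%Y/, '(\\d{4})');
  var m = new RegExp('^' + pattern + '$').exec(element.trim());
  if (!m) {
    return element; // leave the field untouched if it doesn't match
  }
  var month = MONTHS.indexOf(m[1]) + 1;
  var pad = function (n) { return (n < 10 ? '0' : '') + n; };
  return m[3] + '-' + pad(month) + '-' + pad(Number(m[2]));
}

module.exports = dateconvert;
```

With the scraper definition above, an extracted value like "March 5, 2014" would be normalized to "2014-03-05" before being written out.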

This would allow the flexibility of user-defined functions without going all Zotero.
