Rcrawler's Issues

Download database

Hi,

This is a very interesting package you have written here, but I can't quite get the hang of it yet.
I want to download the data stored in an online database that unfortunately doesn't have a download function itself. Therefore, I would like to use your package.

The database can be accessed via http://gepris.dfg.de/gepris/OCTOPUS?language=en
As a result, I would like to have a dataframe containing the structured data the database contains - either for the complete database or for specific keywords one can filter by.

Can you help me with this issue? That would be great, thanks!

Rcrawler use when there is a proxy

First: impressive work here, good job. I use the httr and curl packages to pull data from Smartsheet. Where I work we are behind a proxy server, so I have to use use_proxy() and feed the result into my GET() call from httr. The issue is that I want to crawl a web page with various links and cannot find a way to incorporate the proxy into the Rcrawler() function. I tried, without success, calls like Rcrawler(Website = paste("www.website.com", config_proxy), no_cores = 4, no_conn = 4), where config_proxy is the object created with use_proxy(). Is there a specific way I can pass the proxy information to the Rcrawler function? I tried your example with various ways of incorporating the config_proxy variable, but with no success.
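For reference, a minimal sketch of the single-request approach described above (the proxy host and port are made-up placeholders); whether Rcrawler() itself can accept such a config object is exactly the open question in this issue:

library(httr)

# Hypothetical corporate proxy; replace host and port with your own.
config_proxy <- use_proxy(url = "http://proxy.example.com", port = 8080)

# This works for a single request through httr, but there is no documented way
# in this thread to hand the same object to Rcrawler().
resp <- GET("https://www.example.com", config_proxy)
status_code(resp)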

Issues with encoding: texts in downloaded HTML file are garbled

I tried the example in the tutorial and found that the text in the downloaded HTML is garbled (I opened it with Chrome in UTF-8 encoding): garbled text (the left side is the downloaded version, the right side the online version).

I tried switching the system locale from Chinese to English, but that doesn't help either.

The encoding should be recognized correctly:

Id | Url                    | Stats    | Level | OUT | IN | Http Resp | Content Type | Encoding | Accuracy
1  | http://www.glofile.com | finished | 0     | 13  | 1  | 200       | text/html    | UTF-8    |
#Doesn't work
> Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"

#Also doesn't work
> Sys.setlocale("LC_ALL","English")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> devtools::session_info()
Session info ------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.4 (2018-03-15)
 system   x86_64, mingw32             
 ui       RStudio (1.1.383)           
 language (EN)                        
 collate  English_United States.1252  
 tz       America/Los_Angeles         
 date     2018-04-28                  

Packages ----------------------------------------------------------------------------------------------------------------
 package    * version    date       source                             
 base       * 3.4.4      2018-03-15 local                              
 clipr        0.4.0      2017-11-03 CRAN (R 3.4.2)                     
 codetools    0.2-15     2016-10-05 CRAN (R 3.4.4)                     
 compiler     3.4.4      2018-03-15 local                              
 curl         3.1        2017-12-12 CRAN (R 3.4.3)                     
 data.table   1.10.4-3   2017-10-27 CRAN (R 3.4.3)                     
 datasets   * 3.4.4      2018-03-15 local                              
 devtools     1.13.4     2017-11-09 CRAN (R 3.4.3)                     
 digest       0.6.14     2018-01-14 CRAN (R 3.4.3)                     
 doParallel   1.0.11     2017-09-28 CRAN (R 3.4.3)                     
 foreach      1.4.4      2017-12-12 CRAN (R 3.4.3)                     
 graphics   * 3.4.4      2018-03-15 local                              
 grDevices  * 3.4.4      2018-03-15 local                              
 httr         1.3.1      2017-08-20 CRAN (R 3.4.1)                     
 iterators    1.0.9      2017-12-12 CRAN (R 3.4.3)                     
 magrittr     1.5        2014-11-22 CRAN (R 3.4.1)                     
 memoise      1.1.0      2017-04-21 CRAN (R 3.4.1)                     
 methods    * 3.4.4      2018-03-15 local                              
 parallel     3.4.4      2018-03-15 local                              
 purrr        0.2.4      2017-10-18 CRAN (R 3.4.2)                     
 R6           2.2.2      2017-06-17 CRAN (R 3.4.1)                     
 Rcpp         0.12.15    2018-01-20 CRAN (R 3.4.3)                     
 Rcrawler   * 0.1.7-0    2017-11-01 CRAN (R 3.4.4)                     
 rlang        0.1.6      2017-12-21 CRAN (R 3.4.3)                     
 rstudioapi   0.7.0-9000 2018-01-17 Github (rstudio/rstudioapi@109e593)
 selectr      0.3-1      2016-12-19 CRAN (R 3.4.1)                     
 stats      * 3.4.4      2018-03-15 local                              
 stringi      1.1.6      2017-11-17 CRAN (R 3.4.2)                     
 stringr      1.2.0      2017-02-18 CRAN (R 3.4.2)                     
 tools        3.4.4      2018-03-15 local                              
 utils      * 3.4.4      2018-03-15 local                              
 withr        2.1.1      2017-12-19 CRAN (R 3.4.3)                     
 XML          3.98-1.9   2017-06-19 CRAN (R 3.4.1)                     
 xml2         1.1.1      2017-01-24 CRAN (R 3.4.1)                     
 yaml         2.1.16     2017-12-12 CRAN (R 3.4.3)    

Arabic characters are not rendered correctly

Hi Salim,

I was testing Rcrawler with 'http://www.emiratesmarsmission.ae/ar//' and I found that all Arabic characters in the saved HTML files are turned into Unicode escape sequences (<U+XXXX>).

Here is an example:

<U+0628><U+0639><U+062F><U+0627><U+0644><U+0627><U+0646><U+062A><U+0647><U+0627><U+0621><U+0645><U+0646><U+0639><U+0645><U+0644><U+064A><U+0629><U+062A><U+0635><U+0646><U+064A><U+0639><U+0623><U+062F><U+0648><U+0627><U+062A><U+0627><U+0644><U+0645><U+062C><U+0633><U+0645><U+0627><U+0644><U+0647><U+0646><U+062F><U+0633><U+064A><U+0644><U+0644><U+0642><U+0645><U+0631><U+0627><U+0644><U+0627><U+0635><U+0637><U+0646><U+0627><U+0639><U+064A><U+064A><U+0642><U+0648><U+0645><U+0645><U+0647><U+0646><U+062F><U+0633><U+0636><U+0645><U+0627><U+0646><U+0627><U+0644><U+062C><U+0648><U+062F><U+0629><U+0628><U+0645><U+0639><U+0627><U+064A><U+0646><U+0629><U+0627><U+0644><U+0623><U+062F><U+0648><U+0627><U+062A><U+0627><U+0644><U+062A><U+064A><U+062A><U+0645><U+062A><U+0635><U+0646><U+064A><U+0639><U+0647><U+0627> <U+064A><U+0642><U+0648><U+0645><U+0645><U+0647><U+0646><U+062F><U+0633><U+0627><U+0644><U+0645><U+0631><U+0643><U+0632><U+0628><U+062A><U+0635><U+0646><U+064A><U+0639><U+0623><U+062F><U+0648><U+0627><U+062A><U+0627><U+0644><U+0645><U+062C><U+0633><U+0645><U+0627><U+0644><U+0647><U+0646><U+062F><U+0633><U+064A><U+0644><U+0644><U+0642><U+0645><U+0631><U+0627><U+0644><U+0627><U+0635><U+0637><U+0646><U+0627><U+0639><U+064A>

Any idea how to fix this so that the Arabic text is rendered normally?

Thanks,
Mohamed Zeid
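One thing worth trying, offered only as a hedged suggestion based on how the Encod argument is used elsewhere in this thread (not a confirmed fix): check which encoding the package detects for the site and force UTF-8 explicitly.

library(Rcrawler)

# See what encoding Rcrawler's helper detects for the site.
Getencoding("http://www.emiratesmarsmission.ae/ar/")

# Force UTF-8 for the whole crawl.
Rcrawler(Website = "http://www.emiratesmarsmission.ae/ar/",
         no_cores = 4, no_conn = 4, Encod = "UTF-8")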

Feature: Render javascript using splashr

I found that by simply changing a single line in LinkExtractor.R, replacing its read_html() call with render_html() from the splashr package, one can apparently crawl sites that require JavaScript, too.
Especially interesting is the combination with this Docker image, which also makes Tor crawls an option:
https://github.com/TeamHG-Memex/aquarium

It would be nice to see this as an option in a future version. Even better would be combining the framework with the interactivity options provided by RSelenium, though I guess that would mean larger changes. Anyway, as far as I can see this is the most advanced Scrapy competitor in the R language, and it would be nice to see it keep growing. It is much better than rvest.
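A rough sketch of the substitution being proposed (not Rcrawler's actual code): fetch a JavaScript-rendered page with splashr instead of a plain HTML parse, assuming a Splash instance is already running locally, for example via the TeamHG-Memex/aquarium docker-compose setup.

library(splashr)
library(xml2)

sp <- splash("localhost")                # connect to the local Splash server
if (splash_active(sp)) {
  page  <- render_html(sp, url = "https://example.com")  # JavaScript-rendered DOM
  hrefs <- xml_attr(xml_find_all(page, "//a"), "href")   # extract the links
  print(head(hrefs))
}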

ExtractCSSPat issue on search results

Dear Salim;
thank you so much for such great efforts and useful package!

I have an issue with crawling and data gathering on search result pages.

In ExtractCSSPat I specified a few CSS rules, but some of the pages don't contain the required data, so the CSS selectors match nothing on them.

The following error then occurs:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1, 0
In addition: Warning messages:
1: In UseMethod("xml_remove") :
closing unused connection 7 (<-DESKTOP-K73RC4R:11502)
2: In UseMethod("xml_remove") :
closing unused connection 6 (<-DESKTOP-K73RC4R:11502)
3: In UseMethod("xml_remove") :
closing unused connection 5 (<-DESKTOP-K73RC4R:11502)
4: In UseMethod("xml_remove") :
closing unused connection 4 (<-DESKTOP-K73RC4R:11502)
5: In UseMethod("xml_remove") :
closing unused connection 3 (C:/Users/Hamed/Documents/tripadvisor.com-281027/extracted_data.csv)

I thought it would be possible to add a conditional statement that checks whether the CSS selector matches anything and, if not, returns NULL in the data set.

bypass or initiate javascript loaded menus

Thank you so much for putting this together!
I would like to scrape a page that has a menu of categories, but the crawler stops at the "show all" or "show more" buttons that load the remaining content of the menu. Is there a workaround solution to this?

Vbrowser Argument Syntax

Line 516 of the code:

"cat("browser:"+i+" port:"+pkg.env$Lbrowsers[[i]]$process$port)"

returns non-numeric argument to binary operator error.

It should use "," instead of "+". I.e.

"cat("browser:",i," port:",pkg.env$Lbrowsers[[i]]$process$port)"

In Func Rcrawler, "Error: object 'LinkExtractor' not found"

When calling function Rcrawler::Rcrawler() on any website, if I don't run library(Rcrawler) prior, I get this error:

Rcrawler::Rcrawler("https://www.google.com")
#> Error in get(name, envir = envir) : object 'LinkExtractor' not found

If I simply call library(Rcrawler) first, everything works as expected; attaching the package is the only way I have found to avoid the error.

The source of the error message is this line in file Rcrawlerp.R:

clusterExport(cl, c("LinkExtractor","LinkNormalization"))

If Rcrawler isn't attached, clusterExport() cannot find the two package functions, because it looks them up in the global environment rather than in the package namespace.
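A possible fix, offered here as an assumption rather than the maintainer's actual patch: export the functions from the package's own namespace instead of relying on the caller's global environment, so they are found even when Rcrawler is not attached.

# Inside Rcrawlerp.R, something along these lines should resolve the lookup:
clusterExport(cl, c("LinkExtractor", "LinkNormalization"),
              envir = asNamespace("Rcrawler"))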

I'm running R 3.4.2 on Windows 10. I'm getting the error in both the CRAN version and the dev version from GitHub.

Here's a complete workflow that causes the error for me, followed by session_info:

> # Start with a fresh install from GitHub.
> devtools::install_github("salimk/Rcrawler")
Downloading GitHub repo salimk/Rcrawler@master
from URL https://api.github.com/repos/salimk/Rcrawler/zipball/master
Installing Rcrawler

* installing *source* package 'Rcrawler' ...
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (Rcrawler)
> 
> 
> # Attempt to crawl a website.
> Rcrawler::Rcrawler("https://www.google.com")
Error in get(name, envir = envir) : object 'LinkExtractor' not found
> 
> 
> # Print session info.
> devtools::session_info()
Session info --------------------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.2 (2017-09-28)
 system   x86_64, mingw32             
 ui       RStudio (1.1.383)           
 language (EN)                        
 collate  English_United States.1252  
 tz       America/Chicago             
 date     2017-11-25                  

Packages ------------------------------------------------------------------------------------------------------------------------------------------------
 package    * version  date       source                          
 base       * 3.4.2    2017-09-28 local                           
 codetools    0.2-15   2016-10-05 CRAN (R 3.4.2)                  
 compiler     3.4.2    2017-09-28 local                           
 curl         3.0      2017-10-06 CRAN (R 3.4.2)                  
 data.table   1.10.4-3 2017-10-27 CRAN (R 3.4.2)                  
 datasets   * 3.4.2    2017-09-28 local                           
 devtools     1.13.4   2017-11-09 CRAN (R 3.4.2)                  
 digest       0.6.12   2017-01-27 CRAN (R 3.4.1)                  
 doParallel   1.0.11   2017-09-28 CRAN (R 3.4.2)                  
 foreach      1.4.3    2015-10-13 CRAN (R 3.4.2)                  
 git2r        0.19.0   2017-07-19 CRAN (R 3.4.1)                  
 graphics   * 3.4.2    2017-09-28 local                           
 grDevices  * 3.4.2    2017-09-28 local                           
 httr         1.3.1    2017-08-20 CRAN (R 3.4.1)                  
 iterators    1.0.8    2015-10-13 CRAN (R 3.4.1)                  
 memoise      1.1.0    2017-04-21 CRAN (R 3.4.1)                  
 methods    * 3.4.2    2017-09-28 local                           
 parallel     3.4.2    2017-09-28 local                           
 R6           2.2.2    2017-06-17 CRAN (R 3.4.2)                  
 Rcpp         0.12.13  2017-09-28 CRAN (R 3.4.2)                  
 Rcrawler     0.1.5    2017-11-26 Github (salimk/Rcrawler@db76deb)
 stats      * 3.4.2    2017-09-28 local                           
 tools        3.4.2    2017-09-28 local                           
 utils      * 3.4.2    2017-09-28 local                           
 withr        2.1.0    2017-11-01 CRAN (R 3.4.2)                  
 xml2         1.1.1    2017-01-24 CRAN (R 3.4.1)                  
 yaml         2.1.14   2016-11-12 CRAN (R 3.4.1)

Possible features: "stop when found" function and "ignore all parameters"

Hello! I love the package! Extremely powerful!

I thought about two possible features that could come in handy while reading about the current functions:

  1. I think it would be nice to implement a "stop" procedure so that the crawler does not continue once a certain target (or a certain number of targets?) is found.

For example, I might want to scrape a set of websites with unknown structures, knowing that each of them has a specific .pdf file somewhere. If I am only interested in these files from each of the websites, it would be much faster for me if I could tell the robot to stop searching a particular website once it finds the file.

  2. Would it be possible to ignore all URL parameters, rather than only the ones specifically listed? The algorithm could match anything between "?" and "=" or between "&" and "=", or am I missing something? (See the sketch below.)
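A minimal illustration of the idea in point 2 (just a sketch, not Rcrawler code): drop every query parameter from a URL, whatever its name.

strip_all_params <- function(url) sub("[?#].*$", "", url)
strip_all_params("http://example.com/page?PHPSESSID=abc&sort=date")
# [1] "http://example.com/page"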

Thanks and best!

Rcrawler aborts when crawling and scraping

Hi,

I would like to crawl and scrape the content of a whole website. This is the code:

Rcrawler(Website = URL, no_cores = 4, no_conn = 4, ExtractXpathPat = c("//./div[@class='bodytext']//p", "//./h1[@class='blogtitle']", "//./div[@id='kommentare']//p"), PatternsNames = c("article", "title", "comments"), ManyPerPattern = TRUE)

After retrieving approx. 19% of the data I get the following error message:

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1, 0

It always happens at the same point, DATA and INDEX are created correctly with all entries crawled until the error message.

Am I doing something wrong or is it something with the website I would like to crawl? I am using Rcrawler version 0.1.9-1.

Thanks for helping me out!

Crawl a list of URLs not possible?

Is it somehow possible to crawl a list of URLs?

I tried mapply like in this example:
mapply(dataset['url'], FUN=Rcrawler)

But it throws this error:

Error in (function (Website, no_cores, no_conn, MaxDepth, DIR, RequestsDelay = 0, :
object 'getNewM' not found

P.S.: I know it is possible to scrape a list of URLs with ContentScraper, but I would like to crawl a fairly long list of different domains with the Rcrawler function.
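A hedged workaround in the meantime: attach the package with library(Rcrawler) first (the "object 'getNewM' not found" error looks like the same namespace problem reported in the "object 'LinkExtractor' not found" issue above) and crawl each domain in turn with a plain loop, so every call gets its own project folder.

library(Rcrawler)

for (u in dataset$url) {
  Rcrawler(Website = u, no_cores = 2, no_conn = 2)
}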

Retrieve all hyperlinks

Thank you for the convenient package to crawl web page data.
I was especially interested in getting the hrefs from a web site. In the readme.md I found that it is possible to pass an argument, ExtractPatterns = c("//*/a/@href"), to the Rcrawler function, which should do the job.
Unfortunately, though, this argument seems to have been removed.

I am usually not the impatient type, but could you tell me what would be the easiest way to get all hrefs from a web page under the current Rcrawler version? Using the LinkExtractor, for instance, does not do it by default.

Thanks for the support!
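A hedged pointer rather than a definitive answer: judging from the LinkExtractor() output shown in the "LinkExtractor gives an empty result" issue further down, the second element of the returned list appears to hold the extracted links, so something like this may already do the job.

page  <- LinkExtractor(url = "http://www.glofile.com")
hrefs <- page[[2]]   # assumed position of the link vector in this version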

Rcrawler skips some links for no reason

I used Rcrawler to scrape this site: "http://www.thegreenbook.com/". I set it to crawl only one level, using a CSS extraction pattern and a URL regular-expression filter, but it ignores some links for no apparent reason.

I used rvest and stringr to double-check and found that 7 links are omitted.

Below is the code I used for double checking results.

library(Rcrawler)
library(rvest)
library(stringr)
library(dplyr)

url <- "http://www.thegreenbook.com/"
css = "#classificationIndex a"
filter_string = "products/search"
#using Rcrawler



Rcrawler(Website = url, 
         no_cores = 4, 
         no_conn = 4,
         ExtractCSSPat = c(css),
         MaxDepth = 1,
         urlregexfilter = c(filter_string))


length_Rcrawler <- nrow(INDEX[INDEX$Level==1,])

#using rvest  -----------------------------------------------------
#getting hrefs using the same css
hrefs <- html_session(url) %>% 
    html_nodes(css) %>% 
    html_attr("href")

hrefs_filtered <- hrefs[str_detect(hrefs,filter_string)] # filters as using `urlregexfilter`

length_rvest<- length(hrefs_filtered)

The numbers of links retrieved using Rcrawler and rvest are:

> length_Rcrawler
[1] 28
> length_rvest
[1] 35

Below are the links that Rcrawler omitted:

> setdiff(hrefs_filtered,INDEX[INDEX$Level==1,]$Url)
[1] "http://www.thegreenbook.com/products/search/electrical-guides/"               
[2] "http://www.thegreenbook.com/products/search/pharmaceutical-guides/"           
[3] "http://www.thegreenbook.com/products/search/office-equipment-supplies-guides/"
[4] "http://www.thegreenbook.com/products/search/garment-textile-guides/"          
[5] "http://www.thegreenbook.com/products/search/pregnancy-parenting-guides/"      
[6] "http://www.thegreenbook.com/products/search/beauty-care-guides"               
[7] "http://www.thegreenbook.com/products/search/golden-year-guides/"   

I don't know what could possibly cause this issue, as the response codes are all 200 and the Stats are all "finished". Also, ExtractCSSPat and urlregexfilter are correct, as I have double-checked them with rvest. So my conclusion is that these links are simply ignored.

Did I do something wrong while using the Rcrawler or is it a bug? Any help is appreciated, thanks!

Normalization of Relative Links

Thanks a lot for all the work!

Several websites include links with relative references (e.g., "page-1.html" instead of "http://domain.com/page-1.html"). The LinkNormalization function works fine for absolute links but fails to correctly normalize relative ones. Could you please extend the function so that it recognizes relative links and, where necessary, prepends not only the protocol but also the base URL?

Best wishes,
Michael
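As a point of reference for the requested behaviour, xml2 already resolves relative references against a base URL; a minimal illustration (not Rcrawler's implementation):

library(xml2)

url_absolute("page-1.html", "http://domain.com/")
# [1] "http://domain.com/page-1.html"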

package install/load fail

Hello,

macOS 10.12.5
R 3.4.0
RStudio 1.0.143
Java 8 Update 131
I installed with install.packages() but get the following error message when I try to load the package with library(). The same happens with the development version:
+++

library(Rcrawler)
Error: package or namespace load failed for ‘Rcrawler’:
.onLoad failed in loadNamespace() for 'rJava', details:
call: dyn.load(file, DLLpath = DLLpath, ...)
error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so':
dlopen(/Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: @rpath/libjvm.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so
Reason: image not found

+++
Any help appreciated. Thank you.

Error installation

Hello,

I'm a new user of R and RStudio and I'm very interested in your web crawler. Thanks for it.

But I have a problem when I try to install the package.

> system("java -version")
java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-468-11M4833)
Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-468, mixed mode)

Downloading GitHub repo salimk/Rcrawler@master
from URL https://api.github.com/repos/salimk/Rcrawler/zipball/master
Installing Rcrawler
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  '/private/var/folders/jz/3njpl5xx15zc_xrsgpn53z_00000gn/T/Rtmpz4mtT1/devtools38c076f3b618/salimk-Rcrawler-fb92537'  \
  --library='/Users/damiencosta/Library/R/3.4/library' --install-tests 

* installing *source* package ‘Rcrawler’ ...
** R
** inst
** preparing package for lazy loading
Error : .onLoad failed in loadNamespace() for 'rJava', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so':
  dlopen(/Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: @rpath/libjvm.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so
  Reason: image not found
ERROR: lazy loading failed for package ‘Rcrawler’
* removing ‘/Users/damiencosta/Library/R/3.4/library/Rcrawler’
Installation failed: Command failed (1)

Can someone help, please?

Does it download the .pdf files as well?

Hi,

I was scraping data from a website when I realized that it downloads .html files only. I couldn't find any information on whether it also downloads PDF files from the website.
Can you please clarify this?

Regards,
Jatin Gupta

Rcrawler does not return any results

I ran Rcrawler on the sample example; however, it did not return any results.

Can someone let me know what I am missing?

Rcrawler(Website = "http://www.glofile.com", no_cores = 4, no_conn = 4)
In process : 1..
Progress: 100.00 % : 1 parssed from 1 | Collected pages: 0 | Level: 1

  • Check INDEX dataframe variable to see crawling details
  • Collected web pages are stored in Project folder
  • Project folder name : glofile.com-050550

Too slow with keywordfilter option

How can I download pages faster when specifying a KeywordsFilter value in Rcrawler()?
For example:

Rcrawler(Website = "http://www.salvex.com/listings/index.cfm?catID=1280&regID=0&mmID=0&orderBy=1&order=0&filterWithin=&f",
         Timeout = 7,
         Encod = Getencoding("http://www.salvex.com/listings/index.cfm?catID=1280&regID=0&mmID=0&orderBy=1&order=0&filterWithin=&f"),
         no_cores = 2,
         KeywordsFilter = c("A250","aeroplane","AH-1" ,"AH-64" ,"aircraft","airframe","airplane","Allison","AS365 Dauphin","auction","aviation","Aviation Fueling Directory","Aviation Museums","Avionics","Bell 205 ","Bell 206 ","Bell 212","Bell 214","Bell 412","blades","Blades","Boeing","C20B","CH-47","CH-47 ","CH-53 ","Chinook","Cobra","driveshaft","engine","Eurocopter AS350","FLIR unit","fuel cell","Fuel Control","fuselage","gearbox","Ground Support Equipment (GSE)","Helicopter","hub assembly","Huey","Hughs","J85 ","Jet Ranger","JT8 ","JT9 ","Kamon","Kiowa","Long Ranger","LTS101 ","Lycoming","M250","main rotor blades","MR Blade","main rotor hub","MR Hub","MD500 ","OH-58","OH-58","OH-6","Pratt and Whitney","PT-6","PT6","Rolls Royce","servos","Sikorsky","skids","surplus","swashplate assembly","T53","T53-L-13B","T53-L-703","T55","T56 ","T58","T63","T63 ","T700 ","tail rotor blades","TR Blades","tail rotor hub","TR Hub","tailboom","transmission","turbine","Turboprop","UH1","UH-1","UH-1H","UH-60","UH60 ","vertical stabilizer","wire strike kit"),
         no_conn = 2,
         MaxDepth = 1)

Installation on Windows 10 64 bit error No CurrentVersion entry in Software/JavaSoft registry! Try re-installing Java

When I try to install I get this error.

> install.packages("Rcrawler")
Installing package into 'W:/R-3.4._/R_LIBS_USER_3.4._'
(as 'lib' is unspecified)
installing the source package 'Rcrawler'

trying URL 'http://cran.rstudio.com/src/contrib/Rcrawler_0.1.tar.gz'
Content type 'application/x-gzip' length 20944 bytes (20 KB)
downloaded 20 KB

* installing *source* package 'Rcrawler' ...
** package 'Rcrawler' successfully unpacked and MD5 sums checked
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
Error: package or namespace load failed for 'Rcrawler':
 .onLoad failed in loadNamespace() for 'rJava', details:
  call: fun(libname, pkgname)
  error: No CurrentVersion entry in Software/JavaSoft registry! Try re-installing Java and make sure R and Java have matching architectures.
Error: loading failed
Execution halted
*** arch - x64
ERROR: loading failed for 'i386'
* removing 'W:/R-3.4._/R_LIBS_USER_3.4._/Rcrawler'

The downloaded source packages are in
        'W:\R-3.4._\R_USER_3.4.__R_STUDIO\AppData\Local\Temp\RtmpOGEDmM\downloaded_packages'
Warning messages:
1: running command '"W:/R-3.4._/App/R-Portable/bin/x64/R" CMD INSTALL -l "W:\R-3.4._\R_LIBS_USER_3.4._" W:\R-3.4._\R_USER_3.4.__R_STUDIO\AppData\Local\Temp\RtmpOGEDmM/downloaded_packages/Rcrawler_0.1.tar.gz' had status 1
2: In install.packages("Rcrawler") :
  installation of package 'Rcrawler' had non-zero exit status

But rJava loads fine

> library(rJava)

I tried running the installation manually

shell("start cmd /k", wait = FALSE)

W:\R-3.4._>"W:/R-3.4._/App/R-Portable/bin/x64/R" CMD INSTALL -l "W:\R-3.4._\R_LIBS_USER_3.4._" W:\R-3.4._\R_USER_3.4.__R_STUDIO\AppData\Local\Temp\RtmpSojYP9/downloaded_packages/Rcrawler_0.1.tar.gz'
Warning: invalid package 'W:\R-3.4._\R_USER_3.4.__R_STUDIO\AppData\Local\Temp\RtmpOGEDmM/downloaded_packages/Rcrawler_0.1.tar.gz''
Error: ERROR: no packages specified

I checked the contents of the following file (which does exist):
W:\R-3.4._\R_USER_3.4.__R_STUDIO\AppData\Local\Temp\RtmpOGEDmM/downloaded_packages/Rcrawler_0.1.tar.gz

Rcrawler_0.1.tar

Perhaps the contents are not correct? Was the .tar.gz made with "R CMD"?


> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.0

I can't ignore URL parameters joined by ";"

When I execute this code

Rcrawler("http://www.tirodefensivoperu.com/forum/", 4, 4,
         urlregexfilter = "?topic",
         ExtractXpathPat = c("//*[(@id = 'top_subject')]", "//div[@class='inner']"),
         PatternsNames = c("title", "post"),
         ManyPerPattern = TRUE,
         ignoreUrlParams = c("PHPSESSID", "prev_next")
         )

URLs of this type are still scraped:

https://www.tirodefensivoperu.com/forum/index.php?topic=11796.0;prev_next=prev

Is there a way to specify the joining character?
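A hedged post-hoc workaround (not a documented Rcrawler option): strip the ";"-joined parameters from the collected URLs after the crawl, before any further processing.

INDEX$Url <- sub(";.*$", "", INDEX$Url)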

Small bug when using Vbrowser

Preparing browser process Error in "browser:" + i : non-numeric argument to binary operator

Starting at line 514

      for(i in 1:no_cores){
        pkg.env$Lbrowsers[[i]]<-run_browser()
        cat("browser:"+i+" port:"+pkg.env$Lbrowsers[[i]]$process$port)
        Sys.sleep(1)
        cat(".")
        flush.console()
      }

crawlUrlfilter

Thank you for developing this very useful package. However, I have a problem with the crawlUrlfilter argument.
From a large website, I would like to crawl and scrape only those URLs that match a specific pattern. According to the documentation, the crawlUrlfilter does exactly what I am looking for.

When the pattern passed to crawlUrlfilter contains only one level of the URL, like in the following code
Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/")

I get the desired results, i.e. only those URLs that match the pattern "article", e.g.

https://www.somewebsite.org/article/sample-article-217 or
https://www.somewebsite.org/article/2019-01-20-another-example

However, when I want to filter URLs based on a pattern of two levels of the URL, such as:

https://www.somewebsite.org/article/news/january-2019-meeting_of_trainers or
https://www.somewebsite.org/article/news/review-of-meetup

the following code does not find any matches:

Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/news")

Is this a bug, or am I getting something wrong?
Following the example given in the documentation, dataUrlfilter = "/[0-9]{4}/[0-9]{2}/[0-9]{2}/", it should be no problem at all to pass an argument that contains several "/".

Attributes

Is it possible to get attributes using ContentScraper(), like I would get using rvest with the following commands?

read_html(url) %>%
html_nodes(xpath) %>%
html_attr("href")

Can Rcrawler be used to scrape/crawl bilingual sites based on CSS selectors or XPath?

Hi Rcrawler team,

I am new to R and Rcrawler. I would like to know if Rcrawler can be used to scrape/crawl bilingual sites. Let's say I have this English site:
https://government.ae/en
and this is the corresponding Arabic one:
https://government.ae/ar-ae

How can I use Rcrawler to get the bitext from them and save the output in a tab-delimited file?
Can you crawl only text based on a div tag, CSS selectors, or maybe XPath?

Thanks

Crawling depthwise instead of breadthwise

When I set MaxDepth = 1, it crawls depth-first instead of breadth-first.
For example, page1.html contains two links (page2.html and page3.html).
page2.html has a link to page5.html.

I want to crawl page1.html, page2.html, and page3.html, but Rcrawler crawls page1.html, page2.html, and page5.html.

Also, how can I crawl only the starting page of a website with Rcrawler (just page1.html)? I tried MaxDepth = 0, but then it does not download any page content; it just creates a folder with the domain name.

Result data omits links not matching crawlUrlfilter filter

Thanks for this super useful package. I want to restrict the crawl to certain URL specifications, but capture all links on the crawled pages regardless of whether they match the filter. I can't get this to work in practice. An example:

Rcrawler(
  Website = "https://beta.companieshouse.gov.uk/company/02906991",
  no_cores = 4, no_conn = 4 ,
  NetworkData = TRUE, statslinks = TRUE,
  crawlUrlfilter = '02906991',
  saveOnDisk = F
)

Page https://beta.companieshouse.gov.uk/company/02906991/officers (which is crawled) includes links such as
https://beta.companieshouse.gov.uk/officers/... but these pages are not included in the results. E.g:

NetwIndex %>% str_subset('uk/officers')
character(0)

Shouldn't these links be captured, since I have provided no dataUrlfilter argument? Or am I missing something here?

How to crawl sites in different languages

I want to crawl "https://www.vebeg.de/web/en/verkauf/suchen.htm?DO_SUCHE=1&SUCH_MATGRUPPE=1300" website. This website has content in german language.

Also, How to overwrite/add the content to folder created with domain name. I am unable to crawl 2 different pages of same website at a time because it is giving me warning saying that folder already exists when I start crawling the different page of the same website.
Forexample

  1. Rcrawler(Website ="https://www.proxibid.com/asp/SearchAdvanced_i.asp?searchTerm=aviation&category=all+categories#search", no_cores = 10, no_conn = 10,MaxDepth=1,Encod="UTF-8")
  2. Rcrawler(Website ="https://www.proxibid.com/asp/AuctionsByCompany.asp?ahid=743", no_cores = 10, no_conn = 10,MaxDepth=1,Encod="UTF-8")

The first command downloaded pages successfully; the second does not download any pages.
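A hedged workaround for the "folder already exists" warning, assuming the DIR argument points to the parent directory in which the project folder is created: give each crawl its own output directory.

Rcrawler(Website = "https://www.proxibid.com/asp/AuctionsByCompany.asp?ahid=743",
         no_cores = 10, no_conn = 10, MaxDepth = 1, Encod = "UTF-8",
         DIR = "./proxibid_auctions_by_company")   # hypothetical path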

Missing 'webdriver' package

When trying to use Rcrawler I get an error

Rcrawler(Website = "http://www.nytimes.com", KeywordsFilter = c("Paris accord"), KeywordsAccuracy = 100)

Preparing multihreading cluster .. Error in checkForRemoteErrors(lapply(cl, recvResult)) :
15 nodes produced errors; first error: there is no package called ‘webdriver’

However, the webdriver package is installed and loaded.

Sys.info()
sysname release version
"Linux" "4.15.0-1031-azure" "#32-Ubuntu SMP Wed Oct 31 15:44:56 UTC 2018"
"x86_64"

I appreciate any guidance, thanks!

Error when crawling

When I try to crawl the following website, it always returns an error when the crawler gets to 24.98% complete.

Here is the R command:

Rcrawler(Website = "http://www.lamoncloa.gob.es", no_cores = 1, no_conn = 1)

Here is the error:

24.98 %  :  582  parssed from  2330 
In process : 586 ..
Error in allpaquet[[s]][[1]][[3]] : subscript out of bounds

Thanks,

Tim

Crawling pages with same url

Hello,

I'm trying to scrape press releases from the UN Office of the High Commissioner for Human Rights. The problem is that the website uses the same URL for its news search tool and any specific search that one runs -- it's always http://www.ohchr.org/EN/NewsEvents/Pages/NewsSearch.aspx. I should note that while the articles themselves have unique URLs, I also need the data from the search tables for my project.

So how can I crawl a website structured like this using Rcrawler? The program doesn't seem to find the table segments even if I specify them using CSS.

I've run the following script for a whole day without the crawler finding any match:
Rcrawler(Website = "http://www.ohchr.org/EN/NewsEvents/Pages/NewsSearch.aspx", ExtractCSSPat=c("#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_lblTitle", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_lblDate", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_NewsType li", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_CountryID li", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_MandateID li", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_SubjectID li"), ManyPerPattern=T, PatternsNames = c("Title","Date", "News type", "Country ID", "Mandate", "Subject"))

Any help you can provide would be very much appreciated!

LinkExtractor gives an empty result

The following code gives an empty result:
pageinfo <- LinkExtractor("https://www.michaelkors.com/blakely-leather-satchel/_/R-US_30S8SZLM6L")
The result is:

[[1]]
[[1]][[1]]
[1] 748

[[1]][[2]]
[1] "https://www.michaelkors.com/blakely-leather-satchel/_/R-US_30S8SZLM6L"

[[1]][[3]]
[1] "NULL"

[[1]][[4]]
[1] 628

[[1]][[5]]
[1] ""

[[1]][[6]]
[1] ""

[[1]][[7]]
[1] ""

[[1]][[8]]
[1] ""

[[1]][[9]]
[1] ""


[[2]]
logical(0)

Error in LinkExtractor(url = Ur, encod = encod) : object 'Extlinks' not found

Hi,

Thanks for your great package! I wanted to extract data from the following URL; however, it throws an error:

Data<-ContentScraper(Url = "https://www.ge.ch/votations/20180304/participation", 
                     CssPatterns = c("li"), ManyPerPattern = T)
Error in LinkExtractor(url = Ur, encod = encod) : 
  object 'Extlinks' not found

I can get the content with rvest, though, so I assume it's not an issue with the page itself.

part <- html("https://www.ge.ch/votations/20150308/cantonal/participation/")

part %>%
  html_nodes("li") %>%
  html_text()

Do you have an idea what could cause this error and what I can do to avoid it?

How to Not store/save html pages downloaded with Rcrawler

Wonderful package Salim!!!

I would like to know how I can avoid storing the HTML pages of the websites I am crawling. I just need the list of URLs for each website, which is already available in the INDEX data frame.
Is there an option for this?

Thank you :)
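A hedged suggestion based on the Companies House example earlier in this thread ("Result data omits links not matching crawlUrlfilter filter"), where saveOnDisk = F appears to control exactly this:

Rcrawler(Website = "http://www.example.com", no_cores = 4, no_conn = 4,
         saveOnDisk = FALSE)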

Getting data recursively from website

I will need to download data from the following website. It only allows 90 results per page, though. Can we crawl through these pages and get all of the data?

https://www.michigantrafficcrashfacts.org/querytool/lists/0#q1;0;2016;;

problem with ampersand

When running following code

Rcrawler(Website = crawl_page
, no_cores = 2
, no_conn = 2
, RequestsDelay = 1
, MaxDepth = 3
, DIR = 'https://www.indeed.com/jobs?q=senior+data+analyst&l=Tampa%2C+FL&sort=date'
, urlregexfilter = c("/rc/","start="))

There is an issue when there is an ampersand in the web addresses returned to the INDEX data frame: the & gets converted to the HTML entity &amp; instead of the real & sign. So in the INDEX it will show as
https://www.indeed.com/jobs?q=senior+data+analyst&amp;l=Tampa%2C+FL&amp;sort=date&amp;start=40

Instead of
https://www.indeed.com/jobs?q=senior+data+analyst&l=Tampa%2C+FL&sort=date&start=40

So in effect the parameters l, sort, and start are not being passed the correct values in the URL.
Let me know if you need any more info to help correct this.
Thanks
--Michael
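A hedged post-processing workaround until the root cause is fixed: decode the HTML entity in the collected URLs before reusing them.

INDEX$Url <- gsub("&amp;", "&", INDEX$Url, fixed = TRUE)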

Keyword Filter and Accuracy

Can you elaborate further on the filter and accuracy? In the section concerning the filter, it is mentioned that providing a keyword, say "keyword", will search for pages that include "keyword" at least once on the page.
In the accuracy section that follows the filter guidelines, it is then said that a 50% accuracy rate means that "keyword" occurs at least once, while 100% means "keyword" occurs at least five times.
Is there something I'm not understanding? Web crawling is a new concept for me, but I still can't seem to make sense of this particular section.

Error when running LinkExtractor

Hi Salim,

When I try to use the LinkExtractor function to crawl the New Zealand government website, I get an error.

Here is the code:

pageinfo <- LinkExtractor(url="https://www.govt.nz")

Here is the error:

Error in if (!is.na(links[t])) { : argument is of length zero

Can you please advise how to resolve this issue? I have tried many different combinations of the url parameter for this website, but none work (e.g., "https://www.govt.nz/", "https://govt.nz", "www.govt.nz", etc).

Edit: this error is also occurring for other sites. I encountered it when crawling the Canadian government website ("https://www.canada.ca/").

Thanks,

Tim

TimeZone Warning - No Crawling Executed

Hi,

In my RStudio the package installs perfectly; however, I can't use Rcrawler() for some reason. I get no error; it's just that RStudio stops responding in the console. I can still use RStudio, but the function never returns, and nothing (no error or warning) is displayed in the console either, except sometimes for this warning: "unknown timezone 'zone/tz/2017c.1.0/zoneinfo/Europe/Berlin'"

However, the project folder and "extracted_contents.csv" are created.

Even if I use a one-page website (with no pages other than the homepage), it never finishes, replies, or does anything. Only the RStudio console's cursor keeps blinking... and nothing happens.

Edit: additional info
I can use Data <- ContentScraper(Url = "http://glofile.com/index.php/2017/06/08/athletisme-m-a-rome/", CssPatterns = c(".entry-title", ".entry-content")) as expected; the Data variable is created and contains the expected info.

I have also updated my R since then; still the same issue...
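A hedged side note on the timezone warning: it looks like a known macOS/R quirk rather than anything specific to Rcrawler, and setting the TZ environment variable explicitly before crawling usually silences it (whether it is related to the hang is unclear).

Sys.setenv(TZ = "Europe/Berlin")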

My setup: macOS High Sierra Version 10.13.1
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.2
year 2017
month 09
day 28
svn rev 73368
language R
version.string R version 3.4.2 (2017-09-28)
nickname Short Summer

Scrape according to predict() result

Dear Salim, Dear Mohamed,

Your tool is awesome.

I would like to propose a big feature.

Let's assume we have a corpus of files we already scraped and found more interesting than others. We have built a classification model that we can apply to new documents with the predict() command to get a category or a yes/no decision.

Could the Rcrawler functions for crawling or scraping be extended to accept a model as a parameter, so that they scrape only content that falls into a specific category?

Example:
I have collected a number of texts related to individual bad mortgage loans in Swiss francs, which are a controversy in my country, and an equal number of articles related to other issues in the same "economy" section.
It would be a marvelous tool if I could tell it to do something like:
scrape(starting_point = "http://mywebsite/economy", predict_filter = predict(model = my_classification_model))
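A hedged sketch of how this could be approximated today by post-filtering, with several assumptions: DATA is the data frame Rcrawler produces, the "article" column matches one of the PatternsNames used in the crawl, my_classification_model is any model with a predict() method, and featurize() is a hypothetical helper that turns raw text into the model's feature representation.

keep <- predict(my_classification_model, featurize(DATA$article)) == "relevant"
interesting <- DATA[keep, ]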

Wishing you all the best,

Jacek
