
warc's Introduction

THIS WILL LIKELY BE UNDERGOING MAJOR CHANGES

See: https://github.com/hrbrmstr/jwatr for a new spin on the WARC ecosystem in R

The current thought is to move higher order functions to here (read/write WARC) and expose lower-level stuff via jwatr, but jwatr may also just become warc for a CRAN release.


warc : Tools to Work with the Web Archive Ecosystem

WARC files (and the metadata files that usually follow them) are the de facto method of archiving web content. There are tools in Python & Java to work with this data and there are many "big data" tools that make working with large-scale data from sites like Common Crawl and The Internet Archive very straightforward.

Now there are tools to create and work with the WARC ecosystem in R.

Possible use-cases:

  • Scraping data from many URLs when you want the analyses to be reproducible, are concerned that the sites may change format or go offline, and don't want to manage individual HTML (etc) files
  • Analyzing Common Crawl data (etc) natively in R
  • Saving the entire state of an httr request (warc can turn httr responses into WARC files and turn WARC response records into httr::response objects)

warc can work with WARC files that are composed of individual gzip streams or with plaintext WARC files, and can also read & generate CDX files. Support for more file types (e.g. WET, WAT, etc) is planned.
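To make "composed of individual gzip streams" concrete, here is a minimal base-R sketch (it does not use warc itself). It writes two records as separate gzip members, concatenates them into one file, and then reads back a single member given only its byte offset and size, which is the kind of lookup an index makes possible. The record contents and the helper name `read_member()` are made up for illustration.

```r
# Two stand-in "records" (made-up content), each written as its own gzip member
records <- c("first record", "second record")

member_files <- vapply(records, function(rec) {
  f <- tempfile(fileext = ".gz")
  con <- gzfile(f, "wb")
  writeLines(rec, con)
  close(con)
  f
}, character(1))

# Concatenate the raw bytes of the members into one file
members <- lapply(member_files, function(f) readBin(f, "raw", n = file.size(f)))
combined <- tempfile(fileext = ".gz")
writeBin(do.call(c, unname(members)), combined)

# Byte offset (0-based) and size of each member inside the combined file
sizes <- lengths(unname(members))
offsets <- cumsum(c(0, sizes))[seq_along(sizes)]

# Hypothetical helper: seek to a member, copy out just its bytes, and inflate them
read_member <- function(path, offset, size) {
  con <- file(path, "rb")
  seek(con, where = offset, origin = "start", rw = "read")
  bytes <- readBin(con, "raw", n = size)
  close(con)

  tmp <- tempfile(fileext = ".gz")
  writeBin(bytes, tmp)
  gz <- gzfile(tmp, "r")
  lines <- readLines(gz)
  close(gz)
  unlink(tmp)
  lines
}

read_member(combined, offsets[2], sizes[2])
## [1] "second record"
```

Because every member is a complete gzip stream, nothing before the offset has to be decompressed to reach the second record.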

Since I ended up making some gz file functions for this package, it only seemed appropriate to expose them.

The following functions are implemented:

  • as_warc: Convert an ‘httr::response’ object to WARC response objects
  • create_cdx: Takes as input an optionally compressed WARC file and creates a
  • create_warc_wget: Newer versions of ‘wget’ are designed to support capturing of or
  • gz_close: Close the gz file
  • gz_eof: Test for end of file
  • gz_flush: This will flush all zlib output buffers for the current file and
  • gz_fseek: Sets the starting position for the next ‘gz_read()’ or
  • gz_gets: Read a line from a gz file
  • gz_gets_raw: Read a line from a gz file
  • gz_offset: Return the current raw compressed offset in the file
  • gz_open: Open a gzip file for reading or writing
  • gz_read_char: Read from a gz file into a character vector
  • gz_read_raw: Read from a gz file into a raw vector
  • gz_seek: Sets the starting position for the next ‘gz_read()’ or
  • gz_tell: Return the current raw uncompressed offset in the file
  • gz_write_char: Write an atomic character vector to a file
  • gz_write_raw: Write a raw vector to a gz file
  • gzip_inflate_from_pos: Given a gzip file that was built with concatenated individual gzip
  • read_cdx: CDX files are used to index the content of WARC files.
  • read_warc_entry: Given the path to a WARC file (compressed or uncompressed) and the
  • warc_headers: Extract WARC headers from a WARC response object
  • write_warc_record: Write a WARC record to a file
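As a rough illustration of what a CDX index holds, the base-R sketch below parses a tiny CDX-style text by hand. It is not how read_cdx is implemented. It assumes the common layout where the first character of the header line is the field delimiter and the letters after "CDX" name the columns; the header here is a simplified subset, the rows are made-up data, and the column names in `legend` are illustrative.

```r
# Made-up CDX-style lines: a header with field letters, then one line per record
cdx_text <- c(
  " CDX a b m s V g",
  "https://example.org/ 20160901180346 text/html 200 0 example.warc.gz",
  "https://example.org/about 20160901180347 text/html 200 1234 example.warc.gz"
)

# The first character of the header is the delimiter; drop the leading "CDX" token
delim <- substr(cdx_text[1], 1, 1)
fields <- strsplit(trimws(cdx_text[1]), delim, fixed = TRUE)[[1]][-1]

# Illustrative names for the field letters used in this sketch
legend <- c(a = "original_url", b = "date", m = "mime_type",
            s = "response_code", V = "compressed_arc_file_offset",
            g = "file_name")

index <- read.table(text = cdx_text[-1], sep = delim,
                    col.names = unname(legend[fields]),
                    colClasses = "character",
                    quote = "", comment.char = "",
                    stringsAsFactors = FALSE)

# Offsets are byte positions, so convert them to numbers before using them
index$compressed_arc_file_offset <- as.numeric(index$compressed_arc_file_offset)

index$compressed_arc_file_offset
## [1]    0 1234
```

The offset column is what lets a reader jump straight to one record in a large WARC file instead of scanning it from the start.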

Installation

You need wget on your system PATH. Folks on real operating systems can do the apt-get, yum install or brew install (et al) dance for your particular system. Version 1.18+ is recommended, but any version with support for WARC extensions should do.

Windows folks will need to grab the statically linked 32-bit or 64-bit binaries from here and put them on your system PATH somewhere if you want to create WARC files in bulk using wget.
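A quick way to check from R whether wget is visible on the PATH is sketched below, using only base R. The messages are illustrative, and the sketch does not check whether the wget build found actually supports the WARC options.

```r
# Look up wget on the PATH; Sys.which() returns "" when nothing is found
wget_path <- Sys.which("wget")

if (nzchar(wget_path)) {
  # First line of `wget --version` normally carries the version string
  version_line <- system2(wget_path, "--version", stdout = TRUE)[1]
  message("Found wget at ", wget_path, ": ", version_line)
} else {
  message("wget was not found on the PATH; bulk WARC creation with wget will not work.")
}
```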

devtools::install_git("https://gitlab.com/hrbrmstr/warc.git")

Usage

library(warc)
library(httr)

# current version
packageVersion("warc")
## [1] '0.1.0'
cdx <- read_cdx(system.file("extdata", "20160901.cdx", package="warc"))

i <- 5

path <- file.path(cdx$warc_path[i], cdx$file_name[i])
start <- cdx$compressed_arc_file_offset[i]

entry <- read_warc_entry(path, start)

print(entry)
## Response [https://r-project.org/]
##   Date: 2016-09-01 18:03
##   Status: 301
##   Content-Type: text/html; charset=iso-8859-1
##   Size: 314 B
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
## <html><head>
## <title>301 Moved Permanently</title>
## </head><body>
## <h1>Moved Permanently</h1>
## <p>The document has moved <a href="https://www.r-project.org/">here</a>.</p>
## <hr>
## <address>Apache/2.4.10 (Debian) Server at r-project.org Port 443</address>
## </body></html>
print(warc_headers(entry))
## $`warc-type`
## [1] "response"
## 
## $`warc-record-id`
## [1] "<urn:uuid:9EB1B94E-A929-4B5E-AE10-FB54C2E7308B>"
## 
## $`warc-warcinfo-id`
## [1] "<urn:uuid:7D7DC0CA-8FC7-4FFB-9D5A-4187EB50ED74>"
## 
## $`warc-concurrent-to`
## [1] "<urn:uuid:592019F2-C037-4899-8DB2-CC04266B9E29>"
## 
## $`warc-target-uri`
## [1] "https://r-project.org/"
## 
## $`warc-date`
## [1] "2016-09-01 18:03:46 UTC"
## 
## $`warc-ip-address`
## [1] "137.208.57.37"
## 
## $`warc-block-digest`
## [1] "\020"
## 
## $`warc-payload-digest`
## [1] ":\001"
## 
## $`content-type`
## [1] "application/http;msgtype=response"
## 
## $`content-length`
## [1] 578
## 
## attr(,"class")
## [1] "insensitive" "list"
print(status_code(entry))
## [1] 301
print(http_type(entry))
## [1] "text/html"

Creating + reading

library(warc)
library(purrr)
library(rvest)

warc_dir <- file.path(tempdir(), "rfolks")
dir.create(warc_dir)

urls <- c("http://rud.is/",
          "http://hadley.nz/",
          "http://dirk.eddelbuettel.com/",
          "https://jeroenooms.github.io/",
          "https://ironholds.org/")

create_warc_wget(urls, warc_dir, warc_file="rfolks-warc")

cdx <- read_cdx(file.path(warc_dir, "rfolks-warc.cdx"))

sites <- map(seq_len(nrow(cdx)),
             ~read_warc_entry(file.path(cdx$warc_path[.],
                                        cdx$file_name[.]), 
                              cdx$compressed_arc_file_offset[.]))

map(sites, ~read_html(content(., as="text", encoding="UTF-8"))) %>% 
  map_chr(~html_text(html_nodes(., "title")))
## [1] "rud.is"              "Hadley Wickham"      "Dirk Eddelbuettel"   "About Jeroen..."     "Oliver Keyes - Home"
# recursive = TRUE is needed to remove a directory
unlink(warc_dir, recursive = TRUE)

Test Results

library(warc)
library(testthat)

date()
## [1] "Mon Sep 12 15:27:08 2016"
test_dir("tests/")
## testthat results ========================================================================================================
## OK: 0 SKIPPED: 0 FAILED: 0
## 
## DONE ===================================================================================================================

warc's People

Contributors

hrbrmstr

warc's Issues

`strnstr` was not declared in this scope

> install.packages(repos=NULL, type='source', '/opt/r-warc')
* installing *source* package ‘warc’ ...
** libs
g++ -std=gnu++11 -I"/usr/lib32/R/include" -DNDEBUG -I../inst/include -I"/usr/lib/R/library/Rcpp/include"    -fpic  -D_FORTIFY_SOURCE=2 -mtune=i686 -O2 -pipe   -g -c RcppExports.cpp -o RcppExports.o
g++ -std=gnu++11 -I"/usr/lib32/R/include" -DNDEBUG -I../inst/include -I"/usr/lib/R/library/Rcpp/include"    -fpic  -D_FORTIFY_SOURCE=2 -mtune=i686 -O2 -pipe   -g -c gzindex.cpp -o gzindex.o
gzindex.cpp:61: warning: "_ISOC99_SOURCE" redefined
 #define _ISOC99_SOURCE

In file included from /usr/include/c++/8.2/i686-pc-linux-gnu/bits/os_defines.h:39,
                 from /usr/include/c++/8.2/i686-pc-linux-gnu/bits/c++config.h:508,
                 from /usr/include/c++/8.2/cmath:41,
                 from /usr/lib/R/library/Rcpp/include/Rcpp/platform/compiler.h:100,
                 from /usr/lib/R/library/Rcpp/include/Rcpp/r/headers.h:59,
                 from /usr/lib/R/library/Rcpp/include/RcppCommon.h:29,
                 from /usr/lib/R/library/Rcpp/include/Rcpp.h:27,
                 from gzindex.cpp:1:
/usr/include/features.h:194: note: this is the location of the previous definition
 # define _ISOC99_SOURCE 1

gzindex.cpp: In function 'void int_create_cdx_from_warc(std::__cxx11::string, std::__cxx11::string, std::__cxx11::string, std::__cxx11::string)':
gzindex.cpp:177:19: error: 'strnstr' was not declared in this scope
         char *v = strnstr(buf, ": ", strlen(buf));
                   ^~~~~~~
gzindex.cpp:177:19: note: suggested alternative: 'strstr'
         char *v = strnstr(buf, ": ", strlen(buf));
                   ^~~~~~~
                   strstr
gzindex.cpp:264:21: error: 'strnstr' was not declared in this scope
           char *v = strnstr(buf, ": ", strlen(buf));
                     ^~~~~~~
gzindex.cpp:264:21: note: suggested alternative: 'strstr'
           char *v = strnstr(buf, ": ", strlen(buf));
                     ^~~~~~~
                     strstr
make: *** [/usr/lib32/R/etc/Makeconf:168: gzindex.o] Error 1
ERROR: compilation failed for package ‘warc’
* removing ‘/usr/lib/R/library/warc’
Warning message:
In install.packages(repos = NULL, type = "source", "/opt/r-warc") :
  installation of package ‘/opt/r-warc’ had non-zero exit status

Too many open files when mapping more than 509 pages

It is a great pleasure working with warc; however, I'm getting an error when mapping over a larger number of entries. It seems that the connections to the files are not closed. Please find a minimal reproducible example below:

library(warc)
library(tidyverse)

# download the Common Crawl example file if it does not exist
warc_big <- normalizePath("~/cc.warc.gz")    
if(!file.exists(warc_big)){
  download.file(
    "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/warc/CC-MAIN-20161202170900-00000-ip-10-31-129-80.ec2.internal.warc.gz",
    warc_big
  )
}

# create the index if it does not exist
warc_cdx <- normalizePath("~/cc.cdx")
if(!file.exists(warc_cdx)){
  create_cdx(
    warc_big,
    cdx_path = warc_cdx
  )
}
  
# read the index and map the data
cdx <- read_cdx(warc_cdx)

# this works
sites <- map(1:100,
             ~read_warc_entry(file.path(cdx$warc_path[.],
                                        cdx$file_name[.]), 
                              cdx$compressed_arc_file_offset[.]))                     
                              
# this crashes
sites_large <- map(1:1000,
             ~read_warc_entry(file.path(cdx$warc_path[.],
                                        cdx$file_name[.]), 
                              cdx$compressed_arc_file_offset[.]))     

The error I'm receiving is the following:

Using the hard way
7593104
Error in gz_open(wf, "read") : object 'wf' not found

And if I then try to perform other operations, I get:

> ?read_cdx
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
  In gzfile(file, "rb") :
  cannot open compressed file 'C:/Program Files/R/R-3.4.1/library/reshape2/Meta/package.rds', probable reason 'Too many open files'

Session info:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2    dplyr_0.7.2     purrr_0.2.2.2   readr_1.1.1     tidyr_0.6.3     tibble_1.3.3    ggplot2_2.2.1   tidyverse_1.1.1
[9] warc_0.1.0     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12     cellranger_1.1.0 compiler_3.4.1   plyr_1.8.4       bindr_0.1        forcats_0.2.0    tools_3.4.1     
 [8] uuid_0.1-2       lubridate_1.6.0  jsonlite_1.5     nlme_3.1-131     gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1 
[15] rlang_0.1.1      psych_1.7.5      parallel_3.4.1   haven_1.1.0      xml2_1.1.1       httr_1.2.1       stringr_1.2.0   
[22] hms_0.3          grid_3.4.1       glue_1.1.1       R6_2.2.2         readxl_1.0.0     foreign_0.8-69   modelr_0.1.0    
[29] reshape2_1.4.2   magrittr_1.5     scales_0.4.1     rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2
[36] stringi_1.1.5    lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2   

Thanks in advance
