webmiddens

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

simple caching of HTTP requests/responses, hooking into webmockr (https://github.com/ropensci/webmockr) for the HTTP request matching

A midden is a debris pile constructed by a woodrat/pack rat (https://en.wikipedia.org/wiki/Pack_rat#Midden)

the need

  • vcr is meant really for testing, or script use. i don't think it fits well into a use case where another pkg wants to cache responses
  • memoise seems close-ish but doesn't fit needs, e.g., no expiry, not specific to HTTP requests, etc.
  • we need something specific to HTTP requests, that allows expiration handling, a few different caching location options, works across HTTP clients, etc
  • caching just the http responses means the rest of the code in the function can change, and the response can still be cached
    • the downside, vs. memoise, is that we're only caching the http response, so if there's still a lot of time spent processing the response, then the function will still be quite slow - BUT, if the HTTP response processing code is within a function, you could memoise that function
  • memoise is great, but since it caches the whole function call, you don't benefit from individually caching each http request, which we do here. if you cache each http request, then any time you do that same http request, its response is already cached

brainstorming

  • use webmockr to match requests (works with crul; soon httr)
  • possibly match on, and expire based on headers: Cache-Control, Age, Last-Modified, ETag, Expires (see Ruby's faraday-http-cache (https://github.com/plataformatec/faraday-http-cache#what-gets-cached))
  • caching backends: probably all binary to save disk space since most likely we don't need users to be able to look at plain text of caches
  • expiration: set a time to expire. if set to 2019-03-08 00:00:00 and it's 2019-03-07 23:00:00, then 1 hr from now the cache will expire, and a new real HTTP request will need to be made (i.e., the cache will be deleted whenever the next HTTP request is made)
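
The arithmetic in the expiration bullet can be checked with base R date-times. This is a minimal sketch, not webmiddens internals: is_expired() is a hypothetical helper, and the timestamps are the ones from the bullet above (UTC is assumed).

```r
# Hypothetical helper: TRUE once `now` has reached the expiry time.
is_expired <- function(expires_at, now = Sys.time()) {
  now >= expires_at
}

expires_at <- as.POSIXct("2019-03-08 00:00:00", tz = "UTC")
now        <- as.POSIXct("2019-03-07 23:00:00", tz = "UTC")

# time left (in hours) before the cached response is considered stale
remaining <- difftime(expires_at, now, units = "hours")
as.numeric(remaining)

is_expired(expires_at, now)         # still fresh: the cached response is used
is_expired(expires_at, now + 3600)  # one hour later: a real request is needed
```

The check only has to run when the next HTTP request comes in, which matches the "deleted whenever the next HTTP request is made" behaviour described above.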

http libraries

right now we only support crul, but httr support should arrive soon

installation

remotes::install_github("sckott/webmiddens")

use_midden()

library(webmiddens)
library(crul)

Let's say you have some function http_request() that does an HTTP request that you re-use in various parts of your project or package

http_request <- function(...) {
  x <- crul::HttpClient$new("https://httpbin.org", opts = list(...))
  x$get("get")
}

And you have a function some_fxn() that uses http_request() to do the HTTP request, then processes the results into a data.frame, list, etc. This is a super common pattern in a project or R package that deals with web resources.

some_fxn <- function(...) {
  res <- http_request(...)
  jsonlite::fromJSON(res$parse("UTF-8"))
}

Without webmiddens the HTTP request happens as usual and all is good

some_fxn()
#> $args
#> named list()
#> 
#> $headers
#> $headers$Accept
#> [1] "application/json, text/xml, application/xml, */*"
#> 
#> $headers$`Accept-Encoding`
#> [1] "gzip, deflate"
#> 
#> $headers$Host
#> [1] "httpbin.org"
#> 
#> $headers$`User-Agent`
#> [1] "libcurl/7.74.0 r-curl/4.3 crul/1.0.2.92"
#> 
#> $headers$`X-Amzn-Trace-Id`
#> [1] "Root=1-5fd29de8-0e978093689e02246d0b3d92"
#> 
#> 
#> $origin
#> [1] "24.21.229.59"
#> 
#> $url
#> [1] "https://httpbin.org/get"

Now, with webmiddens

run wm_configuration() first to set the path where HTTP requests will be cached

wm_configuration("foo1")
#> configuring midden from $path

first request is a real HTTP request

res1 <- use_midden(some_fxn())
res1
#> $args
#> named list()
#> 
#> $headers
#> $headers$Accept
#> [1] "application/json, text/xml, application/xml, */*"
#> 
#> $headers$`Accept-Encoding`
#> [1] "gzip, deflate"
#> 
#> $headers$Host
#> [1] "httpbin.org"
#> 
#> $headers$`User-Agent`
#> [1] "libcurl/7.74.0 r-curl/4.3 crul/1.0.2.92"
#> 
#> $headers$`X-Amzn-Trace-Id`
#> [1] "Root=1-5fd29de8-3ad69a2f59e45afc48446e85"
#> 
#> 
#> $origin
#> [1] "24.21.229.59"
#> 
#> $url
#> [1] "https://httpbin.org/get"

second request uses the cached response from the first request

res2 <- use_midden(some_fxn())
res2
#> $args
#> named list()
#> 
#> $headers
#> $headers$Accept
#> [1] "application/json, text/xml, application/xml, */*"
#> 
#> $headers$`Accept-Encoding`
#> [1] "gzip, deflate"
#> 
#> $headers$Host
#> [1] "httpbin.org"
#> 
#> $headers$`User-Agent`
#> [1] "libcurl/7.74.0 r-curl/4.3 crul/1.0.2.92"
#> 
#> $headers$`X-Amzn-Trace-Id`
#> [1] "Root=1-5fd29de8-65506d0055d8b5c874949851"
#> 
#> 
#> $origin
#> [1] "24.21.229.59"
#> 
#> $url
#> [1] "https://httpbin.org/get"

the midden class

x <- midden$new()
x # no path
#> <midden> 
#>   path: 
#>   expiry (sec): not set
# Run $init() to set the path
x$init(path = "forest")
x
#> <midden> 
#>   path: /Users/sckott/Library/Caches/R/forest
#>   expiry (sec): not set

The cache slot has a hoardr object which you can use to fiddle with files, see ?hoardr::hoard

x$cache
#> <hoard> 
#>   path: forest
#>   cache path: /Users/sckott/Library/Caches/R/forest

Use expire() to set the expiry time (in seconds). You can set it either by passing a value to expire() or through the environment variable WEBMIDDENS_EXPIRY_SEC

x$expire()
#> NULL
x$expire(5)
#> [1] 5
x$expire()
#> [1] 5
x$expire(reset = TRUE)
#> NULL
x$expire()
#> NULL
Sys.setenv(WEBMIDDENS_EXPIRY_SEC = 35)
x$expire()
#> [1] 35
x$expire(reset = TRUE)
#> NULL
x$expire()
#> NULL

FIXME: The below not working right now - figure out why

wm_enable()
con <- crul::HttpClient$new("https://httpbin.org")
# first request is a real HTTP request
x$r(con$get("get", query = list(stuff = "bananas")))
# following requests use the cached response
x$r(con$get("get", query = list(stuff = "bananas")))

verbose output

x <- midden$new(verbose = TRUE)
x$init(path = "rainforest")
x$r(con$get("get", query = list(stuff = "bananas")))

set expiration time

x <- midden$new()
x$init(path = "grass")
x$expire(3)
x

Delete all the files in your "midden" (the folder with cached files)

x$cleanup()

Delete the "midden" (the folder with cached files)

x$destroy()

Meta

  • Please report any issues or bugs.
  • License: MIT
  • Get citation information for webmiddens in R doing citation(package = 'webmiddens')
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

webmiddens's Issues

things that could be better

  • it's confusing knowing how to deal with expiry time. if you include expire in the 1st request, but not in the 2nd exact same request, does the expiry setting from the 1st apply to the 2nd one?

  • we need a way to simply wrap a package/script fxn in a webmiddens fxn instead of having to make the user modify the internals of their code, e.g.,

# some fxn that does an http request
some_fxn <- function(x) {
  some_http_request(x)
}

# simply wrap it in a webmiddens fxn, with settings set via a separate fxn perhaps
library(webmiddens)
web_settings(dir = "foo/bar", expire = 3)
cache(some_fxn(...))
# and can override session level settings
cache(some_fxn(...), expire = 5)

cache backend options

  • in memory: an obvious choice needed
  • on disk: should be default for the majority use case of using this in other pkgs where you want users to have responses cached ACROSS R sessions (in memory no good in this case)
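
To make the trade-off concrete, here is a rough sketch of the two backends in base R. It does not use the webmiddens API: mem_cache, disk_dir, and the stored response list are made up for illustration. The on-disk variant writes a compressed binary .rds file, in line with the binary-storage idea from the README brainstorming.

```r
# a stand-in for an HTTP response
response <- list(url = "https://httpbin.org/get", status = 200L, body = "{}")

# in memory: an environment used as a key/value store; gone when the session ends
mem_cache <- new.env(parent = emptyenv())
assign("GET https://httpbin.org/get", response, envir = mem_cache)
from_memory <- get("GET https://httpbin.org/get", envir = mem_cache)

# on disk: a binary .rds file that can be read back later.
# tempdir() keeps the example tidy; a real cache would need a persistent
# directory to survive across R sessions.
disk_dir <- file.path(tempdir(), "midden-sketch")
dir.create(disk_dir, showWarnings = FALSE, recursive = TRUE)
disk_file <- file.path(disk_dir, "get.rds")
saveRDS(response, disk_file)
from_disk <- readRDS(disk_file)

identical(from_memory, from_disk)
```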

env vars to control pkg behavior

  • turn all caching on or off (WEBMIDDENS_TURN_OFF), takes logical
  • set the expiry time via env var (WEBMIDDENS_EXPIRY_SEC), takes integer
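
A sketch of how the two variables could be read, assuming unset variables mean "caching on, no expiry". read_settings() is a hypothetical helper, not a webmiddens function, and the defaults are an assumption.

```r
# Hypothetical helper: read both settings, falling back to defaults when unset.
read_settings <- function() {
  off <- Sys.getenv("WEBMIDDENS_TURN_OFF", unset = "FALSE")
  sec <- Sys.getenv("WEBMIDDENS_EXPIRY_SEC", unset = NA)
  list(
    turn_off   = isTRUE(as.logical(off)),  # anything unparseable counts as FALSE
    expiry_sec = if (is.na(sec)) NULL else as.integer(sec)
  )
}

Sys.setenv(WEBMIDDENS_TURN_OFF = "TRUE", WEBMIDDENS_EXPIRY_SEC = "35")
settings <- read_settings()
settings$turn_off
settings$expiry_sec

# with both variables unset, the assumed defaults apply
Sys.unsetenv(c("WEBMIDDENS_TURN_OFF", "WEBMIDDENS_EXPIRY_SEC"))
defaults <- read_settings()
```

Environment variables are always strings, so the logical and integer conversions have to happen on read.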

One remark on the README

the downside, vs. memoise, is that we're only caching the http response, so if there's still a lot of time spent processing the response, then the function will still be quite slow

But you could still memoise that response processing part if you make it a function. 😉
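
A sketch of that idea, using a small hand-rolled memoiser as a stand-in for memoise so the example has no dependencies. parse_body() and memoise_simple() are hypothetical names; in practice the body would come from a cached HTTP response.

```r
# minimal memoiser for a one-argument function, keyed on the deparsed argument
memoise_simple <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    key <- paste(deparse(x), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

calls <- 0
# stands in for slow processing of a response body
parse_body <- function(body) {
  calls <<- calls + 1
  toupper(body)
}
parse_body_cached <- memoise_simple(parse_body)

first  <- parse_body_cached("some response text")
second <- parse_body_cached("some response text")  # served from the cache
calls  # parse_body() itself ran only once
```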

expiring with response headers

right now the only method we have is time based: number of seconds since the request was recorded.

perhaps we can use response headers, following https://github.com/sourcelevel/faraday-http-cache :

p.s.

Seems that the following are often found together:

  • cache-control and age
  • last-modified, if-modified-since, if-unmodified-since
  • etag
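
As a rough illustration of the first pairing, a response can be treated as fresh while its Age is below the max-age directive of Cache-Control. The sketch below is base R only and deliberately simplified: it ignores no-store, no-cache, Expires, and validators such as ETag, and it assumes header names are already lowercased. max_age() and is_fresh() are hypothetical helpers, not part of webmiddens.

```r
# extract max-age (seconds) from a Cache-Control header value; NA if absent
max_age <- function(cache_control) {
  if (is.null(cache_control)) return(NA_integer_)
  m <- regmatches(cache_control, regexpr("max-age=[0-9]+", cache_control))
  if (length(m) == 0) return(NA_integer_)
  as.integer(sub("max-age=", "", m, fixed = TRUE))
}

# fresh while the response's age is below max-age; a missing Age counts as 0
is_fresh <- function(headers) {
  limit <- max_age(headers[["cache-control"]])
  age <- if (is.null(headers[["age"]])) 0L else as.integer(headers[["age"]])
  !is.na(limit) && age < limit
}

is_fresh(list(`cache-control` = "public, max-age=3600", age = "120"))   # within max-age
is_fresh(list(`cache-control` = "public, max-age=3600", age = "4000"))  # past max-age
is_fresh(list(`cache-control` = "no-cache"))                            # no max-age given
```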

Any way to not require HTTP requests to be wrapped in a code block?

Right now, we need to wrap HTTP call in midden$call()

Maybe we keep that as an option, to wrap http call in a block, but easier to use would be a configuration setting function call - then just do http requests as normal. eg.,

library(webmiddens)
# 1 hr expiry
midden$new(path = "foobar", expire = 3600)

# then http requests are using webmiddens
crul::HttpClient$new("https://httpbin.org")$get("get", query = list(foo = "bar"))

Not sure this is possible. We'd have to have a specific hook within crul for this

Caching http requests and CRAN policies

Hi @sckott ,

Thanks for this great package. I started using it and wanted to know if the following CRAN policy also applies to webmiddens, or to any other package that caches http requests:

Packages should not write in the user’s home filespace (including clipboards), nor anywhere else on the file system apart from the R session’s temporary directory (or during installation in the location pointed to by TMPDIR: and such usage should be cleaned up). Installing into the system’s R installation (e.g., scripts to its bin directory) is not allowed.

Limited exceptions may be allowed in interactive sessions if the package obtains confirmation from the user.

Thanks again

cc @hrbrmstr

expiry: different options

right now the only way expire time works is across all stubbed/cached requests, that is, if an expire time is set, then on the next http request using webmiddens, all expired stubs are deleted

i imagine more fine grained control may be useful:

  • set expire time for all http requests matching a given http request pattern - i guess via any of the webmockr options, e.g., a url pattern, or a header pattern, or a body pattern
  • a user of a package foobar may want to set different expire times for each function of a package they're using, e.g. 10 sec for fxn1(), 1 minute for fxn2(), and 1 week for fxn3()
  • what else?
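
The first option could look something like the sketch below: a lookup of expire times keyed by URL regular expression, where the first matching pattern wins. This is only an illustration in base R. expiry_rules and expiry_for() are hypothetical, the URLs are made up, and real matching would presumably go through webmockr rather than grepl().

```r
# hypothetical rules: pattern -> expire time in seconds, checked in order
expiry_rules <- list(
  list(pattern = "/search",         expire = 10),
  list(pattern = "/taxa/[0-9]+$",   expire = 60),
  list(pattern = "^https://api\\.", expire = 7 * 24 * 3600)
)

# return the expire time of the first rule whose pattern matches the URL,
# or `default` when nothing matches
expiry_for <- function(url, rules = expiry_rules, default = NULL) {
  for (rule in rules) {
    if (grepl(rule$pattern, url)) return(rule$expire)
  }
  default
}

expiry_for("https://api.example.org/search?q=rat")  # 10 seconds
expiry_for("https://api.example.org/taxa/42")       # 1 minute
expiry_for("https://api.example.org/names")         # 1 week
expiry_for("https://httpbin.org/get")               # no rule: falls back to default
```

Because rules are checked in order, more specific patterns need to come before catch-all ones.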
