sckott / webmiddens

cache http requests

Home Page: https://sckott.github.io/webmiddens

License: Other

R 97.28% Makefile 2.72%

http api web caching http-cache http-mocking r rstats fakeweb

webmiddens

simple caching of HTTP requests/responses, hooking into webmockr (https://github.com/ropensci/webmockr) for the HTTP request matching

A midden is a debris pile constructed by a woodrat/pack rat (https://en.wikipedia.org/wiki/Pack_rat#Midden)

the need

vcr is meant really for testing, or script use. i don't think it fits well into a use case where another pkg wants to cache responses
memoise seems close-ish but doesn't fit needs, e.g., no expiry, not specific to HTTP requests, etc.
we need something specific to HTTP requests, that allows expiration handling, a few different caching location options, works across HTTP clients, etc
caching just the http responses means the rest of the code in the function can change, and the response can still be cached
- the downside, vs. memoise, is that we're only caching the http response, so if there's still a lot of time spent processing the response, then the function will still be quite slow - BUT, if the HTTP response processing code is within a function, you could memoise that function
memoise is great, but since it caches the whole function call, you don't benefit from individually caching each http request, which we do here. if you cache each http request, then any time you make that same http request, its response is already cached
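
A minimal sketch of that last point, using only base R and no real HTTP. fake_get() and cached_get() are hypothetical stand-ins for illustration, not webmiddens functions; the idea is only that the cache is keyed on the request, so two different functions making the same request share one cached response.

```r
# Hypothetical stand-in for an HTTP request; counts how often it is "sent".
n_real_requests <- 0
fake_get <- function(url) {
  n_real_requests <<- n_real_requests + 1
  paste("response body for", url)
}

# Cache keyed on the request itself (here just the URL), not on the
# function that happens to make the request.
request_cache <- new.env()
cached_get <- function(url) {
  if (!exists(url, envir = request_cache, inherits = FALSE)) {
    assign(url, fake_get(url), envir = request_cache)
  }
  get(url, envir = request_cache, inherits = FALSE)
}

# Two different functions that make the same request
fxn_a <- function() toupper(cached_get("https://httpbin.org/get"))
fxn_b <- function() nchar(cached_get("https://httpbin.org/get"))

res_a <- fxn_a()  # triggers the one "real" request
res_b <- fxn_b()  # served from the cache, no second "request"
```

Memoising fxn_a() and fxn_b() separately would instead trigger the request once per function.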

brainstorming

use webmockr to match requests (works with crul; soon httr)
possibly match on, and expire based on headers: Cache-Control, Age, Last-Modified, ETag, Expires (see Ruby's faraday-http-cache (https://github.com/plataformatec/faraday-http-cache#what-gets-cached))
caching backends: probably all binary to save disk space since most likely we don't need users to be able to look at plain text of caches
expiration: set a time to expire. if set to 2019-03-08 00:00:00 and it's 2019-03-07 23:00:00, then 1 hr from now the cache will expire, and a new real HTTP request will need to be made (i.e., the expired cache is deleted on the next HTTP request made after that time)
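
The expiration example can be sketched in base R as below. is_expired() is a hypothetical helper used only for illustration; it is not part of webmiddens, and the real package may track expiry differently.

```r
# Hypothetical helper: has a cache entry passed its expiry time?
is_expired <- function(expires_at, now = Sys.time()) {
  now >= expires_at
}

expires_at <- as.POSIXct("2019-03-08 00:00:00", tz = "UTC")
now <- as.POSIXct("2019-03-07 23:00:00", tz = "UTC")

# seconds left before the cache expires
secs_left <- as.numeric(difftime(expires_at, now, units = "secs"))

# not expired yet, so the cached response would still be used
still_fresh <- !is_expired(expires_at, now)

# one hour later the entry is expired, so a real request would be needed
expired_later <- is_expired(expires_at, now + 3600)
```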

http libraries

right now we only support crul, but httr support should arrive soon

installation

remotes::install_github("sckott/webmiddens")

use_midden()

library(webmiddens)
library(crul)

Let's say you have some function http_request() that does an HTTP request that you re-use in various parts of your project or package

http_request <- function(...) {
  x <- crul::HttpClient$new("https://httpbin.org", opts = list(...))
  x$get("get")
}

And you have a function some_fxn() that uses http_request() to do the HTTP request, then processes the results into a data.frame or list, etc. This is a super common pattern in a project or R package that deals with web resources.

some_fxn <- function(...) {
  res <- http_request(...)
  jsonlite::fromJSON(res$parse("UTF-8"))
}

Without webmiddens the HTTP request happens as usual and all is good

some_fxn()
#> $args
#> named list()
#> 
#> $headers
#> $headers$Accept
#> [1] "application/json, text/xml, application/xml, */*"
#> 
#> $headers$`Accept-Encoding`
#> [1] "gzip, deflate"
#> 
#> $headers$Host
#> [1] "httpbin.org"
#> 
#> $headers$`User-Agent`
#> [1] "libcurl/7.74.0 r-curl/4.3 crul/1.0.2.92"
#> 
#> $headers$`X-Amzn-Trace-Id`
#> [1] "Root=1-5fd29de8-0e978093689e02246d0b3d92"
#> 
#> 
#> $origin
#> [1] "24.21.229.59"
#> 
#> $url
#> [1] "https://httpbin.org/get"

Now, with webmiddens

run wm_configuration() first to set the path where HTTP requests will be cached

wm_configuration("foo1")

#> configuring midden from $path

first request is a real HTTP request

res1 <- use_midden(some_fxn())
res1
#> $args
#> named list()
#> 
#> $headers
#> $headers$Accept
#> [1] "application/json, text/xml, application/xml, */*"
#> 
#> $headers$`Accept-Encoding`
#> [1] "gzip, deflate"
#> 
#> $headers$Host
#> [1] "httpbin.org"
#> 
#> $headers$`User-Agent`
#> [1] "libcurl/7.74.0 r-curl/4.3 crul/1.0.2.92"
#> 
#> $headers$`X-Amzn-Trace-Id`
#> [1] "Root=1-5fd29de8-3ad69a2f59e45afc48446e85"
#> 
#> 
#> $origin
#> [1] "24.21.229.59"
#> 
#> $url
#> [1] "https://httpbin.org/get"

second request uses the cached response from the first request

res2 <- use_midden(some_fxn())
res2
#> $args
#> named list()
#> 
#> $headers
#> $headers$Accept
#> [1] "application/json, text/xml, application/xml, */*"
#> 
#> $headers$`Accept-Encoding`
#> [1] "gzip, deflate"
#> 
#> $headers$Host
#> [1] "httpbin.org"
#> 
#> $headers$`User-Agent`
#> [1] "libcurl/7.74.0 r-curl/4.3 crul/1.0.2.92"
#> 
#> $headers$`X-Amzn-Trace-Id`
#> [1] "Root=1-5fd29de8-65506d0055d8b5c874949851"
#> 
#> 
#> $origin
#> [1] "24.21.229.59"
#> 
#> $url
#> [1] "https://httpbin.org/get"

the midden class

x <- midden$new()
x # no path
#> <midden> 
#>   path: 
#>   expiry (sec): not set
# Run $init() to set the path
x$init(path = "forest")
x
#> <midden> 
#>   path: /Users/sckott/Library/Caches/R/forest
#>   expiry (sec): not set

The cache slot has a hoardr object which you can use to fiddle with files, see ?hoardr::hoard

x$cache
#> <hoard> 
#>   path: forest
#>   cache path: /Users/sckott/Library/Caches/R/forest

Use expire() to set the expiry time (in seconds). You can set it by passing a value to expire() or through the environment variable WEBMIDDENS_EXPIRY_SEC

x$expire()
#> NULL
x$expire(5)
#> [1] 5
x$expire()
#> [1] 5
x$expire(reset = TRUE)
#> NULL
x$expire()
#> NULL
Sys.setenv(WEBMIDDENS_EXPIRY_SEC = 35)
x$expire()
#> [1] 35
x$expire(reset = TRUE)
#> NULL
x$expire()
#> NULL

FIXME: The below not working right now - figure out why

wm_enable()
con <- crul::HttpClient$new("https://httpbin.org")
# first request is a real HTTP request
x$r(con$get("get", query = list(stuff = "bananas")))
# following requests use the cached response
x$r(con$get("get", query = list(stuff = "bananas")))

verbose output

x <- midden$new(verbose = TRUE)
x$init(path = "rainforest")
x$r(con$get("get", query = list(stuff = "bananas")))

set expiration time

x <- midden$new()
x$init(path = "grass")
x$expire(3)
x

Delete all the files in your "midden" (the folder with cached files)

x$cleanup()

Delete the "midden" (the folder with cached files)

x$destroy()

webmiddens's Issues

why WIP and ropensci org

I was wondering whether a WIP repo shouldn't better live in ropenscilabs?

things that could be better

it's confusing knowing how to deal with expiry time. if you include expire in the 1st request, but not in the 2nd exact same request, does the expiry setting from the 1st apply to the 2nd one?
we need a way to simply wrap a package/script fxn in a webmiddens fxn instead of having to make the user modify the internals of their code, e.g.,

# some fxn that does an http request
some_fxn <- function(x) {
  some_http_request(x)
}

# simply wrap it in a webmiddens fxn, with settings in a separate fxn perhaps
library(webmiddens)
web_settings(dir = "foo/bar", expire = 3)
cache(some_fxn(...))
# and can override session level settings
cache(some_fxn(...), expire = 5)

package specific expiration times

brought up by maëlle, ropensci/rcrossref#202 (comment)

Doing this with env vars might be tricky, but easier with a function call.

cache backend options

in memory: an obvious choice needed
on disk: should be default for the majority use case of using this in other pkgs where you want users to have responses cached ACROSS R sessions (in memory no good in this case)
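
A rough sketch of the two backends behind one interface, in base R. memory_backend() and disk_backend() are hypothetical names, not webmiddens functions, and the keys are assumed to be safe to use as file names (e.g. a hash of the request).

```r
# In-memory backend: fast, but gone when the R session ends.
memory_backend <- function() {
  store <- new.env()
  list(
    set = function(key, value) assign(key, value, envir = store),
    get = function(key) {
      if (exists(key, envir = store, inherits = FALSE)) {
        base::get(key, envir = store, inherits = FALSE)
      } else {
        NULL
      }
    }
  )
}

# On-disk backend: survives across R sessions; saveRDS() writes a
# compressed binary file by default.
disk_backend <- function(dir) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path_for <- function(key) file.path(dir, paste0(key, ".rds"))
  list(
    set = function(key, value) saveRDS(value, path_for(key)),
    get = function(key) {
      if (file.exists(path_for(key))) readRDS(path_for(key)) else NULL
    }
  )
}

mem <- memory_backend()
mem$set("abc123", "cached response")

# tempdir() is used here only to keep the sketch harmless to run
disk <- disk_backend(file.path(tempdir(), "midden-sketch"))
disk$set("abc123", "cached response")
```

Both backends return NULL for a key that was never stored, so calling code would not need to know which one is in use.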

first test case: rcrossref

via ropensci/rcrossref#185

env vars to control pkg behavior

turn all caching on or off (WEBMIDDENS_TURN_OFF), takes logical
set the expiry time via env var (WEBMIDDENS_EXPIRY_SEC), takes integer
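
One way these two env vars could be read, sketched in base R. caching_turned_off() and expiry_from_env() are hypothetical helpers, and treating an unset WEBMIDDENS_TURN_OFF as FALSE is an assumption of this sketch.

```r
# Hypothetical reader for WEBMIDDENS_TURN_OFF (a logical).
# Unset or unparseable values count as "caching stays on".
caching_turned_off <- function() {
  isTRUE(as.logical(Sys.getenv("WEBMIDDENS_TURN_OFF", "FALSE")))
}

# Hypothetical reader for WEBMIDDENS_EXPIRY_SEC (an integer).
# Returns NULL when the variable is not set.
expiry_from_env <- function() {
  val <- Sys.getenv("WEBMIDDENS_EXPIRY_SEC", "")
  if (!nzchar(val)) return(NULL)
  as.integer(val)
}

Sys.setenv(WEBMIDDENS_TURN_OFF = "TRUE", WEBMIDDENS_EXPIRY_SEC = "35")
off <- caching_turned_off()
expiry <- expiry_from_env()

# clean up so the sketch leaves the session as it found it
Sys.unsetenv(c("WEBMIDDENS_TURN_OFF", "WEBMIDDENS_EXPIRY_SEC"))
```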

expiry functionality

...

One remark on the README

the downside, vs. memoise, is that we're only caching the http response, so if there's still a lot of time spent processing the response, then the function will still be quite slow

But you could still memoise that response processing part if you make it a function. 😉

expiring with response headers

right now the only method we have is time based: number of seconds since the request was recorded.

perhaps we can use response headers, following https://github.com/sourcelevel/faraday-http-cache :

cache-control: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control
age: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Age
expires: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Expires
last-modified: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified
- if-modified-since: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Modified-Since
- if-unmodified-since: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Unmodified-Since
- if-none-match: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-None-Match
e-tag: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag

p.s.

pragma: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Pragma (only recommended for backwards compatibility with http 1.0 clients)

Seems that the following are often found together:

cache-control and age
last-modified, if-modified-since, if-unmodified-since
etag

canonical ref perhaps: rfc2616 https://tools.ietf.org/html/rfc2616#page-108
how httr gathers caching info https://github.com/r-lib/httr/blob/master/R/cache.R - not used in normal http requests in the pkg though
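
A small sketch of the cache-control plus age pairing, in base R. freshness_left() is a hypothetical helper; it only looks at the max-age directive and the Age header, and ignores the time the response has spent in the local cache, so treat it as a starting point rather than a complete freshness calculation.

```r
# Hypothetical helper: seconds of freshness left according to
# Cache-Control max-age and Age. Returns NA when there is no max-age.
freshness_left <- function(headers) {
  # header names are case-insensitive
  names(headers) <- tolower(names(headers))
  cc <- headers[["cache-control"]]
  if (is.null(cc)) return(NA_real_)
  m <- regmatches(cc, regexpr("max-age=[0-9]+", cc))
  if (length(m) == 0) return(NA_real_)
  max_age <- as.numeric(sub("max-age=", "", m))
  age <- headers[["age"]]
  age <- if (is.null(age)) 0 else as.numeric(age)
  max_age - age
}

headers <- list(`Cache-Control` = "public, max-age=3600", Age = "600")
left <- freshness_left(headers)

# no Cache-Control header at all
no_info <- freshness_left(list(ETag = "abc"))
```

A positive result would mean the cached response can still be used; NA would mean falling back to the time-based expiry.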

possible to spin off writing cache to disk in another R session?

so that the user called function can be done more quickly

Any way to not require HTTP requests to be wrapped in a code block?

Right now, we need to wrap HTTP call in midden$call()

Maybe we keep that as an option, to wrap http call in a block, but easier to use would be a configuration setting function call - then just do http requests as normal. eg.,

library(webmiddens)
# 1 hr expiry
midden$new(path = "foobar", expire = 3600)

# then http requests are using webmiddens
crul::HttpClient$new("https://httpbin.org")$get("get", query = list(foo = "bar"))

Not sure this is possible. We'd have to have a specific hook within crul for this

Caching http requests and CRAN policies

Hi @sckott ,

Thanks for this great package. I started using it and wanted to know if the following CRAN policy also applies to webmiddens or any other package that caches http requests:

Packages should not write in the user’s home filespace (including clipboards), nor anywhere else on the file system apart from the R session’s temporary directory (or during installation in the location pointed to by TMPDIR: and such usage should be cleaned up). Installing into the system’s R installation (e.g., scripts to its bin directory) is not allowed.

Limited exceptions may be allowed in interactive sessions if the package obtains confirmation from the user.

Thanks again

cc @hrbrmstr

make cache path specific to the package the caching is done in

see ropensci/rcrossref#185

right now, the user sets the cache path with wm_configuration("foo2") which sets the last part of the path, e.g., /home/jane/.cache/R/foo2

could we have wm_configuration() put the package name in the cache path too, so

/home/jane/.cache/R/rcrossref/foo2 instead of /home/jane/.cache/R/foo2
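
The path change being proposed can be sketched as below. midden_path() is a hypothetical helper, not how wm_configuration() works today, and the base directory is hard-coded to the example path from this issue.

```r
# Hypothetical helper: put the calling package's name between the base
# cache directory and the user-chosen folder name.
midden_path <- function(name, pkg = NULL, base = "/home/jane/.cache/R") {
  if (is.null(pkg)) file.path(base, name) else file.path(base, pkg, name)
}

current <- midden_path("foo2")
proposed <- midden_path("foo2", pkg = "rcrossref")
```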

enabling webmiddens w/ enable() should work without having to load webmiddens

see ropensci/rcrossref#185 (comment)

perhaps something to do with where the mdenv environment is instantiated, perhaps put in onload, or somewhere else?

expiry: different options

right now the only way expire time works is across all stubbed/cached requests, that is, if an expire time is set, then on the next http request using webmiddens, all stubs are deleted that are expired

i imagine more fine grained control may be useful:

set expire time for all http requests matching a given http request pattern - i guess via any of the webmockr options, e.g., a url pattern, or a header pattern, or a body pattern
a user of a package foobar may want to set different expire times for each function of a package they're using, e.g. 10 sec for fxn1(), 1 minute for fxn2(), and 1 week for fxn3()
what else?
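
A sketch of the first idea, expiry by request pattern, in base R. expiry_rules and expiry_for() are hypothetical; matching is done on the URL only with grepl(), whereas a real version would presumably reuse webmockr's request matching.

```r
# Hypothetical rules: the first matching URL pattern wins.
expiry_rules <- list(
  list(pattern = "^https://httpbin\\.org/get", expire = 10),
  list(pattern = "crossref\\.org", expire = 60),
  list(pattern = ".*", expire = 7 * 24 * 60 * 60)  # fallback: 1 week
)

# Returns the expiry time in seconds for a URL, or NULL if no rule matches.
expiry_for <- function(url, rules = expiry_rules) {
  for (rule in rules) {
    if (grepl(rule$pattern, url)) return(rule$expire)
  }
  NULL
}

e1 <- expiry_for("https://httpbin.org/get?stuff=bananas")
e2 <- expiry_for("https://api.crossref.org/works")
e3 <- expiry_for("https://example.com/")
```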
