r-lib / httr2

Make HTTP requests and process their responses. A modern reimagining of httr.

Home Page: https://httr2.r-lib.org

License: Other

R 99.97% Shell 0.03%

r http

httr2's Introduction

httr2

httr2 (pronounced hitter2) is a ground-up rewrite of httr that provides a pipeable API with an explicit request object that solves more problems felt by packages that wrap APIs (e.g. built-in rate-limiting, retries, OAuth, secure secrets, and more).

Installation

You can install httr2 from CRAN with:

install.packages("httr2")

Usage

To use httr2, start by creating a request:

library(httr2)

req <- request("https://r-project.org")
req
#> <httr2_request>
#> GET https://r-project.org
#> Body: empty

You can tailor this request with the req_ family of functions:

# Add custom headers
req |> req_headers("Accept" = "application/json")
#> <httr2_request>
#> GET https://r-project.org
#> Headers:
#> • Accept: 'application/json'
#> Body: empty

# Add a body, turning it into a POST
req |> req_body_json(list(x = 1, y = 2))
#> <httr2_request>
#> POST https://r-project.org
#> Body: json encoded data

# Automatically retry if the request fails
req |> req_retry(max_tries = 5)
#> <httr2_request>
#> GET https://r-project.org
#> Body: empty
#> Policies:
#> • retry_max_tries: 5

# Change the HTTP method
req |> req_method("PATCH")
#> <httr2_request>
#> PATCH https://r-project.org
#> Body: empty

And see exactly what httr2 will send to the server with req_dry_run():

req |> req_dry_run()
#> GET / HTTP/1.1
#> Host: r-project.org
#> User-Agent: httr2/1.0.0.9000 r-curl/5.2.1 libcurl/8.4.0
#> Accept: */*
#> Accept-Encoding: deflate, gzip

Use req_perform() to perform the request, retrieving a response:

resp <- req_perform(req)
resp
#> <httr2_response>
#> GET https://www.r-project.org/
#> Status: 200 OK
#> Content-Type: text/html
#> Body: In memory (6854 bytes)

The resp_ functions help you extract various useful components of the response:

resp |> resp_content_type()
#> [1] "text/html"
resp |> resp_status_desc()
#> [1] "OK"
resp |> resp_body_html()
#> {html_document}
#> <html lang="en">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body>\n    <div class="container page">\n      <div class="row">\n       ...

Major differences to httr

You can now create and modify a request without performing it. This means that there’s now a single function to perform the request and fetch the result: req_perform(). req_perform() replaces httr::GET(), httr::POST(), httr::DELETE(), and more.
HTTP errors are automatically converted into R errors. Use req_error() to override the defaults (which turn all 4xx and 5xx responses into errors) or to add additional details to the error message.
You can automatically retry if the request fails or encounters a transient HTTP error (e.g. a 429 rate limit request). req_retry() defines the maximum number of retries, which errors are transient, and how long to wait between tries.
OAuth support has been totally overhauled to directly support many more flows and to make it much easier to both customise the built-in flows and to create your own.
You can manage secrets (often needed for testing) with secret_encrypt() and friends. You can obfuscate mildly confidential data with obfuscate(), preventing it from being scraped from published code.
You can automatically cache all cacheable results with req_cache(). Relatively few API responses are cacheable, but when they are it typically makes a big difference.
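
As a minimal sketch of how two of these policies can be combined on one request without sending anything: the URL, the choice to treat only 5xx responses as errors, and the list of transient status codes are assumptions made for illustration.

```r
library(httr2)

req <- request("https://example.com/api") |>
  # Treat only 5xx responses as R errors (the default also includes 4xx)
  req_error(is_error = function(resp) resp_status(resp) >= 500) |>
  # Retry up to 3 times when the response looks transient
  req_retry(
    max_tries = 3,
    is_transient = function(resp) resp_status(resp) %in% c(429, 503)
  )

# Nothing has been sent yet; req_perform(req) would apply both policies
req
```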

Acknowledgements

httr2 wouldn’t be possible without curl, openssl, jsonlite, and jose, which are all maintained by Jeroen Ooms. Big thanks also go to Jenny Bryan and Craig Citro, who have given me much useful feedback on both the design of the internals and the user-facing API.

httr2's People

Forkers

maxheld83 isabella232 yutannihilation nxskok koderkow hongooi73 adithirgis jameslairdsmith flahn jonthegeek jchrom dmerch boshek csdaw fangzhou-xie mgirlich wing328 philipp-baumann fkohrt spotrh dpprdan judith-bourque cstepper howardbaek mlopez-ibanez elipousson casa-henrym michaelchirico gvelasq atheriel nealrichardson sermetpekin allenlile jbgruber takewiki fh-mthomson taerwin magicdefender ake123 gacolitti salim-b mshelm botan jl5000 dyfanjones romainfrancois mkoohafkan gauravcodepro steveputman burgerga

httr2's Issues

Consider the use of restarts to handle flaky internet

i.e. failure before we get to the http request

URL substitution

Think about the common patterns used to document REST APIs, i.e. the colon signals a placeholder, and hooking this up nicely to glue(). Relates to glue transformers.

GitHub:
GET /repos/:owner/:repo/topics

Canvas:
DELETE /api/v1/courses/:course_id/discussion_topics/:topic_id
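
A rough base R sketch of the colon-placeholder idea, for illustration only. fill_template() is a hypothetical helper (it is not part of httr2 or glue), and the values passed in are made up.

```r
# Hypothetical helper: replace ":name" placeholders with supplied values
fill_template <- function(template, ...) {
  params <- list(...)
  # Find placeholders such as ":owner" (a colon followed by an identifier)
  placeholders <- regmatches(
    template,
    gregexpr(":[A-Za-z_][A-Za-z0-9_]*", template)
  )[[1]]

  for (placeholder in placeholders) {
    name <- substring(placeholder, 2)
    if (is.null(params[[name]])) {
      stop("Missing value for placeholder '", name, "'", call. = FALSE)
    }
    # Percent-encode the value so it stays a single path segment
    value <- utils::URLencode(as.character(params[[name]]), reserved = TRUE)
    template <- sub(placeholder, value, template, fixed = TRUE)
  }
  template
}

fill_template("GET /repos/:owner/:repo/topics", owner = "r-lib", repo = "httr2")
fill_template(
  "DELETE /api/v1/courses/:course_id/discussion_topics/:topic_id",
  course_id = 42, topic_id = 7
)
```

Placeholders are processed left to right and each value is percent-encoded, so a substituted value cannot introduce a new placeholder.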

serialize NULL as null for JSON

In httr2, PUT/POST/PATCH with a JSON body should turn a NULL field into a JSON null, not an empty object. The current behaviour makes it unnecessarily hard to interface with REST APIs that distinguish between nulls and empty objects.

This is arguably a wart in jsonlite::toJSON, but changing that or httr now would probably break too much code. Since httr2 will be a new package, there shouldn't be any legacy issues.

# current behaviour, as seen in httr:::body_config
jsonlite::toJSON(list(a=1, b=list(), c=list(x=2, y=NULL)), auto_unbox=TRUE)
# {"a":1,"b":[],"c":{"x":2,"y":{}}}

# better behaviour
jsonlite::toJSON(list(a=1, b=list(), c=list(x=2, y=NULL)), auto_unbox=TRUE, null="null")
# {"a":1,"b":[],"c":{"x":2,"y":null}}

Check first arguments

And give a clear error message if a resp_ function doesn't get a response object or a req_ function doesn't get a request object.

Pluggable error handling

So you can automatically turn http status codes into R errors

Throttling

https://github.com/SerpentAI/requests-respectful

Rename req_fetch() to req_perform()?

Can we use multifetch in req_fetch()?

Two big advantages:

only have to maintain one code path, not one in req_fetch() and one in req_multi_fetch()
easier to do progress bars which are otherwise challenging inside of curl callbacks

Will need to add additional queue of requests that need to be added to the pool in the future. And keep looping until both pool and queue are empty.
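
The queue-plus-pool loop could look roughly like the simulation below. It makes no network calls; run_with_pool() and perform_one() are hypothetical names, and perform_one() stands in for whatever completes a single in-flight request.

```r
# Simulated scheduler: keeps looping until both the pool and the queue are empty
run_with_pool <- function(requests,
                          max_active = 2,
                          perform_one = function(req) paste("done:", req)) {
  queue <- as.list(requests)  # requests waiting to be added to the pool
  pool <- list()              # requests currently "in flight"
  results <- list()

  while (length(queue) > 0 || length(pool) > 0) {
    # Top up the pool from the queue, up to the concurrency limit
    while (length(pool) < max_active && length(queue) > 0) {
      pool <- c(pool, queue[1])
      queue <- queue[-1]
    }
    # Complete the oldest in-flight request and record its result
    results <- c(results, list(perform_one(pool[[1]])))
    pool <- pool[-1]
  }
  results
}

results <- run_with_pool(c("a", "b", "c"))
```

In a design like this, a request that needs to be retried could presumably be pushed back onto the queue rather than handled on a separate code path.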

Broadly consider redaction

When is a user likely to accidentally leak confidential information when trying to get help?

What else needs to be redacted apart from the Authorization header?
Should req_dry_run() redact?
Should req_verbose() redact?
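
One possible shape for header redaction, sketched in base R. redact_headers() is a hypothetical helper, and the token value is made up.

```r
# Hypothetical helper: mask the values of sensitive headers before printing
redact_headers <- function(headers, to_redact = "authorization") {
  # Header names are case-insensitive, so compare in lower case
  is_secret <- tolower(names(headers)) %in% tolower(to_redact)
  headers[is_secret] <- "<REDACTED>"
  headers
}

headers <- list(Accept = "application/json", Authorization = "Bearer abc123")
redacted <- redact_headers(headers)
```

Extending to_redact (for example with "cookie") would be one way to cover other headers, if that turns out to be needed.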

Make caching easier

Make it possible to designate some cache directory for storing body content.

Retrying (incl rate limits)

If nothing else, this blog post captures lots of important ergonomic considerations:

https://tech.channable.com/posts/2020-02-05-opnieuw.html

Rename req() to request()

To match response()

Check Azure certificate based auth

https://docs.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-auth-code-flow#redeem-a-code-for-an-access-token

Something like this:

client_id <- "my_client_id"
claims <- list(
  aud = "https://login.microsoftonline.com/{tenant}/v2.0",
  iss = client_id,
  sub = client_id
)
oauth_app(
  oauth_client("id"),
  endpoints = c(
    token = "https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token"
  ),
  auth = "jwt_enc",
  auth_params = list(
    claims = claims,
    key = "path/to/private/key",
    header = list(
      x5t = base64url_encode(openssl::sha1(openssl::read_cert("path/to/certificate")))
    )
  )
)

Create apps at https://portal.azure.com/#blade/Microsoft_AAD_RegisteredApps/ApplicationsListBlade

See code in https://github.com/Azure/AzureAuth/blob/6f6ecf4ea47e3da35afe1e05e5c5a196a00b8aae/R/cert_creds.R

Pagination

Help with auto-traversal, as we do in gh. There are some pretty standard ways of doing this.
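
One common convention is a Link response header with a rel="next" entry. The sketch below extracts that URL in base R; next_link() is a hypothetical helper, and it assumes that the URLs themselves contain no commas.

```r
# Hypothetical helper: return the rel="next" URL from a Link header, or NULL
next_link <- function(link_header) {
  parts <- strsplit(link_header, ",\\s*")[[1]]
  is_next <- grepl('rel="next"', parts, fixed = TRUE)
  if (!any(is_next)) {
    return(NULL)  # no further pages
  }
  # Keep only the URL between the angle brackets
  sub("^\\s*<([^>]*)>.*$", "\\1", parts[is_next][1])
}

link <- paste0(
  '<https://api.github.com/repos?page=2>; rel="next", ',
  '<https://api.github.com/repos?page=5>; rel="last"'
)
next_link(link)
```

Auto-traversal would then amount to repeating the request against the returned URL until the helper yields NULL.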

httr2

httr is a library for two main tasks: creating HTTP requests and parsing the responses. Currently this dichotomy is a little muddled because:

There is no explicit request object - it is only created internally by the request functions. This tends to lead to large request functions that have many arguments passed in ...
There's no consistent naming scheme that indicates whether a function works with a request or a response. This is particularly confusing for functions like content_type(): does it set the content type of the request or extract the content type of the response?

Additionally, httr was designed prior to the pipe and so uses ... rather than functions that modify an object. This makes the API feel rather different to similar APIs (e.g. rvest). It also makes it harder to test, because you can only easily test the result of issuing a request, not the internal request object. Overall the API feels a little dated, and a little underdesigned.

Request API

Basics

req("http://url.com/") %>% req_fetch()

# req_fetch could be a generic so you could still do
req_fetch("http://url.com/")

Performing the request

The HTTP method doesn't affect the input arguments or the output type, so that suggests that the key API verb should be the output, rather than the HTTP method:

req("http://url.com") %>% req_fetch()
req("http://url.com") %>% req_async()
req("http://url.com") %>% req_save(path = path)
req("http://url.com") %>% req_stream(fun = fun)

The request verb would default to GET, unless a body was set, in which case it would change to POST. Otherwise, you could override it yourself in two ways:

req("http://url.com") %>% req_method("DELETE") %>% req_fetch()
req("http://url.com") %>% req_fetch("DELETE")

The first form would be most useful when generating partial requests for API wrappers.

Url

# Would generally make smaller functions that took multiple arguments
# This would make iterative url construction much nicer for APIs
req("http://url.com/") %>% 
  req_path("a", username, "b") %>%  # replaces
  req_path_suffix("y") %>%  # appends
  req_path_prefix("x") %>%  # for completeness
  req_query(y = 1) %>% 
  req_params() %>% 
  req_fetch()

# req would need to parse the url so you could start with
req("http://url.com/api/v1/?q=2")

Body

# This would also lead to a nicer API for bodies
req("http://url.com/") %>% 
  req_body_json(a = 1, b = 2, c = 3) %>% 
  req_fetch() 

# Setting body would change default request type, but you could
# override
req("http://url.com/") %>% 
  req_body_json(a = 1, b = 2, c = 3) %>% 
  req_fetch("PUT") 

req("http://url.com/") %>% 
  req_body_json(a = 1, b = 2, c = 3) %>% 
  req_save(path = "~/Desktop/bigfile.blah") 
  # If path is directory, automatically add name from url?

# req_body_file()
# req_body_form()
# req_body_json()
# req_body_multipart()
# req_body_raw()

Authentication

req("http://url.com/") %>% req_auth_basic()
req("http://url.com/") %>% req_auth_oauth1()
req("http://url.com/") %>% req_auth_oauth2()

Headers

req("http://url.com/") %>% 
  req_header(`Content-type` = "application/json") %>% 
  req_header(a_list_from_somewhere_else) # would unlist() inputs as needed.
# list(...) %>% flatten() %>% map_chr(as.character)

Curl

# And there would be a new function for setting specific curl requests
# These would be applied after other request parameters
req_config()

Response API

Basics

resp <- req("http://google.com") %>% req_fetch()

# Pipeable API to check that the response is what you expect
resp %>% 
  resp_check_ok() %>% 
  resp_check_body_xml() %>% 
  resp_content_xml()

# Other functions extract headers
resp %>% resp_headers()
resp %>% resp_content_type()

# And other http components
resp %>% resp_status()
resp %>% resp_url()
resp %>% resp_timings()

Content

resp %>% resp_content_raw()
resp %>% resp_content_text(encoding = "UTF-8")

# Would rely on user to check content type using helper from above
resp %>% resp_content_json()
resp %>% resp_content_xml()
resp %>% resp_content_html()
resp %>% resp_content_png()
resp %>% resp_content_jpeg()

Would not have resp_content_auto() because I think time has shown that this is a bad idea.

New package?

Should this be a new package, httr2?

Pros:

Can start from scratch without having to work with existing API
Could aim for high unit test coverage from the beginning.
Documentation will be less confusing because it doesn't have to describe
two APIs.

Cons:

May re-introduce bugs because I miss important logic
Will have to maintain two packages in the short term (in the long term
would deprecate httr).

I think the pros probably outweigh the cons - the API will be a sufficiently large change that it's worth starting from scratch.

@craigcitro, @jeroenooms, @jennybc I'd love your thoughts on this, if you have a little spare time.

Checking content type

In resp_body_json() the only accepted media type seems to be application/json. Many APIs have vendor specific types, e.g. application/vnd.github-issue.text+json (see other examples on Swagger or the RFC 6838).
It would be great if these media types were also recognized as valid json media type.
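
A possible check, sketched in base R, that accepts both application/json and types using the +json structured syntax suffix. is_json_media_type() is a hypothetical helper, not an existing httr2 function.

```r
# Hypothetical helper: TRUE for application/json and application/*+json
is_json_media_type <- function(content_type) {
  # Drop parameters such as "; charset=utf-8" and normalise case
  media_type <- tolower(trimws(sub(";.*$", "", content_type)))
  grepl("^application/([^/]+\\+)?json$", media_type)
}

is_json_media_type("application/json; charset=utf-8")
is_json_media_type("application/vnd.github-issue.text+json")
is_json_media_type("text/html")
```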

Curl option parsing

Following the outline of http://github.com/hrbrmstr/curlconverter

Figure out shiny integration

r-lib/gargle#157

Need to ensure that expiry is a parameter of the setup; the current default sets the lifetime to null, which means that the cookies will be expired on next log in.
Cookies are encrypted so that they can't be read by browser; need to double check encryption if there's no client secret. Might be adequate to encrypt with common httr2 key?
Double check that cookies are scoped to given path so they only apply for one app.
PR to shiny to provide https://github.com/r-lib/gargle/pull/157/files#diff-169b8f234d0b208affb106fce375f86fefe2f16dba4ad66495a1dc06c8a4cd7bR145-R185

Code in PR currently uses OAuth as gate to access app; might also want to use it as optional feature (i.e. log in to save this file to your google drive), so will also need to work out that flow.

Rename obfuscated to hidden

And document by itself

Lightweight json class?

e.g.

print.httr2_json <- function(x, ...) {
  cat("<List from json>\n")
  # toJSON(), not fromJSON(): x is already an R list
  print(jsonlite::toJSON(unclass(x), auto_unbox = TRUE, pretty = TRUE))
  invisible(x)
}

Because it's a relatively compact format. OTOH maybe this will be confusing because it's not obvious you can treat it like a list?

Think about encryption

Basic with built-in password to avoid storing client_secret in plain text
Password argument to auth or cache to make it harder for a different app to use your tokens

Switch to jose?

Since it'll handle more algorithms

resp_check_status() should parse WWW-Authenticate header

When status is 401 or 403. See examples in https://datatracker.ietf.org/doc/html/rfc6750#section-3
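
A rough sketch of what parsing such a header might involve, in base R. parse_www_authenticate() is a hypothetical helper; it assumes a single challenge whose parameters are all quoted strings, as in the RFC 6750 examples.

```r
# Hypothetical helper: split a challenge into its scheme and key="value" parameters
parse_www_authenticate <- function(header) {
  # The scheme is the first whitespace-delimited token, e.g. "Bearer"
  scheme <- sub("^\\s*(\\S+).*$", "\\1", header)
  # Collect every key="value" pair
  pairs <- regmatches(header, gregexpr('[A-Za-z_]+="[^"]*"', header))[[1]]
  keys <- sub("=.*$", "", pairs)
  values <- sub('^[^=]+="(.*)"$', "\\1", pairs)
  list(scheme = scheme, params = as.list(stats::setNames(values, keys)))
}

challenge <- parse_www_authenticate(
  'Bearer realm="example", error="invalid_token", error_description="The access token expired"'
)
challenge$params$error_description
```

The error and error_description values are the pieces that could be surfaced in the R error message.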

Check can add multiple headers

Whether it's an option or default behaviour

Read from stream

Provide timeout, optionally inf
Callback function + chunk size
Callback function has way to gracefully terminate
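
The three points above can be sketched against a plain R connection. read_stream() is a hypothetical function, and a raw connection stands in for the HTTP body; the callback stops the loop by returning anything other than TRUE.

```r
# Hypothetical helper: feed chunks to a callback until the stream ends,
# the timeout passes, or the callback asks to stop
read_stream <- function(con, callback, chunk_size = 1024, timeout = Inf) {
  start <- Sys.time()
  repeat {
    elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
    if (elapsed > timeout) break          # timeout reached (Inf means never)
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0) break         # end of stream
    if (!isTRUE(callback(chunk))) break   # callback asked to stop
  }
  invisible(NULL)
}

con <- rawConnection(as.raw(1:10))
seen <- 0
read_stream(con, function(chunk) {
  seen <<- seen + length(chunk)
  seen < 8  # keep going only until at least 8 bytes have arrived
}, chunk_size = 4)
close(con)
```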

Multi-VERB partials

In wrapper packages, it's common to create thin wrappers around several httr::VERB()s, and use a common, e.g., base URL, user agent, token-injecting strategy, etc. Think about the best pattern for facilitating this. Could possibly extend to other API-wide aspects, like batching, retries, throttling. Make it easier to propagate shared data and policies.
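
With an explicit request object, one plausible pattern is a single function that returns a partially configured request. github_request() and the user agent string are hypothetical; the httr2 functions are the ones shown in the README above plus req_url_path_append() and req_user_agent().

```r
library(httr2)

# Hypothetical wrapper: base URL, user agent, and retry policy live in one place
github_request <- function(...) {
  request("https://api.github.com") |>
    req_url_path_append(...) |>
    req_user_agent("mypackage (https://example.com/mypackage)") |>
    req_retry(max_tries = 3)
}

# Individual endpoints only add what is specific to them
topics_req <- github_request("repos", "r-lib", "httr2", "topics")
replace_topics_req <- topics_req |> req_method("PUT")
```

Neither request is sent until it is passed to req_perform(), so the shared settings can be inspected or tested on their own.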

Add obfuscated support to url params and body data

Support for OAuth 2.0 device code flow

Quick question, does httr support this natively? The idea is that you contact the endpoint. Then it sends back a link, which you open in your browser and enter a code which you get from a device. Once that's done, your app gets the token.

The IETF draft is here: https://tools.ietf.org/html/draft-ietf-oauth-device-flow-07

And an outline of how it works for Azure AD is here: https://joonasw.net/view/device-code-flow

HTTP state handler / HTTP event handler

Discussion/Idea

For the interpretation of robots.txt files it turns out that a lot of the interpretation depends on the status of the HTTP request: server error, client error, returning a robots.txt file or some other format, ...

I put some effort into designing a callback / event handler system. It will do, but it's anything but elegant.

The question is, if handling different states/events and status messages is a broad enough problem to maybe get handled in httr(2) in a better thought through, consistent and robust manner.

States and events I have got to handle (just to give an idea of the problem space):

404
client error other than 404
5xx (aka server error)
mime type is not "text/plain"
response data looks like HTML, XML or JSON (aka it does not look like plain text)
redirects without domain change
redirect to subdomain www
redirects with domain change

Pluggable auth

See https://www.python-httpx.org/advanced/#customizing-authentication for inspiration

Will need to be flexible enough to power the various OAuth flows.

Automatically remove old tokens

I think it makes sense to have some default expiration policy, so that tokens that haven't been used for x days (maybe default to 30?) are automatically deleted.
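
A minimal sketch of such a policy in base R. prune_old_files() is a hypothetical helper, and it uses the file modification time as a stand-in for "last used", which assumes the cache touches a token file whenever it is read.

```r
# Hypothetical helper: delete cached files not modified within max_age_days
prune_old_files <- function(dir, max_age_days = 30, now = Sys.time()) {
  files <- list.files(dir, full.names = TRUE)
  age_days <- as.numeric(difftime(now, file.mtime(files), units = "days"))
  stale <- files[age_days > max_age_days]
  unlink(stale)
  invisible(stale)
}

# Demonstration in a temporary directory
cache <- tempfile("tokens")
dir.create(cache)
old <- file.path(cache, "old.rds")
new <- file.path(cache, "new.rds")
file.create(old, new)
Sys.setFileTime(old, Sys.time() - 60 * 24 * 3600)  # pretend it is 60 days old

prune_old_files(cache)
remaining <- list.files(cache)
```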

Multiple request interface

List of requests + optional list of paths — retrieve in parallel.

Redesign OAuth

Break down into smaller pieces since there's so much variation across sites
Copy token caching from gargle/rtweet
Provide more flows

Implement JWT + OAuth

rfc7523

JWT for client authentication (as used by azure)
JWT flow (as used by google)

Extract query parameters

I spent a few minutes trying out httr2. Great work with the new oauth system. I never really managed to use oauth with httr but it only took me two minutes to make it work in httr2 😄

I want to extract the URL query parameters of a request and wondered whether there is an easier way than the following

library(httr2)
req <- request("https://example.com/path/to/page?name=ferret&color=purple")

req_queries <- url_parse(req$url)$query
req_queries$name
#> [1] "ferret"

Created on 2021-07-29 by the reprex package (v2.0.0)

OAuth as a new package?

@jennybc and I discussed this just a bit in DM's - she suggested opening an issue here.

The major feature crul does not have that httr does have is OAuth. It's not a common use case in scientific web resources, so there's not a big push for me to support it. However, it would be great if there was an easy way for crul users to incorporate OAuth with a separate package.

There are only two higher-level http libraries now, but surely more http libraries will come along in the future, and maybe there'll even be httr and httr2 co-existing.

Breaking out OAuth into a separate library that could be integrated into any http library seems to make sense. In python there's https://github.com/oauthlib/oauthlib and ruby has https://github.com/oauth-xx/oauth2 (I may be missing some other libs that are more widely used)

From what Jenny tells me, OAuth in httr is pretty integrated into the package, so maybe it isn't that easy to do?

API wrapping vignette

Basics of core function: request, possibly template.
How to improve error handling with req_error()
Auth (including basic OAuth discussion)

Return mocked responses

With httr we can hook into requests via set_callback, which I use to make webmockr and vcr work. Can there be something similar here?

cc @maelle

Think about post-hoc retrieval of errorful responses

i.e. from last_error()$resp; need to make sure that oauth_flow_abort() also includes

Consider redacting Authorization header.

secret_has_key()

Which you can use to disable vignette blocks etc.

Consider moving endpoints to individual flows

And auth to the client. Also add key to client as in #33, and then think about whether auth_params could just be .... Then the app object could go away.

# old
client <- oauth_client(
  "732991327087-cdl705uujluehert8a47rhr0umetg5ut.apps.googleusercontent.com",
  obfuscated("RmOi9uaoCNx8o6XmVEMU9A_fopiN5-iQ")
)
app <- oauth_app(client, endpoints = c(
  authorization = "https://accounts.google.com/o/oauth2/auth",
  token = "https://accounts.google.com/o/oauth2/token",
  device_authorization = "https://oauth2.googleapis.com/device/code"
))

oauth_flow_auth_code(app, scope = "https://www.googleapis.com/auth/userinfo.email")
oauth_flow_device(app, scope = "https://www.googleapis.com/auth/userinfo.email")

# new
client <- oauth_client(
  "732991327087-cdl705uujluehert8a47rhr0umetg5ut.apps.googleusercontent.com",
  obfuscated("RmOi9uaoCNx8o6XmVEMU9A_fopiN5-iQ")
)
oauth_flow_auth_code(client, 
  auth_url = "https://accounts.google.com/o/oauth2/auth",
  token_url = "https://accounts.google.com/o/oauth2/token",
  scope = "https://www.googleapis.com/auth/userinfo.email"
)
oauth_flow_device(client, 
  auth_url = "https://oauth2.googleapis.com/device/code",
  token_url = "https://accounts.google.com/o/oauth2/token",
  scope = "https://www.googleapis.com/auth/userinfo.email"
)

That makes it much more clear what urls each flow requires, and would simplify oauth_flow_check_app().

If you're providing multiple auth flows for a single API, you'd need to avoid repeating the token URLs in some other way.

req_auth_basic() should use askpass

Redact `api_key` in query params

Since it seems like a common way to authenticate

req_dry_run()
print.httr2_request()
print.httr2_response()
req_verbose()

obfuscate_key <- local({
  key <- as.raw(...)
  function() key
})
attr(unobfuscate_key, "srcref") <- "function(x) {}"
