
sbtools's Introduction

ScienceBase R Tools

Tools for interfacing R with ScienceBase data services.

Package Description

This package provides a rich interface to USGS’s ScienceBase, a data cataloging and collaborative data management platform. For further information, see the sbtools manuscript in The R Journal (USGS IP-075498). See citation('sbtools') for how to cite the package.

Recommended Citation:

  Winslow, LA, S Chamberlain, AP Appling, and JS Read. 2016. sbtools: 
  A package connecting R to cloud-based data for collaborative online 
  research. The R Journal 8:387-398.
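
That citation entry can also be printed from an R session once the package is installed:

citation("sbtools")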

Package source code DOI: https://doi.org/10.5066/P912NGFV


Package Installation

To install the sbtools package, you need R 3.0 or greater. Run the following command:

install.packages("sbtools")

To get cutting-edge changes, install from GitHub using the remotes package:

remotes::install_github("DOI-USGS/sbtools")

Reporting bugs

Please consider reporting bugs and asking questions on the Issues page:

https://github.com/DOI-USGS/sbtools/issues

Release Procedure

For release of the sbtools package, a number of steps are required.

  1. Ensure all checks pass and code coverage is adequate.
  2. Ensure NEWS.md reflects updates in version.
  3. Update DESCRIPTION to reflect release version.
  4. Convert DISCLAIMER.md to approved language and rebuild README.Rmd.
  5. Create release candidate branch and commit release candidate.
  6. Build source package and upload to CRAN.
  7. Once accepted to CRAN, tag the release candidate branch and push to repositories.
  8. Change DISCLAIMER.md back to development mode and increment the DESCRIPTION version.
  9. Merge release candidate and commit.
  10. Open PR/MR in development state.

Disclaimer

This software is preliminary or provisional and is subject to revision. It is being provided to meet the need for timely best science. The software has not received final approval by the U.S. Geological Survey (USGS). No warranty, expressed or implied, is made by the USGS or the U.S. Government as to the functionality of the software and related material nor shall the fact of release constitute any such warranty. The software is provided on the condition that neither the USGS nor the U.S. Government shall be held liable for any damages resulting from the authorized or unauthorized use of the software.

License: CC0

sbtools's People

Contributors

aappling-usgs, amart90, dblodgett-usgs, hcorson-dosch-usgs, jesse-ross, jiwalker-usgs, katrinleinweber, lawinslow, ldecicco-usgs, olivroy, sckott, wdwatkins


sbtools's Issues

query_item_identifier fails to return an existing item

I think query_item_identifier is supplying a new, incorrect session when the session arg is not explicitly set. This may be because the check below passes when session is simply left at its default, even though the default for session is current_session():

if (missing(session) || is.null(session)) {
    session = handle(pkg.env$url_base)
}

So there might be two problems: (1) the session gets reset to handle(pkg.env$url_base) even when you think you're choosing the default arg value, and (2) the value that session gets set to is different from the value returned by current_session(). See:

> httr::handle(sbtools:::pkg.env$url_base)
# Host: https://www.sciencebase.gov/catalog/ <0x0000000011f31b10>
> current_session()
# Host: https://www.sciencebase.gov/catalog/ <0x0000000011cdbc00>

Not sure why that different session w/ same URL is tripping things up, but this turns into a problem for me in continental stream metabolism, where I created this item:

# find the project root ("Continental Stream Metabolism" folder)
true_sites_root <- sbtools::query_item_identifier(scheme="mda_streams", type="project_root", key="uber")$id
project_root <- sbtools::item_get_parent(true_sites_root)

# create a sandbox sites folder to work with from here on
sites_root <- sbtools::item_create(parent_id = project_root, title="Sites_dev")
sbtools::item_update_identifier(id=sites_root, scheme="mda_streams_dev", type="sites_root", key="uber") # true sites root currently has type="project_root". i find that confusing.
sites_root_saved <- sites_root
> sites_root_saved
# [1] "55568a6fe4b0a92fa7e9cf2d"

Then I tried to find it again, but got an empty data.frame:

sites_root <- sbtools::query_item_identifier(scheme="mda_streams_dev", type="sites_root", key="uber")
> sites_root
# data frame with 0 columns and 0 rows

But if I explicitly declare session=current_session(), it works.

sites_root <- sbtools::query_item_identifier(scheme="mda_streams_dev", type="sites_root", key="uber", session=current_session())
> sites_root
#       title                       id
#1 Sites_dev 55568a6fe4b0a92fa7e9cf2d
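
A minimal sketch of the fix this seems to point to, assuming the intent is for the default to match current_session(): fall back to the live session instead of building a fresh handle when the caller omits the argument.

if (missing(session) || is.null(session)) {
  # reuse the authenticated session rather than constructing a new handle
  session <- current_session()
}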

add a get_parent function?

Does this fit in sbtools?

A quick way to do it is this:

get_parent <- function(item_id) {
  # look up the item's parentId (requires the jsonlite package)
  url <- sprintf("https://www.sciencebase.gov/catalog/item/%s?format=json&fields=parentId", item_id)
  parent_id <- jsonlite::fromJSON(txt = url)$parentId

  # then fetch the parent item's title
  url <- sprintf("https://www.sciencebase.gov/catalog/item/%s?format=json&fields=title", parent_id)
  parent_title <- jsonlite::fromJSON(txt = url)$title
  return(parent_title)
}

But we would request the item JSON twice per id. Any better way to do this?
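
There may not be a way to avoid the second request, since the parent's title lives on the parent item, but one small refinement (a sketch with illustrative helper names, assuming jsonlite is available) is to factor the repeated URL building into a single helper:

# sb_field() and get_parent_title() are illustrative names, not existing functions
sb_field <- function(item_id, field) {
  url <- sprintf("https://www.sciencebase.gov/catalog/item/%s?format=json&fields=%s", item_id, field)
  jsonlite::fromJSON(txt = url)[[field]]
}

get_parent_title <- function(item_id) {
  # still two requests per id: one for parentId, one for the parent's title
  sb_field(sb_field(item_id, "parentId"), "title")
}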

get smarter about using httr::content()

From httr::content:
When using content() in a package, DO NOT use as = "parsed". Instead, check that the mime type is what you expect, and then parse it yourself. This is safer, as you will fail informatively if the API changes, and you will protect yourself against changes to httr.

See r-lib/httr#246

Currently, as = "parsed" usage is speckled throughout the package.
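
A minimal sketch of the pattern the httr documentation recommends (the URL below is just an illustrative ScienceBase request, not a specific call from the package):

resp <- httr::GET("https://www.sciencebase.gov/catalog/item/5060b03ae4b00fc20c4f3c8b?format=json")
# fail informatively if the API stops returning JSON
if (httr::http_type(resp) != "application/json") {
  stop("ScienceBase did not return JSON", call. = FALSE)
}
# parse the raw text ourselves instead of relying on as = "parsed"
item <- jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))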

ncdf

I saw mention of ncdf in the proposal and had a comment (I don't work with NetCDF formats much, so I'm kinda clueless here):

I hear from Roy Mendelssohn at NOAA that the ncdf package http://cran.r-project.org/web/packages/ncdf/index.html only works with the older NetCDF format. True? But this pkg is nice b/c it installs on all OSes.

Roy said to instead use ncdf4 http://cran.r-project.org/web/packages/ncdf4/index.html - but there are no Windows binaries, and it sounds as though there never will be.

Thoughts on this? I ask b/c if we need ncdf4 functionality, that could lead to a problem for Windows users who aren't super savvy (i.e., who couldn't install from source, etc.).

doc on query_item_id wrong for no match return

The documentation says it returns NULL when no matching item is found, but it actually returns an empty data frame:

query_item_identifier(scheme = 'mda_streams', type= NULL, key = sites, session, limit = 10000)
data frame with 0 columns and 0 rows

example on item_list_children returns d.f of NAs

item_list_children('5060b03ae4b00fc20c4f3c8b')
   id
1  NA
2  NA
3  NA
4  NA
5  NA
6  NA
7  NA
8  NA
9  NA
10 NA
11 NA
12 NA
13 NA
14 NA
15 NA
16 NA
17 NA
18 NA
19 NA
20 NA
There were 40 warnings (use warnings() to see them)

File Download feedback

from @dblodgett-usgs

It would be nice if your file download function returned the paths to the files, or at least the file names it downloads, instead of TRUE, but that's minor.

We could return a list of all files downloaded with their paths, regardless of whether the user supplied file names or the names were generated. This would be more useful than TRUE.
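
A rough sketch of that idea, built on httr (the function and argument names here are illustrative, not the package's API):

download_files <- function(urls, destinations) {
  stopifnot(length(urls) == length(destinations))
  for (i in seq_along(urls)) {
    # write each file to disk as it is fetched
    httr::GET(urls[i], httr::write_disk(destinations[i], overwrite = TRUE))
  }
  # return the local paths instead of TRUE
  destinations
}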

improve auth usability

Might take a look at what we are doing with hazardItems auth.

After auth works, we set pkg.env token and username, and do a token check when session is needed.
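
A minimal sketch of that pattern, with illustrative names (the real package internals may differ):

# package-level environment holding auth state
pkg_auth <- new.env()

set_auth <- function(username, token) {
  pkg_auth$username <- username
  pkg_auth$token <- token
  invisible(TRUE)
}

check_auth <- function() {
  # token check performed whenever a session is needed
  if (is.null(pkg_auth$token)) {
    stop("Not authenticated; please log in first.", call. = FALSE)
  }
  list(username = pkg_auth$username, token = pkg_auth$token)
}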

Some initial feedback

  • curl options - I'd suggest using ... or similar to allow curl options to be passed
    into GET/POST/etc. (a rough sketch follows after this list).
  • authentication: I haven't dug into this, but I imagine passing the session
    to each function call makes more sense if there are multiple potential
    accounts a user could have; if a user will only ever have one, then perhaps
    the auth function could be run once by the user and then used internally
    within function calls so the user doesn't have to worry about it.

I'm looking through this more, but these are the first two things...
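
Regarding the first bullet, a rough sketch of passing curl options through dots (sb_get is an illustrative wrapper, not an existing function):

sb_get <- function(url, ...) {
  # forward ... so callers can supply httr/curl config such as timeouts or verbose()
  httr::GET(url, ...)
}

# e.g. sb_get("https://www.sciencebase.gov/catalog/item/5060b03ae4b00fc20c4f3c8b?format=json", httr::verbose())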

wrap RCurl error for outdated token

Make it clearer to users that their session is out of date when this error appears:

Error in RCurl::curlPerform(curl = handle$handle, .opts = curl_opts$values) : 
  Stale CURL handle being passed to libcurl 
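
One way to do that, sketched with a simple tryCatch wrapper (the function name and message are illustrative):

with_session_error <- function(expr) {
  tryCatch(expr, error = function(e) {
    # translate the low-level stale-handle error into a friendlier message
    if (grepl("Stale CURL handle", conditionMessage(e), fixed = TRUE)) {
      stop("Your ScienceBase session appears to be out of date; please re-authenticate.",
           call. = FALSE)
    }
    stop(e)
  })
}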

... applies to multiple internal GET/PUT calls in item_update_identifier

As I'm adding dots arguments today to pass to curl, this is the first case I've seen where the same dots have to be passed to multiple functions: within item_update_identifier, the dots go to both query_item_identifier and to sbtools_PUT. I'm introducing this oddity because I don't know how else to handle the dots. Should we accept two different well-formed, single-item config lists rather than allowing the user to pass in the info in standard curl dots format?
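
For illustration only, a stripped-down sketch of the situation described; item_url() and identifier_body() are placeholder helpers, and the bodies below are not the actual package internals:

item_update_identifier <- function(sb_id, scheme, type, key, ..., session = current_session()) {
  # the same dots are forwarded to both internal calls
  existing <- query_item_identifier(scheme = scheme, type = type, key = key, ..., session = session)
  # ... existing identifiers might be inspected here ...
  sbtools_PUT(item_url(sb_id), identifier_body(scheme, type, key), ..., session = session)
}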

New disclaimer

We need to adjust the language of the disclaimer on GRAN to:

.onAttach <- function(libname, pkgname) {
  packageStartupMessage("This information is preliminary or provisional and is subject to revision. It is being provided to meet the need for timely best science. The information has not received final approval by the U.S. Geological Survey (USGS) and is provided on the condition that neither the USGS nor the U.S. Government shall be held liable for any damages resulting from the authorized or unauthorized use of the information.")
}

item_append_files wipes out identifier

library(sbtools)

myFolder <- "550057b9e4b02419550fa5f7"
session <- authenticate_sb(username = "xxx")

folderID <- item_create(myFolder, 
                        title="Test Workflow",
                        session=session)

fileStuff <- item_append_files(folderID,
                               files = "fluxBiasMulti.pdf",
                               session = session
)

x <- item_update_identifier(folderID, 'test', 'workflow', "Unique thing", session )
# This wipes out the identifier:
fileStuff <- item_append_files(folderID, files = "multiPlotDataOverview.pdf", session = session)

I can add:

x <- item_update_identifier(folderID, 'test', 'workflow', "Unique thing", session )

after each item_append_files call, but sometimes I get an error that there already is an identifier; presumably adding a lag might prevent that.
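
A sketch of that workaround (the two-second pause is an arbitrary guess, not a tested value):

fileStuff <- item_append_files(folderID, files = "multiPlotDataOverview.pdf", session = session)
Sys.sleep(2)  # brief lag before touching identifiers again
x <- item_update_identifier(folderID, 'test', 'workflow', "Unique thing", session)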
