npsutils's Introduction

Lifecycle: experimental

NPS Data Store Utilities - NPSutils

This package is a collection of functions to acquire metadata and data from the National Park Service Data Store. Please request enhancements and bug fixes through Issues.

These functions are under active development; we apologize for any that are broken.

Install from GitHub

# install.packages("devtools")
devtools::install_github("nationalparkservice/NPSutils")

NPSutils is also available as part of the NPSdataverse

# install.packages("devtools")
devtools::install_github("nationalparkservice/NPSdataverse")

npsutils's People

Contributors

juddpatterson, roblbaker, wright13


npsutils's Issues

add 'force' option to run get_data_packages

Problem: when scripting, functions that require user input, display prompts, or print feedback to the console can cause issues.

Add a "force" option that turns all of this off.

A 'verbose' option would turn off only the console feedback but still require user input for questions such as whether a data package should be re-downloaded, overwriting the existing files (if they exist).

This feature would parallel a lot of the 'force' options implemented in EMLeditor.
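
A minimal sketch of how force and verbose parameters might gate the prompts and console output; the signature shown here is an assumption for illustration, not the current implementation:

# hypothetical sketch: force skips prompts, verbose controls console chatter
get_data_packages <- function(reference_id, force = FALSE, verbose = TRUE) {
  destination <- file.path("data", "raw", reference_id)
  if (dir.exists(destination) && !force) {
    # only ask when not forced
    answer <- readline("Data package already exists. Overwrite? (y/n): ")
    if (tolower(answer) != "y") {
      return(invisible(destination))
    }
  }
  if (verbose) {
    message("Downloading reference ", reference_id, " to ", destination)
  }
  # ... download and unzip as the existing function does ...
  invisible(destination)
}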

getDataPackage "Secure = TRUE" bug

The getDataPackage() function with Secure = TRUE is supposed to use data services available to NPS internal staff only. The function fails when the reference metadata is set to Internal (e.g. https://irma.nps.gov/DataStore/Reference/Profile/2287257 as of 10/18/2021).

Digging deeper there is a call to the public REST (https://irmaservices.nps.gov/datastore/v4/rest/reference/) that needs to be replaced with a call to the internal REST (https://irmaservices.nps.gov/datastore-secure/v4/rest/reference/). I thought this was a quick fix, but then the rest of the function fails. It appears that using httr::content(httr::GET(RestHoldingInfoURL)) to call the internal REST results in a 401. The same URL in my browser is fine, so this appears to be an issue with httr not passing internal credentials. Some deeper digging will be required to understand options.
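
One possibility is that httr is not sending Windows credentials along with the request. A minimal sketch, assuming the secure service accepts NTLM (Windows) authentication; whether it actually does is an untested assumption:

# untested sketch: ask curl to pass the current Windows credentials (NTLM);
# RestHoldingInfoURL is the secure REST URL already built by the function
response <- httr::GET(RestHoldingInfoURL, httr::authenticate(":", ":", type = "ntlm"))
if (httr::status_code(response) == 401) {
  stop("Still unauthorized; confirm you are on the NPS network/VPN.")
}
holding_info <- httr::content(response)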

Sarah comments

Note: some of these comments may be redundant with existing issues

  • Functions need to be exported
  • If secure connection to IRMA fails, remind user to get on NPS network
  • What's the scope of this package? Right now it's essentially data pkg utils but I can think of LOTS of functions that could fall under the "NPS utils" umbrella.
  • get.DataPackage
    • Should accept destination dir as an argument (default to here::here("data", ReferenceID))
    • How do we expect data package holdings to be structured? Consider searching holding info for zip file with "datapackage" in the name so it doesn't break if other files are present
    • Need to troubleshoot secure option
  • getUnitCode
    • What does "xmlns: URI NRPC.IrmaServices.Rest.Unit is not absolute" mean? If this is useful info, consider providing some context, otherwise, omit it?
  • get.parkCode
    • Is this useful? Only returns National Parks (not nat'l monuments, etc).
  • The functions in getParkUnitInfo seem redundant - can we just use get.unitInfo?
    • If we want to keep getUnitCode, maybe just have it return a named vector where names are unit names and values are unit codes?
      • If we do this, consider adding getUnitName() function to go the other direction (code -> unit name)
  • get.unitInfo
    • Default LifeCycle to "Active"
    • Consider using match.arg to validate some args (see the sketch after this list)
    • Consider modifying to accept vectors of length > 1 as filters (e.g. Code = c("JOTR", "DEVA"))
  • get.RefInfo
    • Is there a reason this subset of reference info was chosen to be made available (as opposed to things like copyright, files/links, bibliography, etc)?
    • Would it be better to just return all ref info as a (tidied) list instead of specifying the field?
  • load.dataPackage
    • What if we load metadata inside this function and use that to set dataframe column types? I have some code that does this. It's probably worth returning the metadata as well.
    • I recommend accepting the data package dir as an argument instead of holding ID. It would be useful to have get.DataPackage return the dir it wrote the data package to so that users can call get.DataPackage %>% load.dataPackage
  • load.dataPackageList
    • What's the intended use of this fxn?
  • validate.datapackage
    • Should this live here or in DPchecker?
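
Following up on the match.arg item above, a minimal sketch; the function name and the LifeCycle values other than "Active" are placeholders, not the real DataStore vocabulary:

# hypothetical sketch of validating a LifeCycle argument with match.arg()
get_unit_info <- function(life_cycle = c("Active", "Inactive", "Pending")) {
  life_cycle <- match.arg(life_cycle)  # errors informatively on anything else
  # ... query and filter unit info here ...
  life_cycle
}

get_unit_info()            # "Active" (the default)
get_unit_info("Inactive")  # "Inactive"
get_unit_info("oops")      # Error: 'arg' should be one of "Active", "Inactive", "Pending"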

getDataPackage: alert if data have been versioned

Data packages will often have newer versions. Users are alerted to this on the DataStore reference landing page. However, if the data are directly downloaded using getDataPackages(), the user is not informed:

  1. Alert user if a newer version of the data package exists
  2. Inform user of newer data package DS reference ID
  3. Ask whether the newest version should be used instead.

loadDataPackage() refactoring

This function loads in both the metadata and the data file. A couple things could be improved:

  1. Use the existing getAttributes function rather than repeating the metadata/attribute code again. That will ensure only one location needs to be edited in the future.
  2. When a CSV is loaded the column names are derived from the XML/metadata file. If the CSV has column names those should probably be loaded directly and then compared to the XML/metadata columns as a congruency test. Right now a mismatch between the data file and metadata (e.g. wrong column order) could go undetected (and introduce an error).
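
A rough sketch of the congruency check described in item 2; "data.csv" and metadata_columns are placeholders:

# compare the CSV's own header to the column names pulled from the metadata
csv_columns <- names(utils::read.csv("data.csv", nrows = 1, check.names = FALSE))
if (!identical(csv_columns, metadata_columns)) {
  warning("Column names or order in the CSV do not match the metadata: ",
          paste(setdiff(csv_columns, metadata_columns), collapse = ", "))
}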

Data package download progress

Any way to show a progress bar when using getDataPackage or loadDataPackage? The underlying download.file function has a quiet = FALSE argument. This may not be important for small files, but I'm currently downloading a 1.5 GB raster data package and some sort of progress indicator would be useful.
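
For reference, a sketch of simply passing that argument through (url and destfile are placeholders):

utils::download.file(url, destfile, mode = "wb", quiet = FALSE)  # prints the built-in progress bar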

@joe-devivo I'm logging these enhancement requests for my own purposes. Waiting on IT to finish my account elevation so I can install a GIT tool and begin tinkering directly. 😎

removeDataPackage function (or equiv. parameter for existing function)

By default getDataPackage() leaves both the zipped and unzipped data packages on the file system of the computer. A cleanup function could be run as needed to avoid cluttering up the file system.

For example, someone could run these in sequence:

  • getDataPackage(1234567)
    -- to download a data package from Data Store
  • loadDataPackage(1234567, dataFormat = "CSV")
    -- to load that data package into a data frame
  • getAttributes(1234567, metadataFormat = "EML")
    -- to load attributes from the metadata into a data frame
  • removeDataPackage(1234567)
    -- to delete the zip and unzipped files since everything needed is now in memory
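
A minimal sketch of what removeDataPackage() might look like, assuming the data/raw/<reference ID> layout shown elsewhere on this page; the location of the zip file is also an assumption:

removeDataPackage <- function(reference_id) {
  package_dir <- file.path("data", "raw", reference_id)
  zip_file <- paste0(package_dir, ".zip")
  # delete the unzipped folder and the downloaded zip, if present
  unlink(package_dir, recursive = TRUE)
  if (file.exists(zip_file)) {
    file.remove(zip_file)
  }
  invisible(NULL)
}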

update getDataPackages(): bad DS ref id

Update getDataPackages() to return a more informative error when a bad or non-existent DataStore reference ID is supplied. Currently:

getDataPackages(1, secure = FALSE)
For 1 An error has occurred.
Please re-run get_data_package() and set secure=TRUE.
Don't forget to log on to the VPN!
[1] "C:/Users/rlbaker/Documents/RDev/NPSutils_devspace"

getDataPackages(1, secure = TRUE)
ERROR: You do not have permissions to access 1.
Try logging on to the NPS VPN before running get_data_package().
Function terminated.
Error in value[3L] :
[1] "C:/Users/rlbaker/Documents/RDev/NPSutils_devspace"
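
A hedged sketch of one way to fail more informatively, assuming the holding-info request has already been made with httr (response and reference_id are placeholders):

if (httr::status_code(response) == 404 || length(httr::content(response)) == 0) {
  stop("DataStore reference ", reference_id, " was not found or returned no holdings. ",
       "Check that the reference ID is correct.", call. = FALSE)
}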

Extract info from EML and put it in a dataframe for PowerBI integration

Power BI has an R-scripting interface for importing data using R. However, it only imports dataframes. This works well for CSV/data files but does not work for EML/metadata.

A function is needed to extract info from metadata and put it into a (series of) dataframe(s) for Power BI import.

A first pass at this function would generate a dataframe that includes:

  • data file name(s)
  • data column names/attribute names
  • attribute data types
  • attribute definitions
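
A hedged sketch of one way to start building that dataframe with the EML package, assuming the metadata contains a single dataTable; the path and the columns selected are illustrative only:

library(EML)
metadata   <- read_eml("path/to/metadata.xml")   # placeholder path
data_table <- metadata$dataset$dataTable         # assumes a single dataTable
attrs      <- get_attributes(data_table$attributeList)$attributes
power_bi_df <- data.frame(
  file_name            = data_table$physical$objectName,
  attribute_name       = attrs$attributeName,
  attribute_definition = attrs$attributeDefinition,
  storage_type         = attrs$storageType
)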

Simple metadata analyses functions

Write a very simple set of functions for first-pass metadata analyses. Leverage common Darwin Core column headers. Things to consider:

  1. How many (unique) species are there in 1-N data packages? (See the sketch after this list.)
  2. How many locations are involved in 1-N data packages (could get trickier with overlapping locations)
  3. How many park units were data collected from in 1-N data packages?
  4. What are the start/end dates of data collection across 1-N data packages?
  5. ???? (open to suggestions)
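
A rough base-R sketch for question 1, assuming the data packages have already been loaded into a list of data frames (package_list is a placeholder) and that each contains a Darwin Core scientificName column:

# count unique species across 1-N loaded data packages
all_species <- unique(unlist(lapply(package_list, `[[`, "scientificName")))
length(all_species)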

Download Errors

Need to catch and properly respond to common download errors that may occur. Currently the getDataPackage() function returns an error when an invalid reference is provided, and an empty folder is left on the file system.
Error: $ operator is invalid for atomic vectors

I have not tested what happens if a reference:

  1. does not have a zip file
  2. has more than one zip file
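
A rough sketch of catching the download failure and cleaning up the folder that would otherwise be left empty (url, destfile, and reference_id are placeholders):

tryCatch(
  utils::download.file(url, destfile, mode = "wb"),
  error = function(e) {
    unlink(dirname(destfile), recursive = TRUE)  # remove the empty folder left behind
    stop("Download failed for reference ", reference_id, ": ", conditionMessage(e),
         call. = FALSE)
  }
)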

prompt user to overwrite data packages

Add a parameter or yes/no input to understand if an existing data package should be overwritten. Right now the download/unzip occurs again even if the file already exists.
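
A minimal sketch of the prompt using utils::askYesNo() (available in R >= 3.5); package_dir is a placeholder for the existing download location:

if (dir.exists(package_dir)) {
  overwrite <- utils::askYesNo("Data package already exists. Overwrite it?")
  if (isTRUE(overwrite)) {
    unlink(package_dir, recursive = TRUE)
    # ... re-download and unzip ...
  } else {
    message("Keeping the existing copy of ", package_dir)
  }
}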

getAttributes fails for complex data packages

getAttributes fails if a data package has more than one csv/xml combo.

Example:

getDataPackage(2285123, Secure = FALSE)
getAttributes(2285123, "csv", "eml")
Error: 'data/raw/2285123/c("HY2019_SFCN_diatoms_TP_metadata.xml", "HY2019_SFCN_vegetation_metadata.xml")' does not exist in current working directory ('C:/Judd/R/DSTools-master/DSTools-master').

The two xml files referenced above do exist in that location, but passing the combined vector of both at once produces the error. The function needs to be improved to handle more than one file, or a parameter should be added so that it only attempts to operate on a single data file.
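
A hedged sketch of handling multiple metadata files by operating on them one at a time; the path follows the data/raw/<reference ID> layout from the error above, and the file-name pattern is an assumption based on the example files:

metadata_files <- list.files(file.path("data", "raw", "2285123"),
                             pattern = "metadata\\.xml$", full.names = TRUE)
# operate on one file at a time instead of passing the whole vector at once
attribute_tables <- lapply(metadata_files, function(f) {
  # ... existing single-file attribute parsing would go here ...
  f
})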

Clear 'raw' folder when redownloading package

The results are messy if a package is downloaded and unzipped (via getDataPackage), the source package changes on Data Store, and getDataPackage() is used again. It would make sense to clear out the 'raw' folder for a particular data package if getDataPackage is run more than once. Be sure to consider this issue in context with #6.

get_data_packages timeout at 60 seconds

The default timeout for get_data_packages() is 60 seconds, which is insufficient for larger files. Increase it to 5 minutes (300 seconds); this should allow individual files of up to about 70 MB to download without problems.
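
The timeout used by download.file() is controlled by the timeout option; a sketch of raising it only for the duration of the download (url and destfile are placeholders):

get_with_longer_timeout <- function(url, destfile) {
  old <- options(timeout = max(300, getOption("timeout")))
  on.exit(options(old), add = TRUE)  # restore the caller's timeout when done
  utils::download.file(url, destfile, mode = "wb")
}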

get_data_package requested enhancements:

  1. Take in a list of data packages rather than just one at a time (sketched after this list).
  2. If secure=TRUE fails, prompt user to get on VPN
  3. After looking for/unzipping files, delete the .zip file to cut down on clutter (and print out that this has occurred)
  4. return the directories where data packages were downloaded to enable piping with load_package functions.
  5. default to saving data packages to working directory; give users options to specify a different directory
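
A minimal sketch for items 1 and 4: accept a vector of reference IDs, loop over them, and return the download directories (the data/raw/<reference ID> layout and the pass-through dots are assumptions):

get_data_packages <- function(reference_ids, ...) {
  dirs <- vapply(reference_ids, function(id) {
    # ... existing single-package download/unzip logic would go here ...
    file.path("data", "raw", id)
  }, character(1))
  invisible(dirs)  # enables piping into load_data_package()
}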

Use metadata to correctly call data types when loading data into a dataframe/tibble

When loading data into a dataframe or tibble, R guesses each column's data type based on the first several lines of data. These guesses are sometimes incorrect, which requires careful checking by data consumers and can lead to downstream analysis errors if left unchecked; incorrect guesses must be corrected manually.

We can use the metadata to correctly assign data types to data when they are loaded into a dataframe or tibble, thus reducing the amount of checking and correcting needed on the part of data consumers as well as reducing errors in data analysis.
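
A hedged sketch, assuming the attribute types have already been extracted from the EML into a named vector of readr column-type abbreviations (attribute_types and "data.csv" are placeholders):

# placeholder mapping built from the metadata, e.g. character, Date, integer
attribute_types <- c(siteID = "c", visitDate = "D", count = "i")
col_spec <- do.call(readr::cols, as.list(attribute_types))
data <- readr::read_csv("data.csv", col_types = col_spec)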

Message upon completion of download/unzip

After a data package has been downloaded and unzipped successfully, I'd suggest a message to the user with the file location. Instead of base R's message(), which will (annoyingly) show up in red, I'd go with rlang::inform(), which prints in black text.
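
For example (package_dir is a placeholder for the download location):

rlang::inform(paste0("Data package downloaded and unzipped to ", package_dir))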

Enable DataStore search from R

Write a function that allows a user to query DataStore directly from R (take advantage of existing DS search tools via API calls).

This will allow the user to pipe the query into the get_data_package and load_data_package functions and ideally facilitate metadata analyses.

Create function to determine metadata format

Right now most of the functions require the user to know the format of the metadata file inside the data package. We should create a function that finds the metadata file automatically and determines its type (FGDC, EML, or ISO). Once this exists it can be tied into some of the existing functions to remove a required parameter.
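
A hedged sketch of sniffing the format from the XML root element with xml2; the mapping from root names to formats is a rough heuristic, not an exhaustive rule:

library(xml2)
detect_metadata_format <- function(xml_path) {
  root <- xml_name(read_xml(xml_path))
  # root element names used as a heuristic for the metadata standard
  switch(root,
         "eml"         = "EML",
         "metadata"    = "FGDC",
         "MD_Metadata" = "ISO",
         "unknown")
}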

loadDataPackage case tolerance

Data and metadata inputs on the loadDataPackage() function are case-sensitive. I would suggest converting the user input to lowercase using tolower() to make this more tolerant.
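
For example, near the top of loadDataPackage() (parameter names follow the examples earlier on this page):

# normalize user input so "CSV", "csv", and "Csv" all work
dataFormat     <- tolower(dataFormat)
metadataFormat <- tolower(metadataFormat)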
