
tibhannover / bacdiver


Unofficial R client for the DSMZ's Bacterial Diversity Metadatabase (former contact: @katrinleinweber). https://api.bacdive.dsmz.de/client_examples seems to be the official alternative.

Home Page: https://TIBHannover.GitHub.io/BacDiveR/

License: MIT License

Languages: R 98.05%, Makefile 1.95%

Topics: r, microorganism, bacterial-database, bacteriology, webservice-client, microbiology, biobank, r-package, rstats, bacterial-samples

bacdiver's People

Contributors

axel-klinger, katrinleinweber


bacdiver's Issues

Write management plans

https://figshare.com/articles/Managing_Research_Software_Development_better_software_better_research/5930662 p24f & http://www.software.ac.uk/software-management-plans

  • What software will you write?
  • What will your software do?
  • Will your software have a name?
  • Who are the intended users of your software?
  • Is it for one type of user or for many?
  • What expertise is required?
  • How will you make your software available?
  • How will your software contribute to research, and how will you measure its contribution?

Compare temp data to https://zenodo.org/record/1175609

  • check whether that dataset has a different source
    • partially from BacDive => reproduce the results
  • write a vignette about extracting growth temperatures from that dataset and through BacDiveR, and then mention @mengqvist
    • parse his dataset and try to retrieve the same species from BacDive

split retrieve_IDs off from retrieve_data()

Extract this into a separate function? If yes, by scraping IDs from the paged URL returns (as in the official examples), or by storing the URLs as an intermediate result and providing helper functions to narrow that result down to the IDs?

Or, implement it as an internal loop-back in retrieve_data(…, searchType = "taxon"), based on a new parameter taxon_data = TRUE? (A rough sketch of the separate-function option follows after the list below.)

  • ask whether only the taxon search can return multiple IDs
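A minimal sketch of the separate-function option, assuming the paged response exposes next and results fields with per-dataset URLs (the helper and those field names are assumptions, not the package's current API; authentication is omitted for brevity):

```r
library(jsonlite)

# Sketch only: page through a search URL and collect the dataset IDs.
# The response fields ("next", "results", "url") are assumptions about
# the paged BacDive API, not verified against it.
retrieve_IDs <- function(search_url) {
  ids <- character(0)
  url <- search_url
  while (!is.null(url)) {
    page <- fromJSON(url, simplifyVector = FALSE)
    ids  <- c(ids, vapply(page$results,
                          function(r) sub(".*/(\\d+)/?$", "\\1", r$url),
                          character(1)))
    url  <- page[["next"]]  # NULL on the last page ends the loop
  }
  ids
}
```

retrieve_data() could then either call this internally or leave the looping to the user, e.g. lapply(retrieve_IDs(url), retrieve_data).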

Remove invalid \n in JSON

While implementing #31 and switching from rjson to jsonlite, I noticed that some fields contain insufficiently escaped \n characters. This results in a lexical error: invalid character inside string.

@ceb15: Please consider ensuring that those are escaped as \\n already in BacDive or (I presume) during JSON serialisation.

(screenshot of the resulting parse error, 2018-03-20)

I'll parse them away for now.
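Until that is fixed upstream, a possible client-side workaround is to strip the stray newlines before parsing; a minimal sketch, assuming the raw JSON is already available as a string:

```r
library(jsonlite)

# Workaround sketch: replace literal newlines in the raw JSON with spaces
# before parsing, so jsonlite no longer hits "lexical error: invalid
# character inside string". Whitespace between JSON tokens is harmless,
# and literal control characters inside strings are invalid anyway.
parse_bacdive_json <- function(raw_json) {
  cleaned <- gsub("\n", " ", raw_json, fixed = TRUE)
  fromJSON(cleaned)
}
```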

Make taxon search more prominent?

Assuming the vast majority ("90%") of BacDive users look up data about a strain, bacdive_id as the default search type may not be that useful.

Maybe rather a retrieve_taxon_data("…", filter_by = c("property_A", "prop_B", "C")) function?
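A rough sketch of how such a wrapper could look, assuming it simply delegates to the existing retrieve_data(…, searchType = "taxon") and that filter_by names top-level sections of each dataset (both are assumptions, not current behaviour):

```r
# Illustrative wrapper only: delegate to the existing taxon search and
# optionally keep just the requested top-level sections of each dataset.
retrieve_taxon_data <- function(taxon, filter_by = NULL) {
  datasets <- retrieve_data(taxon, searchType = "taxon")
  if (is.null(filter_by)) return(datasets)
  lapply(datasets, function(d) d[intersect(filter_by, names(d))])
}
```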

randomise test searches

https://bacdive.dsmz.de/api/bacdive/example uses some specific search terms. If the automated tests use these as well, the DSMZ's internal statistics about popular datasets might be skewed. Maybe they already are, and this is accounted for by the DSMZ.

  • ask whether any such statistics are collected

The test search terms could be randomised to avoid this problem: sample(seq(100000, 999999), size = 1) for IDs, acc <- paste(sample(LETTERS, size = 2), collapse = "") for accession prefixes, paste("DSM", round(int / 1000)) for DSM numbers, or similar (see the sketch after this list).

  • ask for max ranges

This would spread out the "popularity" inflation, but might require fine-tuning the seq ranges. Plus, it would assume continuous numbering on their end.

  • ask whether this is the case
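A sketch of such randomised test inputs, building on the snippets above (the ranges and formats are placeholders until the questions above are answered):

```r
# Randomised test search terms; the numeric ranges and formats are
# placeholders until the real ID / accession ranges are confirmed.
random_bacdive_id <- function() sample(seq(100000, 999999), size = 1)

random_accession <- function() {
  acc <- paste(sample(LETTERS, size = 2), collapse = "")
  paste0(acc, " ", sample(seq(1, 999999), size = 1))
}

random_DSM_number <- function() {
  int <- sample(seq(100000, 999999), size = 1)
  paste("DSM", round(int / 1000))
}
```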

aggregate datasets into useful structure before returning

noticed while working on #16

retrieve_data() currently appends multiple downloads into one continuous list in which the individual datasets can't be addressed anymore. We need a data structure that lets the user $-address the datasets and their fields. Ideally, each dataset is referred to by its bacdive_id as the index. Something like a sparse list-of-lists?!?

ideas:

  • aggregate the JSON strings in a character vector, then rjson::fromJSON() them "in place", or in some way that creates the nested lists below / as lower hierarchies of that vector
  • write out each dataset to a file (as a kind of local cache), then maybe concatenate the files and re-import them as a useful data structure
  • use jsonlite to create one data frame per bacdive_id, then add those to a list (see the sketch below)
  • keep c()ombining downloads, but aggregate them into a higher-level list and use an apply variant to extract a field/element from the resulting "megastructure"
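For the jsonlite idea, a minimal sketch of aggregating parsed downloads into a named list keyed by ID, so that datasets stay $-addressable (the bacdive_id field in the parsed structure is an assumption):

```r
library(jsonlite)

# Sketch: parse each downloaded JSON string and store the results in a
# named list keyed by its ID. The "bacdive_id" field name is assumed.
aggregate_datasets <- function(json_strings) {
  parsed <- lapply(json_strings, fromJSON)
  ids <- vapply(parsed, function(d) as.character(d$bacdive_id), character(1))
  setNames(parsed, ids)
}

# Usage sketch (ID and section name illustrative):
# results <- aggregate_datasets(downloads)
# results[["717"]]$taxonomy
```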
