Comments (7)
Cool, thank you for this hint! Having thought about the output format in #28 & #31 a bit already, I'm happy to collect more voices / votes on this, or review a PR to make this output the default.
from bacdiver.
If that NA problem in the reference dataframe (and possibly others) can be solved, that is.
Is your grouping into = c("bacdive_id", "section", "subsection", "field", "key") very specific to your application or data analysis? Or do you consider it general?
from bacdiver.
I think the reference metadata can be fixed when converting to a table (based on a condition of the object in the cell before un-nesting), but I personally don't need that information at the moment so I didn't invest time in solving it.
The grouping was selected based on the names as defined in the description of various example search outputs (e.g. https://bacdive.dsmz.de/api/bacdive/bacdive_id/1/) that I checked. I also tried providing extra columns for separate to spread over, but I never needed more than 4 metadata columns (after the bacdive_id).
I have only done fuzzy taxon name searches though (e.g. search term "Fusobacterium"), I'm not familiar with the rest of the database so I don't know if any other metadata can appear.
But in terms of votes, I personally always prefer easily accessible 'tidy' data ;).
Edit: the only issue with converting to a tibble using the above code is that it can sometimes take a while if you have many bacdive IDs. I don't know whether speed optimisation is important for this package, but if so, one might have to switch away from tidyverse functions (and convert to a tibble after unnesting and separating).
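For anyone wanting to try the non-tidyverse route, here is a minimal base R sketch of the split step. split_names() is a hypothetical helper (not part of BacDiveR), the input vector is made up, and it has not been benchmarked, so whether it is actually faster remains an open question.

```r
# Hypothetical helper: split the dotted names produced by unlist() into
# fixed columns using base R only. Missing trailing parts become NA;
# parts beyond length(cols) are silently dropped.
split_names <- function(flat,
                        cols = c("bacdive_id", "section", "subsection", "field")) {
  parts <- strsplit(names(flat), ".", fixed = TRUE)
  padded <- t(vapply(
    parts,
    function(p) c(p, rep(NA_character_, length(cols)))[seq_along(cols)],
    character(length(cols))
  ))
  out <- as.data.frame(padded, stringsAsFactors = FALSE)
  names(out) <- cols
  out$value <- unname(flat)
  out
}

# Made-up input that mimics the shape of unlist(data_bacdive_raw)
flat <- c("1.a.b.c" = "x", "1.references.ID_reference1" = "5")
res <- split_names(flat)
print(res)
```

The result is a plain data.frame, so it could be converted to a tibble only once at the end.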
from bacdiver.
Thanks for the additional info :-) Speed is indeed a consideration, but in all my measurements so far, BacDive's server was the bottleneck. Until they speed it up, I wouldn't be worried about something like your above %>%-line example ;-)
Looking into these NAs, I find that for example the ID_reference field appears in several nesting "depths":
> str(data_bacdive_raw[["2654"]][["strain_availability"]][["strain_history"]])
'data.frame': 1 obs. of 2 variables:
$ history : chr "<- ATCC <- L.DS. Smith, VPI 2488 <- H. Beerens, PCL"
$ ID_reference: int 626
> str(data_bacdive_raw[["2654"]][["references"]])
'data.frame': 3 obs. of 2 variables:
$ ID_reference: int 626 20215 20218
$ reference : chr "Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295" "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria "| __truncated__ "Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for"| __truncated__
This causes a "left-/up-ward shift/creep" of the NAs in the tibble:
Do you mean this with "converting to a table (based on a condition of the object in the cell before un-nesting)"?
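To make the depth difference concrete, here is a small base R sketch. The toy list only imitates the structure of the str() output above (values are shortened, and the real entries contain many more sections), so treat it as an illustration rather than real BacDive data.

```r
# Toy stand-in for one entry; the structure is assumed from the str() output
toy <- list(
  "2654" = list(
    strain_availability = list(
      strain_history = data.frame(history = "<- ATCC", ID_reference = 626L,
                                  stringsAsFactors = FALSE)
    ),
    references = data.frame(
      ID_reference = c(626L, 20215L),
      reference = c("DSM 1295", "Prokaryotic Nomenclature Up-to-date"),
      stringsAsFactors = FALSE
    )
  )
)

flat <- unlist(toy)

# Number of dot-separated parts in each flattened name
depth <- lengths(strsplit(names(flat), ".", fixed = TRUE))
print(data.frame(name = names(flat), depth = depth))
```

The strain_history names have four parts while the references names have only three, so a four-column separate() leaves the last column empty for the references rows.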
from bacdiver.
Indeed - for an average search the server is still the slowest part, taking longer than the 'table-isation' itself.
Yes, screenshot 2 is exactly what I mean.
I realise now I shouldn't have used the term 'unnesting' as that isn't what I actually meant. I actually meant that the separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field", "key")) %>% step could be conditional, e.g. if the second field of the unlisted string grouped_category matches "references" (like in lines 15-17), this could be separated across just c("bacdive_id", "section", "field").
This would at least match the description here: https://bacdive.dsmz.de/api/bacdive/bacdive_id/2654/.
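A rough sketch of what such a conditional separate could look like, assuming dplyr and tidyr are available. The three input rows are made up to mimic the flattened names, and splitting the table in two before separating is just one possible way to express the condition.

```r
library(dplyr)
library(tidyr)

# Made-up long table mimicking the flattened names
long <- tibble(
  grouped_category = c(
    "2654.strain_availability.strain_history.ID_reference",
    "2654.references.ID_reference1",
    "2654.references.reference1"
  ),
  value = c("626", "626", "DSM 1295")
)

# TRUE where the second dot-separated part is "references"
is_ref <- grepl("^[^.]+\\.references\\.", long$grouped_category)

# References rows have three parts, everything else four
refs <- long[is_ref, ] %>%
  separate(grouped_category, sep = "\\.",
           into = c("bacdive_id", "section", "field"))
other <- long[!is_ref, ] %>%
  separate(grouped_category, sep = "\\.",
           into = c("bacdive_id", "section", "subsection", "field"))

# bind_rows() fills the missing subsection column with NA for references
tidy <- bind_rows(other, refs) %>%
  select(bacdive_id, section, subsection, field, value)
print(tidy)
```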
from bacdiver.
I just realised the 'key' column is leftover from testing (before I renamed the columns to the bacdive categories). Only lines 15-17 are the issue. Thus this should have the correct columns and also have the condition for correcting the references lines:
## load the packages the pipe relies on
library(dplyr)
library(tidyr)
## get some search results
data_bacdive_raw <- BacDiveR::retrieve_data("Fusobacterium", searchType = "taxon")
## original pipe for converting list of lists to tibble
data_bacdive_tib <- data_bacdive_raw %>%
  unlist() %>%
  bind_rows() %>%
  gather(grouped_category, value, 1:ncol(.)) %>%
  separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field"))
## shows faulty reference column incorrectly putting field in subsection
data_bacdive_tib %>% filter(is.na(field))
#> # A tibble: 144 x 5
#> bacdive_id section subsection field value
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2654 references ID_referenc… NA 626
#> 2 2654 references ID_referenc… NA 20215
#> 3 2654 references ID_referenc… NA 20218
#> 4 2654 references reference1 NA Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295
#> 5 2654 references reference2 NA "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea, va…
#> 6 2654 references reference3 NA Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganisms.…
#> 7 5758 references ID_referenc… NA 9019
#> 8 5758 references ID_referenc… NA 20215
#> 9 5758 references ID_referenc… NA 20218
#>10 5758 references reference1 NA Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699
#> # ... with 134 more rows
## now fix the references field
data_bacdive_tib_fixed <- data_bacdive_tib %>%
  mutate(field = if_else(section == "references", subsection, field),
         subsection = if_else(section == "references", NA_character_, subsection))
## to show ID_references now correctly not in subsection (query the fixed tibble)
data_bacdive_tib_fixed %>% filter(is.na(field))
#> # A tibble: 0 x 5
#> # ... with 5 variables: bacdive_id <chr>, section <chr>, subsection <chr>, field <chr>, value <chr>
## shows ID_references now correctly in field
data_bacdive_tib_fixed %>% filter(is.na(subsection))
#> # A tibble: 144 x 5
#> bacdive_id section subsection field value
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2654 references NA ID_refere… 626
#> 2 2654 references NA ID_refere… 20215
#> 3 2654 references NA ID_refere… 20218
#> 4 2654 references NA reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295
#> 5 2654 references NA reference2 "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea,…
#> 6 2654 references NA reference3 Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganis…
#> 7 5758 references NA ID_refere… 9019
#> 8 5758 references NA ID_refere… 20215
#> 9 5758 references NA ID_refere… 20218
#>10 5758 references NA reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699
#> # ... with 134 more rows
Apologies for the confusion. I should've put a caveat in my original message: written after dealing with a teething baby all day, may not make 100% sense.
from bacdiver.
Note to self: https://github.com/ropensci/roadoi#whats-returned may be a useful example to check, also their list-column use.
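As a reminder of what a list-column layout could look like for this data, here is a hedged sketch. It is not taken from roadoi or BacDiveR. The IDs come from the output earlier in the thread, while the reference texts are shortened placeholders.

```r
library(dplyr)

# Hypothetical layout: one row per BacDive entry, with the references
# table stored whole in a list-column instead of being flattened.
entries <- tibble(
  bacdive_id = c("2654", "5758"),
  references = list(
    data.frame(ID_reference = c(626L, 20215L, 20218L),
               reference = c("DSM 1295", "Gleim et al.", "Verslyppe et al."),
               stringsAsFactors = FALSE),
    data.frame(ID_reference = c(9019L, 20215L, 20218L),
               reference = c("DSM 20699", "Gleim et al.", "Verslyppe et al."),
               stringsAsFactors = FALSE)
  )
)

# Each cell of the list-column is still an intact data.frame
first_refs <- entries$references[[1]]
print(first_refs)
```

Keeping references nested this way would sidestep the depth mismatch, at the cost of an extra step whenever the references need to be inspected.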
from bacdiver.