First I want to say thank you for this package, I'm working on some metagenomic data w

Converting retrieve_data() results to a data frame (tibble) about bacdiver HOT 7 OPEN

jfy133 commented on June 25, 2024

Converting retrieve_data() results to a data frame (tibble)

from bacdiver.

Comments (7)

katrinleinweber commented on June 25, 2024

Cool, thank you for this hint! Having thought about the output format in #28 & #31 a bit already, I'm happy to collect more voices / votes on this, or review a PR to make this output the default.


katrinleinweber commented on June 25, 2024

If that NA problem in the reference dataframe (and possibly others) can be solved, that is.

Is your grouping into = c("bacdive_id", "section", "subsection", "field", "key") very specific to your application or data analysis? Or do you consider it general?


jfy133 commented on June 25, 2024

I think the reference metadata can be fixed when converting to a table (based on a condition of the object in the cell before un-nesting), but I personally don't need that information at the moment so I didn't invest time in solving it.

The grouping was selected based on the names as defined in the description of various example search outputs (e.g. https://bacdive.dsmz.de/api/bacdive/bacdive_id/1/) that I checked. I also tried providing extra columns for separate() to spread over, but I never needed more than 4 metadata columns (after the bacdive_id).

I have only done fuzzy taxon name searches though (e.g. search term "Fusobacterium"). I'm not familiar with the rest of the database, so I don't know if any other metadata can appear.

But in terms of votes, I personally always prefer easily accessible 'tidy' data ;).

Edit: the only issue with converting to a tibble using the above code is that it can sometimes take a while if you have many bacdive IDs. I don't know whether speed optimisation is important for this package, but if so, one would maybe have to switch away from tidyverse functions (and convert to a tibble only after unnesting and separating).
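As a rough illustration of the "switch away from tidyverse functions" idea, here is a minimal base-R sketch. It runs on a small hand-made list (`toy_raw`) whose shape is only assumed to resemble what retrieve_data() returns; the object names and values are hypothetical, and nothing here has been timed against real BacDive results.

```r
# Hypothetical stand-in for a retrieve_data() result: a list named by
# bacdive ID, nested section -> subsection -> field (shape is an assumption).
toy_raw <- list(
  "2654" = list(
    taxonomy_name = list(
      strains = list(genus = "Fusobacterium", species = "Fusobacterium varium")
    ),
    references = list(ID_reference = c(626L, 20215L))
  )
)

# Flatten everything; unlist() joins the nested names with "."
flat <- unlist(toy_raw)

# Split each name on the literal "." and force every result to length 4.
# Shorter names are padded with NA; longer ones would be truncated.
parts <- lapply(strsplit(names(flat), ".", fixed = TRUE), function(p) {
  length(p) <- 4
  p
})

# Bind the pieces into one data frame; values are character after unlist()
toy_df <- data.frame(do.call(rbind, parts),
                     value = unname(flat),
                     stringsAsFactors = FALSE)
names(toy_df) <- c("bacdive_id", "section", "subsection", "field", "value")

print(toy_df)
```

The result can still be handed to tibble::as_tibble() at the very end if a tibble is wanted. Note that this sketch would mis-split any field name that itself contains a ".".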


katrinleinweber commented on June 25, 2024

Thanks for the additional info :-) Speed is indeed a consideration, but in all my measurements so far, BacDive's server was the bottleneck. Until they speed it up, I wouldn't be worried about something like your above %>%-line example ;-)

Looking into these NAs, I find that for example the ID_reference field appears in several nesting "depths":

> str(data_bacdive_raw[["2654"]][["strain_availability"]][["strain_history"]])
'data.frame':	1 obs. of  2 variables:
 $ history     : chr "<- ATCC <- L.DS. Smith, VPI 2488 <- H. Beerens, PCL"
 $ ID_reference: int 626

> str(data_bacdive_raw[["2654"]][["references"]])
'data.frame':	3 obs. of  2 variables:
 $ ID_reference: int  626 20215 20218
 $ reference   : chr  "Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295" "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria "| __truncated__ "Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for"| __truncated__

This causes a "left-/up-ward shift/creep" of the NAs in the tibble:

Do you mean this with "converting to a table (based on a condition of the object in the cell before un-nesting)"?
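A small sketch of why the differing nesting depths produce these NAs, assuming the flattened names look like the two hypothetical strings below: tidyr's separate() fills missing pieces with NA on the right, so a three-part name leaves the last column empty.

```r
library(tidyr)

# Two flattened names of different depth (made-up examples modelled on the
# str() output above; the numeric suffix is what unlist() appends)
toy_names <- data.frame(
  grouped_category = c(
    "2654.strain_availability.strain_history.ID_reference",
    "2654.references.ID_reference1"
  ),
  stringsAsFactors = FALSE
)

# fill = "right" pads short rows with NA on the right without a warning
toy_sep <- separate(
  toy_names,
  grouped_category,
  into = c("bacdive_id", "section", "subsection", "field"),
  sep = "\\.",
  fill = "right"
)

print(toy_sep)
```

The second row ends up with "ID_reference1" in subsection and NA in field, which is the shift described above.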


jfy133 commented on June 25, 2024

Indeed - for an average search the server is still the slowest part, taking longer than the 'table-isation' itself.

Yes, screenshot 2 is exactly what I mean.

I realise now I shouldn't have used the term 'unnesting' as that isn't what I actually meant. I actually meant that the

separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field", "key")) %>%

could be conditional e.g. if the second field of the unlisted string grouped_category matches "references" (like in lines 15-17), this could be separated across just c("bacdive_id", "section", "field").

This would at least match the description here: https://bacdive.dsmz.de/api/bacdive/bacdive_id/2654/.


jfy133 commented on June 25, 2024

I just realised the 'key' column is leftover from testing (before I renamed the columns to the bacdive categories). Only lines 15-17 are the issue. Thus this should have the correct columns and also include the condition for correcting the references lines:

## load the packages the pipeline below relies on
library(dplyr)
library(tidyr)

## get some search results
data_bacdive_raw <- BacDiveR::retrieve_data("Fusobacterium", searchType = "taxon")

## original pipe for converting list of lists to tibble
data_bacdive_tib <- data_bacdive_raw %>% 
  unlist() %>% 
  bind_rows() %>% 
  gather(grouped_category, value, 1:ncol(.)) %>%
  separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field"))

## shows faulty reference column incorrectly putting field in subsection
data_bacdive_tib %>% filter(is.na(field))

#># A tibble: 144 x 5
#>   bacdive_id section    subsection   field value                                                                                                                           
#>   <chr>      <chr>      <chr>        <chr> <chr>                                                                                                                           
#> 1 2654       references ID_referenc… NA    626                                                                                                                             
#> 2 2654       references ID_referenc… NA    20215                                                                                                                           
#> 3 2654       references ID_referenc… NA    20218                                                                                                                           
#> 4 2654       references reference1   NA    Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295               
#> 5 2654       references reference2   NA    "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea, va…
#> 6 2654       references reference3   NA    Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganisms.…
#> 7 5758       references ID_referenc… NA    9019                                                                                                                            
#> 8 5758       references ID_referenc… NA    20215                                                                                                                           
#> 9 5758       references ID_referenc… NA    20218                                                                                                                           
#>10 5758       references reference1   NA    Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699   
#> # ... with 134 more rows


## now fix the references field
data_bacdive_tib_fixed <- data_bacdive_tib %>% 
  mutate(field = if_else(section == "references", subsection, field),
         subsection = if_else(section == "references", NA_character_, subsection))

## to show ID_references now correctly not in subsection

data_bacdive_tib_fixed %>% filter(is.na(field))

#> # A tibble: 0 x 5
#> # ... with 5 variables: bacdive_id <chr>, section <chr>, subsection <chr>, field <chr>, value <chr>

data_bacdive_tib_fixed %>% filter(is.na(subsection))

## shows ID_references now correctly in field
#># A tibble: 144 x 5
#>   bacdive_id section    subsection field      value                                                                                                                        
#>   <chr>      <chr>      <chr>      <chr>      <chr>                                                                                                                        
#> 1 2654       references NA         ID_refere… 626                                                                                                                          
#> 2 2654       references NA         ID_refere… 20215                                                                                                                        
#> 3 2654       references NA         ID_refere… 20218                                                                                                                        
#> 4 2654       references NA         reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295            
#> 5 2654       references NA         reference2 "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea,…
#> 6 2654       references NA         reference3 Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganis…
#> 7 5758       references NA         ID_refere… 9019                                                                                                                         
#> 8 5758       references NA         ID_refere… 20215                                                                                                                        
#> 9 5758       references NA         ID_refere… 20218                                                                                                                        
#>10 5758       references NA         reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699           
#># ... with 134 more rows

Apologies for the confusion. I should've put in my original message the caveat: written after dealing with a teething baby all day, may not make 100% sense


katrinleinweber commented on June 25, 2024

Note to self: https://github.com/ropensci/roadoi#whats-returned may be a useful example to check, also their list-column use.

