
Comments (7)

katrinleinweber commented on June 25, 2024

Cool, thank you for this hint! Having thought about the output format in #28 & #31 a bit already, I'm happy to collect more voices / votes on this, or review a PR to make this output the default.

from bacdiver.

katrinleinweber commented on June 25, 2024

If that NA problem in the reference dataframe (and possibly others) can be solved, that is.

Is your grouping into = c("bacdive_id", "section", "subsection", "field", "key") very specific to your application or data analysis? Or do you consider it general?


jfy133 commented on June 25, 2024

I think the reference metadata can be fixed when converting to a table (based on a condition of the object in the cell before un-nesting), but I personally don't need that information at the moment so I didn't invest time in solving it.

The grouping was selected based on the names as defined in the description of various example search outputs (e.g. https://bacdive.dsmz.de/api/bacdive/bacdive_id/1/) that I checked. I also tried providing extra columns for separate() to spread over, but I never needed more than 4 metadata columns (after the bacdive_id).

I have only done fuzzy taxon name searches though (e.g. search term "Fusobacterium"), so I'm not familiar with the rest of the database and don't know if any other metadata can appear.

But in terms of votes, I personally always prefer easily accessible 'tidy' data ;).

Edit: the only issue with converting to a tibble with the above code is that it can take a while if you have many bacdive IDs. I don't know whether speed optimisation is important for this package, but if so, one might have to switch away from tidyverse functions (and convert to a tibble only after unnesting and separating).
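For illustration, a base-R version of the split step might look roughly like the sketch below. The toy `flat` vector and the `columns` names are made-up stand-ins that mimic the names produced by unlist(); nothing here has been benchmarked against the tidyverse pipe, so any speed benefit is an untested assumption.

```r
# Hypothetical flattened entries, shaped like the output of unlist() on the raw list
flat <- c(
  "2654.strain_availability.strain_history.history" = "<- ATCC",
  "2654.references.ID_reference1" = "626"
)

# Target columns (assumed; taken from the grouping discussed in this thread)
columns <- c("bacdive_id", "section", "subsection", "field")

# Split each name on literal dots
parts <- strsplit(names(flat), ".", fixed = TRUE)

# Indexing past the end of a vector yields NA, so short names are padded on the
# right; names with more pieces than columns are silently truncated
padded <- lapply(parts, function(p) p[seq_along(columns)])

flat_df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(flat_df) <- columns
flat_df$value <- unname(flat)

print(flat_df)
```

The references row ends up with NA in `field` here, which is the same right-hand fill that separate() produces, so it would still need the correction discussed further down.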


katrinleinweber commented on June 25, 2024

Thanks for the additional info :-) Speed is indeed a consideration, but in all my measurements so far, BacDive's server was the bottleneck. Until they speed it up, I wouldn't be worried about something like your above %>%-line example ;-)

Looking into these NAs, I find that for example the ID_reference field appears in several nesting "depths":

> str(data_bacdive_raw[["2654"]][["strain_availability"]][["strain_history"]])
'data.frame':	1 obs. of  2 variables:
 $ history     : chr "<- ATCC <- L.DS. Smith, VPI 2488 <- H. Beerens, PCL"
 $ ID_reference: int 626

> str(data_bacdive_raw[["2654"]][["references"]])
'data.frame':	3 obs. of  2 variables:
 $ ID_reference: int  626 20215 20218
 $ reference   : chr  "Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295" "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria "| __truncated__ "Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for"| __truncated__

screen shot 2018-11-30 at 21 15 14

This causes a "left-/up-ward shift/creep" of the NAs in the tibble:

screen shot 2018-11-30 at 21 13 21

Do you mean this with "converting to a table (based on a condition of the object in the cell before un-nesting)"?
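The differing depths can be reproduced without hitting the server. The sketch below uses a tiny hand-made list (`entry`, a hypothetical stand-in that is far smaller than a real BacDive record) and counts how many dot-separated pieces each flattened name has.

```r
# Toy stand-in for one raw entry; the structure mirrors the str() output above
entry <- list(
  "2654" = list(
    strain_availability = list(
      strain_history = data.frame(
        history = "<- ATCC",
        ID_reference = 626L,
        stringsAsFactors = FALSE
      )
    ),
    references = data.frame(
      ID_reference = c(626L, 20215L),
      reference = c("ref A", "ref B"),
      stringsAsFactors = FALSE
    )
  )
)

# unlist() joins the nesting levels with dots, so the number of pieces in a
# name equals the depth at which the value was found
flat_names <- names(unlist(entry))
name_depth <- lengths(strsplit(flat_names, ".", fixed = TRUE))

print(data.frame(name = flat_names, depth = name_depth))
```

The names under `references` have one piece fewer than those under `strain_availability`, which is why a fixed four-column separate() leaves their last column empty.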


jfy133 commented on June 25, 2024

Indeed - for an average search the server is still the slowest part, taking longer than the 'table-isation' itself.

Yes, screenshot 2 is exactly what I mean.

I realise now I shouldn't have used the term 'unnesting' as that isn't what I actually meant. I actually meant that the

separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field", "key")) %>%

could be conditional e.g. if the second field of the unlisted string grouped_category matches "references" (like in lines 15-17), this could be separated across just c("bacdive_id", "section", "field").

This would at least match the description here: https://bacdive.dsmz.de/api/bacdive/bacdive_id/2654/.
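One possible reading of that conditional idea is sketched below, assuming dplyr and tidyr are installed. The `long` tibble is invented toy data, and splitting the rows into two groups before separating is just one way to do it, not necessarily how the package should implement it.

```r
library(dplyr)
library(tidyr)

# Toy long-format data with the un-split names still in one column (hypothetical)
long <- tibble(
  grouped_category = c(
    "2654.strain_availability.strain_history.history",
    "2654.references.ID_reference1",
    "2654.references.reference1"
  ),
  value = c("<- ATCC", "626", "ref A")
)

# TRUE where the second dot-separated piece is exactly "references"
is_reference <- grepl("^[^.]+\\.references\\.", long$grouped_category)

# Reference rows only have three pieces, so they skip the subsection column
references <- long[is_reference, ] %>%
  separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "field"))

# Everything else keeps the four-column grouping
others <- long[!is_reference, ] %>%
  separate(grouped_category, sep = "\\.",
           into = c("bacdive_id", "section", "subsection", "field"))

# bind_rows() fills the missing subsection column with NA for the reference rows
combined <- bind_rows(others, references)
print(combined)
```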


jfy133 commented on June 25, 2024

I just realised the 'key' column is left over from testing (before I renamed the columns to the bacdive categories). Only lines 15-17 are the issue. Thus this should have the correct columns and also include the condition for correcting the references lines:

## load the tidyverse verbs used below
library(dplyr)
library(tidyr)

## get some search results
data_bacdive_raw <- BacDiveR::retrieve_data("Fusobacterium", searchType = "taxon")

## original pipe for converting list of lists to tibble
data_bacdive_tib <- data_bacdive_raw %>% 
  unlist() %>% 
  bind_rows() %>% 
  gather(grouped_category, value, 1:ncol(.)) %>%
  separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field"))

## shows faulty reference column incorrectly putting field in subsection
data_bacdive_tib %>% filter(is.na(field))

#># A tibble: 144 x 5
#>   bacdive_id section    subsection   field value                                                                                                                           
#>   <chr>      <chr>      <chr>        <chr> <chr>                                                                                                                           
#> 1 2654       references ID_referenc… NA    626                                                                                                                             
#> 2 2654       references ID_referenc… NA    20215                                                                                                                           
#> 3 2654       references ID_referenc… NA    20218                                                                                                                           
#> 4 2654       references reference1   NA    Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295               
#> 5 2654       references reference2   NA    "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea, va…
#> 6 2654       references reference3   NA    Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganisms.…
#> 7 5758       references ID_referenc… NA    9019                                                                                                                            
#> 8 5758       references ID_referenc… NA    20215                                                                                                                           
#> 9 5758       references ID_referenc… NA    20218                                                                                                                           
#>10 5758       references reference1   NA    Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699   
#> # ... with 134 more rows


## now fix the references field
data_bacdive_tib_fixed <- data_bacdive_tib %>% 
  mutate(field = if_else(section == "references", subsection, field),
         subsection = if_else(section == "references", NA_character_, subsection))

## to show ID_references now correctly not in subsection

data_bacdive_tib_fixed %>% filter(is.na(field))

#> # A tibble: 0 x 5
#> # ... with 5 variables: bacdive_id <chr>, section <chr>, subsection <chr>, field <chr>, value <chr>

data_bacdive_tib_fixed %>% filter(is.na(subsection))

## shows ID_references now correctly in field
#># A tibble: 144 x 5
#>   bacdive_id section    subsection field      value                                                                                                                        
#>   <chr>      <chr>      <chr>      <chr>      <chr>                                                                                                                        
#> 1 2654       references NA         ID_refere… 626                                                                                                                          
#> 2 2654       references NA         ID_refere… 20215                                                                                                                        
#> 3 2654       references NA         ID_refere… 20218                                                                                                                        
#> 4 2654       references NA         reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295            
#> 5 2654       references NA         reference2 "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea,…
#> 6 2654       references NA         reference3 Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganis…
#> 7 5758       references NA         ID_refere… 9019                                                                                                                         
#> 8 5758       references NA         ID_refere… 20215                                                                                                                        
#> 9 5758       references NA         ID_refere… 20218                                                                                                                        
#>10 5758       references NA         reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699           
#># ... with 134 more rows

Apologies for the confusion. I should've put a caveat in my original message: written after dealing with a teething baby all day, may not make 100% sense.


katrinleinweber commented on June 25, 2024

Note to self: https://github.com/ropensci/roadoi#whats-returned may be a useful example to check, also their list-column use.
