brendadb's People

Contributors

jwokaty, nturaga, y1zhou

brendadb's Issues

Remove column when all NAs

Most of the fieldInfo columns produced by ParseGeneric() contain only NAs. Removing these all-NA columns would reduce the size of the brenda.query object and shouldn't slow parsing noticeably (it's already pretty slow).
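
A minimal base-R sketch of the idea, assuming the parsed table behaves like an ordinary data frame. The `field.info` table and the `DropAllNaColumns()` helper are made up for illustration and are not part of the package:

```r
# Toy stand-in for a parsed table; column names are illustrative only.
field.info <- data.frame(
  description = c("value A", "value B"),
  fieldInfo   = c(NA_character_, NA_character_),
  commentary  = c(NA, "some comment"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only columns with at least one non-NA value.
DropAllNaColumns <- function(df) {
  keep <- vapply(df, function(col) !all(is.na(col)), logical(1))
  df[keep]
}

slim <- DropAllNaColumns(field.info)
names(slim)
```

Checking each column with `all(is.na(col))` also works for list columns, which a matrix-based check such as `colSums()` may not handle.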

Not all UniProt IDs are parsed

Some UniProt IDs in the text file don't follow the standard regex [OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}, and are not detected as of v0.2.3. Some example cases are:

  • #6# Thauera aromatica o87873 UniProt <6>
  • #79# Mimosa pudica AB600997 UniProt <125>
  • #188# Mus musculus Q9wtl4 SwissProt <407>
  • #59# Candida versatilis A0A14OJW76 UniProt <80>
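
One possible approach, sketched under assumptions: capture whatever token precedes the `UniProt`/`SwissProt` keyword, then flag separately whether it fits the standard pattern once case is ignored. The variable names here are illustrative, and this is not how the package currently parses the lines:

```r
# The four example lines from the flat file.
entries <- c(
  "#6# Thauera aromatica o87873 UniProt <6>",
  "#79# Mimosa pudica AB600997 UniProt <125>",
  "#188# Mus musculus Q9wtl4 SwissProt <407>",
  "#59# Candida versatilis A0A14OJW76 UniProt <80>"
)

# Capture the alphanumeric token directly before the database keyword.
token.pattern <- "([A-Za-z0-9]+)\\s+(?:UniProt|SwissProt)\\b"
matches <- regmatches(entries, regexec(token.pattern, entries, perl = TRUE))
accession <- vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)

# Standard accession regex from above, anchored and applied to upper-cased tokens.
standard <- "^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$"
is.standard <- grepl(standard, toupper(accession))

data.frame(accession, is.standard)
```

With this split, the lower-case entries pass after upper-casing, while the other two are still captured but flagged as non-standard so they can be reviewed rather than silently dropped.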

Remove empty fields in brenda.query result

Is your feature request related to a problem? Please describe.
Current queries take a long time because all fields are constructed in the return object even if they are not part of the desired query.

Describe the solution you'd like
Skip fields that shouldn't be queried.

Describe alternatives you've considered
At least provide an option to remove these fields to reduce the memory taken.

Additional context
None.

BiocycPathwayGenes() could break when there are multiple Ensembl IDs

Describe the bug

Some gene symbols correspond to multiple Ensembl IDs, and tibble raises an error in this scenario because the columns end up with different lengths.

To Reproduce

Steps to reproduce the behavior:

  1. Run brendaDb::BiocycPathwayGenes(pathway = "TRYPTOPHAN-DEGRADATION-1")
  2. See error:
Found 15 genes in HUMAN pathway TRYPTOPHAN-DEGRADATION-1.
Error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 15: Columns `BiocycGene`, `BiocycProtein`, `Symbol`
* Length 17: Column `Ensembl`

Expected behavior

Either duplicate values in the other columns, or concatenate Ensembl IDs corresponding to the same gene into one entry.
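
Both options can be sketched in base R. The gene symbols and Ensembl IDs below are placeholders, and the `mapping` table is an assumed shape for the symbol-to-Ensembl lookup, not the function's real intermediate data:

```r
# Placeholder symbols; GENE_B maps to two Ensembl IDs.
symbols <- c("GENE_A", "GENE_B", "GENE_C")
mapping <- data.frame(
  Symbol  = c("GENE_A", "GENE_B", "GENE_B", "GENE_C"),
  Ensembl = c("ENSG_1", "ENSG_2", "ENSG_3", "ENSG_4"),
  stringsAsFactors = FALSE
)

# Option 1: duplicate the other columns, one row per Ensembl ID.
duplicated.rows <- merge(
  data.frame(Symbol = symbols, stringsAsFactors = FALSE),
  mapping,
  by = "Symbol"
)

# Option 2: concatenate the Ensembl IDs of each gene into one entry.
collapsed <- tapply(mapping$Ensembl, mapping$Symbol, paste, collapse = ";")
one.row.per.gene <- data.frame(
  Symbol  = symbols,
  Ensembl = as.vector(collapsed[symbols]),
  stringsAsFactors = FALSE
)
```

Option 2 keeps one row per gene, so the column lengths match again; option 1 is easier to join against other Ensembl-keyed tables.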

Problems with QueryBrenda()

I am getting problems when using a list of ECs in QueryBrenda().

It appears that all EC numbers other than 1.1.1.1 are being removed as "invalid EC number(s)".

I manually checked on BRENDA whether those numbers are actually invalid, but that is not the case (e.g. 1.1.1.262)...

Any inputs on how to solve this issue?

Handle non-standard EC numbers

Some edge cases exist for EC numbers in the text file:

  • Extra parentheses: 1.1.1.286 ()
  • Transferred EC numbers: 1.1.1.109 (transferred to EC 1.3.1.28)
    • Could transfer to multiple: 1.1.1.5 (transferred to EC 1.1.1.303 and EC 1.1.1.304)
  • Deleted entries: 1.1.1.89 (deleted, included in EC 1.1.1.86)
    • Could be longer descriptions: 1.1.1.293 (deleted. This enzyme was already in the Enzyme List as EC 1.1.1.206, tropine dehydrogenase so EC 1.1.1.293 has been withdrawn at the public-review stage.)
    • Or just deleted: 6.1.1.8 (deleted)
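
The cases above could be separated with a small parser along these lines. This is a sketch only: `ParseEcEntry()` is a hypothetical helper, and the status labels are an assumed convention rather than anything the package defines:

```r
ec.pattern <- "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+"

# Hypothetical helper: split an entry into EC number, status, and related ECs.
ParseEcEntry <- function(entry) {
  all.ecs <- regmatches(entry, gregexpr(ec.pattern, entry))[[1]]
  # Everything after the leading EC number is treated as the comment.
  comment <- trimws(sub(paste0("^[[:space:]]*", ec.pattern), "", entry))
  status <- if (grepl("^\\(transferred", comment)) {
    "transferred"
  } else if (grepl("^\\(deleted", comment)) {
    "deleted"
  } else {
    "active"  # includes a plain number and the empty "()" case
  }
  list(
    ec      = all.ecs[1],
    status  = status,
    related = setdiff(all.ecs[-1], all.ecs[1])  # other ECs named in the comment
  )
}

entries <- c(
  "1.1.1.286 ()",
  "1.1.1.109 (transferred to EC 1.3.1.28)",
  "1.1.1.5 (transferred to EC 1.1.1.303 and EC 1.1.1.304)",
  "1.1.1.89 (deleted, included in EC 1.1.1.86)",
  "6.1.1.8 (deleted)"
)
parsed <- lapply(entries, ParseEcEntry)
statuses <- vapply(parsed, function(p) p$status, character(1))
```

Using `setdiff()` for the related numbers avoids listing the entry's own EC number again when a longer description repeats it.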

Reference title and PubMed ID in ExtractField

Is your feature request related to a problem? Please describe.
Only RefIDs are given in the ExtractField() function.

Describe the solution you'd like
Get reference titles and PubMed IDs as well in the returned table.

Describe alternatives you've considered
A separate function (e.g. FetchReference()) that operates on the returned table of ExtractField() and gets the information. This should be doable since all we need is the EC number and the reference ID.
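
A rough sketch of what such a function could look like, assuming both tables are plain data frames keyed by EC number and reference ID. All column names and values here are hypothetical placeholders; the real tables may be shaped differently (for example, with list columns for reference IDs):

```r
# Hypothetical output of ExtractField(): one reference ID per row.
extracted <- data.frame(
  ec          = c("1.1.1.1", "1.1.1.1"),
  description = c("entry one", "entry two"),
  refID       = c("1", "3"),
  stringsAsFactors = FALSE
)

# Hypothetical bibliography table with placeholder titles and PubMed IDs.
bibliography <- data.frame(
  ec     = rep("1.1.1.1", 3),
  refID  = c("1", "2", "3"),
  title  = c("Title one", "Title two", "Title three"),
  pubmed = c("PMID_1", NA, "PMID_3"),
  stringsAsFactors = FALSE
)

# Left join: keep every extracted row, add title and PubMed ID where known.
FetchReference <- function(extracted, bibliography) {
  merge(extracted, bibliography, by = c("ec", "refID"), all.x = TRUE, sort = FALSE)
}

annotated <- FetchReference(extracted, bibliography)
```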

Additional context
NA.

Query for BioCyc pathways

The goal is to input a BioCyc pathway ID (e.g. SERSYN-PWY for serine biosynthesis (phosphorylated route)), and return all the enzymes in that pathway, as well as the brenda.query results.

Removing duplicates in nomenclature$protein$description

The description column in the protein table has duplicated organisms because the UniProt IDs weren't removed from the description column. For example:

df <- ReadBrenda(system.file("extdata", "brenda_download_test.txt",
                             package = "brendaDb"))
x <- QueryBrenda(df, EC = "1.1.1.1")
x$nomenclature$protein[order(x$nomenclature$protein$description), ]

# A tibble: 164 x 5
   proteinID description                     uniprot commentary refID    
   <list>    <chr>                           <chr>   <chr>      <list>   
 1 <chr [1]> Acetobacter pasteurianus        NA      NA         <chr [1]>
 2 <chr [1]> Acinetobacter calcoaceticus     NA      NA         <chr [1]>
 3 <chr [1]> Aeropyrum pernix                NA      NA         <chr [2]>
 4 <chr [1]> Aeropyrum pernix Q9Y9P9 UniProt Q9Y9P9  NA         <chr [3]>
 5 <chr [1]> Alligator mississippiensis      NA      NA         <chr [1]>
 6 <chr [1]> Anastrepha fraterculus          NA      NA         <chr [1]>
 7 <chr [1]> Anastrepha obliqua              NA      NA         <chr [1]>
 8 <chr [1]> Arabidopsis thaliana            NA      NA         <chr [1]>
 9 <chr [1]> Aspergillus nidulans            NA      NA         <chr [1]>
10 <chr [1]> Avena sativa                    NA      NA         <chr [1]>
# … with 154 more rows

Rows 3 and 4 are proteins from the same organism.
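
A minimal sketch of the cleanup, assuming the accession and database keyword always appear as the last two tokens of the description (the `SwissProt` keyword is included because it also occurs in the flat file):

```r
description <- c("Aeropyrum pernix", "Aeropyrum pernix Q9Y9P9 UniProt")

# Strip a trailing "<accession> UniProt" or "<accession> SwissProt" suffix.
organism <- sub(
  "[[:space:]]+[A-Za-z0-9]+[[:space:]]+(UniProt|SwissProt)[[:space:]]*$",
  "",
  description
)

unique(organism)
```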

Option to remove empty tables

Two possible implementations:

  • Keep current structure for all brenda.query objects, but ignore the NA tables when printing
  • Remove empty tables when a new argument is set, e.g. simplify.res = TRUE
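
The second option could look roughly like this, assuming a query result is a nested list in which an empty table is stored as a single NA. That structure, the toy `query` object, and the helper names `IsEmpty()` and `DropEmpty()` are assumptions for illustration:

```r
# Hypothetical test for an "empty" entry: NULL, zero-length, or a lone NA.
IsEmpty <- function(x) {
  is.null(x) || length(x) == 0 ||
    (is.atomic(x) && length(x) == 1 && is.na(x))
}

# Recursively drop empty entries from a nested list; data frames are kept whole.
DropEmpty <- function(x) {
  if (is.list(x) && !is.data.frame(x)) {
    x <- lapply(x, DropEmpty)
    x[!vapply(x, IsEmpty, logical(1))]
  } else {
    x
  }
}

# Toy stand-in for a query result.
query <- list(
  nomenclature = list(ec = "1.1.1.1", protein = data.frame(id = 1)),
  parameters   = list(km.value = NA, ph.optimum = NA)
)
simplified <- DropEmpty(query)
names(simplified)
```

A group whose tables are all empty becomes a zero-length list and is then dropped as well, so only populated branches remain.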

Incorrect compound names in the description of inhibitors and activating compounds

Sorry, this is my first time writing an issue...

Describe the bug

  1. Content inside curly braces is missing from the compound name in the output.
    For example, the original brenda_download file contains:
    2-Butyl-4-[(2,2-dimethyl-1-methylcarbamoyl-propylamino)-hydroxy-methyl]-6-{4'-[(N-methyl-aminooxy)-methyl]-biphenyl-4-yl}-hexanoic acid
    but the query output from the package is:
    2-Butyl-4-[(2 2-dimethyl-1-methylcarbamoyl-propylamino)-hydroxy-methyl]-6--hexanoic acid
    The content inside the braces is deleted (and the comma in "2,2" becomes a space).

To Reproduce

Steps to reproduce the behavior:

library(brendaDb)
brenda.filepath <- DownloadBrenda()
df <- ReadBrenda(brenda.filepath)
res <- QueryBrenda(df, EC = "3.4.24.17", organisms = "Mus musculus")
View(res$`3.4.24.17`$interactions$inhibitors)

Expected behavior

One of the compound names in the description column should be:
2-Butyl-4-[(2,2-dimethyl-1-methylcarbamoyl-propylamino)-hydroxy-methyl]-6-{4'-[(N-methyl-aminooxy)-methyl]-biphenyl-4-yl}-hexanoic acid
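
The truncated name is what a substitution that removes every `{...}` group would produce. Whether the package actually does this is an assumption; the sketch below only reproduces the symptom and shows one way to restrict the removal to a trailing `{...}` field. `StripTrailingBraces()` is a hypothetical helper:

```r
name <- "2-Butyl-4-[(2,2-dimethyl-1-methylcarbamoyl-propylamino)-hydroxy-methyl]-6-{4'-[(N-methyl-aminooxy)-methyl]-biphenyl-4-yl}-hexanoic acid"

# Removing every {...} group reproduces the "-6--hexanoic acid" truncation.
stripped.all <- gsub("\\{[^{}]*\\}", "", name, perl = TRUE)

# Hypothetical alternative: only remove a {...} group at the very end of an entry.
StripTrailingBraces <- function(x) {
  sub("[[:space:]]*\\{[^{}]*\\}[[:space:]]*$", "", x, perl = TRUE)
}

kept    <- StripTrailingBraces(name)                       # name is left intact
trimmed <- StripTrailingBraces("compound X {annotation}")  # trailing field removed
```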

Session Info

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.10.0
LAPACK: /opt/miniconda3/lib/libmkl_intel_lp64.so.1

locale:
 [1] LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=zh_CN.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=zh_CN.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] brendaDb_1.7.0  stringr_1.4.0   KEGGREST_1.33.0 reticulate_1.20 biomaRt_2.49.2 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7             lattice_0.20-44        tidyr_1.1.3            prettyunits_1.1.1      png_0.1-7              Biostrings_2.61.1     
 [7] assertthat_0.2.1       digest_0.6.27          utf8_1.2.1             BiocFileCache_2.1.1    R6_2.5.0               GenomeInfoDb_1.29.3   
[13] stats4_4.1.0           evaluate_0.14          RSQLite_2.2.7          httr_1.4.2             pillar_1.6.1           zlibbioc_1.39.0       
[19] rlang_0.4.11           progress_1.2.2         curl_4.3.2             rstudioapi_0.13        blob_1.2.1             S4Vectors_0.31.0      
[25] Matrix_1.3-4           rmarkdown_2.9          BiocParallel_1.27.2    RCurl_1.98-1.3         bit_4.0.4              xfun_0.24             
[31] compiler_4.1.0         pkgconfig_2.0.3        BiocGenerics_0.39.1    htmltools_0.5.1.1      tidyselect_1.1.1       tibble_3.1.2          
[37] GenomeInfoDbData_1.2.6 IRanges_2.27.0         XML_3.99-0.6           fansi_0.5.0            crayon_1.4.1           dplyr_1.0.7           
[43] dbplyr_2.1.1           bitops_1.0-7           rappdirs_0.3.3         grid_4.1.0             jsonlite_1.7.2         lifecycle_1.0.0       
[49] DBI_1.1.1              magrittr_2.0.1         cli_3.0.1              stringi_1.7.3          cachem_1.0.5           XVector_0.33.0        
[55] xml2_1.3.2             ellipsis_0.3.2         filelock_1.0.2         generics_0.1.0         vctrs_0.3.8            tools_4.1.0           
[61] bit64_4.0.5            Biobase_2.53.0         glue_1.4.2             purrr_0.3.4            hms_1.1.0              yaml_2.2.1            
[67] parallel_4.1.0         fastmap_1.1.0          AnnotationDbi_1.55.1   memoise_2.0.0          knitr_1.33     

R 3.6.0?

BiocManager::install("brendaDb", dependencies=TRUE)
Bioconductor version 3.9 (BiocManager 1.30.10), R 3.6.0 (2019-04-26)
Installing package(s) 'brendaDb'

Warning message:
"package ‘brendaDb’ is not available (for R version 3.6.0) "

Can this be resolved? Your package appears to do exactly what I need.

Query for specific tables

Currently the QueryBrenda function returns all possible fields in the table; often we only want information from a certain subset of the fields, e.g. the optimal pH of the enzyme(s).
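
One way to approach this is to filter the rows by field before any parsing happens. The sketch assumes a long-format table with one row per flat-file entry; the column names, the toy rows, and the `SelectFields()` helper are illustrative and not the package's actual interface:

```r
# Hypothetical long-format table: one row per entry, tagged with its field.
brenda <- data.frame(
  ID          = c("1.1.1.1", "1.1.1.1", "1.1.1.1"),
  field       = c("PH_OPTIMUM", "KM_VALUE", "PH_OPTIMUM"),
  description = c("entry 1", "entry 2", "entry 3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the rows belonging to the requested fields.
SelectFields <- function(df, fields) {
  df[df$field %in% fields, , drop = FALSE]
}

ph.only <- SelectFields(brenda, "PH_OPTIMUM")
```

Filtering first means the slower per-field parsing only runs on the rows that were asked for.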

Query for specific organisms

Join other tables with the nomenclature.protein table to get the organism, and also the bibliography.reference table to get the references.
