y1zhou / brendadb

Load and query the BRENDA database in R.

Home Page: https://bioconductor.org/packages/release/bioc/html/brendaDb.html

License: Other

R 91.52% C++ 8.48%

r database enzyme rpackage brenda hacktoberfest

brendadb's Issues

R 3.6.0?

BiocManager::install("brendaDb", dependencies=TRUE)
Bioconductor version 3.9 (BiocManager 1.30.10), R 3.6.0 (2019-04-26)
Installing package(s) 'brendaDb'

Warning message:
"package ‘brendaDb’ is not available (for R version 3.6.0) "

Can this be resolved? Your package appears to do exactly what I need.

BiocycPathwayGenes() could break when there are multiple Ensembl IDs

Describe the bug

Some gene symbols map to multiple Ensembl IDs. In that case the tibble cannot be built, because the Ensembl column ends up longer than the other columns.

To Reproduce

Steps to reproduce the behavior:

1. Run brendaDb::BiocycPathwayGenes(pathway = "TRYPTOPHAN-DEGRADATION-1")
2. See error:

Found 15 genes in HUMAN pathway TRYPTOPHAN-DEGRADATION-1.
Error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 15: Columns `BiocycGene`, `BiocycProtein`, `Symbol`
* Length 17: Column `Ensembl`

Expected behavior

Either duplicate values in the other columns, or concatenate Ensembl IDs corresponding to the same gene into one entry.
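
The two options above can be sketched in base R. This is only an illustration on made-up data, not the package's actual implementation: the `genes` and `mapping` tables and all identifiers in them are hypothetical stand-ins for what BiocycPathwayGenes() assembles internally.

```r
# Hypothetical inputs: one row per gene, and one row per (symbol, Ensembl ID) pair.
genes <- data.frame(
  BiocycGene = c("BG-1", "BG-2"),
  Symbol     = c("GENE1", "GENE2"),
  stringsAsFactors = FALSE
)
mapping <- data.frame(
  Symbol  = c("GENE1", "GENE2", "GENE2"),
  Ensembl = c("ENSG_A", "ENSG_B", "ENSG_C"),
  stringsAsFactors = FALSE
)

# Option 1: duplicate the other columns, giving one row per Ensembl ID.
expanded <- merge(genes, mapping, by = "Symbol", all.x = TRUE)

# Option 2: concatenate all Ensembl IDs of a symbol into a single entry.
collapsed_ids <- tapply(mapping$Ensembl, mapping$Symbol, paste, collapse = ";")
collapsed <- genes
collapsed$Ensembl <- as.vector(collapsed_ids[collapsed$Symbol])
```

Option 1 keeps every ID in its own cell at the cost of repeated rows; option 2 keeps one row per gene but requires splitting the string again before the IDs can be used.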

Reference title and PubMed ID in ExtractField

Is your feature request related to a problem? Please describe.
The ExtractField() function returns only RefIDs.

Describe the solution you'd like
Get reference titles and PubMed IDs as well in the returned table.

Describe alternatives you've considered
A separate function (e.g. FetchReference()) that operates on the returned table of ExtractField() and gets the information. This should be doable since all we need is the EC number and the reference ID.

Additional context
NA.

Problems with QueryBrenda()

I am getting problems when using a list of ECs in QueryBrenda().

It appears that every EC number other than 1.1.1.1 is dropped as "invalid EC number(s)".

I checked some of them manually on BRENDA and they are in fact valid (e.g. 1.1.1.262).

Any inputs on how to solve this issue?

Removing duplicates in nomenclature$protein$description

The description column in the protein table contains duplicated organisms because the UniProt IDs are not removed from the description. For example:

df <- ReadBrenda(system.file("extdata", "brenda_download_test.txt",
                             package = "brendaDb"))
x <- QueryBrenda(df, EC = "1.1.1.1")
x$nomenclature$protein[order(x$nomenclature$protein$description), ]

# A tibble: 164 x 5
   proteinID description                     uniprot commentary refID    
   <list>    <chr>                           <chr>   <chr>      <list>   
 1 <chr [1]> Acetobacter pasteurianus        NA      NA         <chr [1]>
 2 <chr [1]> Acinetobacter calcoaceticus     NA      NA         <chr [1]>
 3 <chr [1]> Aeropyrum pernix                NA      NA         <chr [2]>
 4 <chr [1]> Aeropyrum pernix Q9Y9P9 UniProt Q9Y9P9  NA         <chr [3]>
 5 <chr [1]> Alligator mississippiensis      NA      NA         <chr [1]>
 6 <chr [1]> Anastrepha fraterculus          NA      NA         <chr [1]>
 7 <chr [1]> Anastrepha obliqua              NA      NA         <chr [1]>
 8 <chr [1]> Arabidopsis thaliana            NA      NA         <chr [1]>
 9 <chr [1]> Aspergillus nidulans            NA      NA         <chr [1]>
10 <chr [1]> Avena sativa                    NA      NA         <chr [1]>
# … with 154 more rows

Rows 3 and 4 are proteins from the same organism.
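
A minimal sketch of one possible cleanup, assuming the trailing accession always has the form "<ID> UniProt" or "<ID> SwissProt" as in the rows shown in this tracker. The variable names are illustrative and this is not the package's parser.

```r
# Standard UniProt accession pattern.
uniprot_regex <- "[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"

description <- c("Aeropyrum pernix", "Aeropyrum pernix Q9Y9P9 UniProt")

# Strip a trailing "<accession> UniProt" or "<accession> SwissProt" suffix.
pattern  <- paste0("[[:space:]]+(", uniprot_regex, ")[[:space:]]+(UniProt|SwissProt)$")
organism <- sub(pattern, "", description)
```

After this step both rows carry the same organism name, so they can be grouped or deduplicated on that column.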

Remove column when all NAs

Most of the fieldInfo columns are NAs in the ParseGeneric() function. Removing the column would reduce the size of the brenda.query object, and won't impact the parsing speed significantly (it's already pretty slow).
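
As a rough sketch of the idea, a helper like the following drops every column that contains only NA values. The function name `drop_all_na_columns` and the example table are hypothetical; ParseGeneric() itself is not shown here.

```r
# Keep only the columns that hold at least one non-NA value.
drop_all_na_columns <- function(df) {
  keep <- vapply(df, function(col) !all(is.na(col)), logical(1))
  df[, keep, drop = FALSE]
}

# Made-up table in which fieldInfo is entirely NA.
tbl <- data.frame(
  description = c("a", "b"),
  fieldInfo   = c(NA, NA),
  commentary  = c(NA, "x"),
  stringsAsFactors = FALSE
)
slim <- drop_all_na_columns(tbl)
```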

Option to remove empty tables

Two possible implementations:

- Keep the current structure for all brenda.query objects, but skip the NA tables when printing.
- Remove empty tables when a new argument is set, e.g. simplify.res = TRUE.
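
The second option could look roughly like the sketch below, which assumes that an empty table is stored as a single NA inside a nested list. The helper `remove_empty_tables` and the field names in the example are hypothetical, and the recursion drops any class attribute on the lists, so a real implementation would need to restore the brenda.query class afterwards.

```r
# Recursively drop entries that are a single NA, then drop lists left empty.
remove_empty_tables <- function(x) {
  if (is.data.frame(x)) return(x)  # real tables are kept untouched
  if (is.list(x)) {
    x <- lapply(x, remove_empty_tables)
    return(Filter(function(el) length(el) > 0, x))
  }
  if (length(x) == 1 && is.na(x)) NULL else x
}

# Made-up query result with two empty tables and one empty section.
res <- list(
  nomenclature = list(ec = "1.1.1.1", synonyms = NA),
  parameters   = list(ph.optimum = data.frame(description = "7.5"),
                      km.value = NA),
  structure    = list(subunits = NA)
)
slim <- remove_empty_tables(res)
```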

Query for specific tables

The QueryBrenda function currently returns all possible fields in the table. Often we only want information from a subset of the fields, e.g. the optimal pH of the enzyme(s).

Missing package vignette

A lot of the text in the readme file could be reused in the package vignette.

Handle non-standard EC numbers

Some edge cases exist for EC numbers in the text file:

- Extra parentheses: 1.1.1.286 ()
- Transferred EC numbers: 1.1.1.109 (transferred to EC 1.3.1.28)
  - Could transfer to multiple: 1.1.1.5 (transferred to EC 1.1.1.303 and EC 1.1.1.304)
- Deleted entries: 1.1.1.89 (deleted, included in EC 1.1.1.86)
  - Could be longer descriptions: 1.1.1.293 (deleted. This enzyme was already in the Enzyme List as EC 1.1.1.206, tropine dehydrogenase so EC 1.1.1.293 has been withdrawn at the public-review stage.)
  - Or just deleted: 6.1.1.8 (deleted)
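
One way to handle these cases is sketched below, assuming every entry starts with a four-part numeric EC number followed by an optional parenthesised note. The function `parse_ec_header` and its output columns are hypothetical and not part of brendaDb.

```r
# Split an EC entry into the EC number, its status, and any EC numbers it refers to.
parse_ec_header <- function(x) {
  ec_regex <- "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+"
  ids  <- regmatches(x, gregexpr(ec_regex, x))
  note <- trimws(sub(paste0("^", ec_regex), "", x))
  status <- ifelse(grepl("^\\(transferred", note), "transferred",
            ifelse(grepl("^\\(deleted", note), "deleted", "active"))
  data.frame(
    ec     = vapply(ids, `[`, character(1), 1),
    status = status,
    # Other EC numbers mentioned in the note, excluding the entry's own number.
    related = vapply(ids, function(id) paste(setdiff(id[-1], id[1]), collapse = ";"),
                     character(1)),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_ec_header(c(
  "1.1.1.286 ()",
  "1.1.1.109 (transferred to EC 1.3.1.28)",
  "1.1.1.5 (transferred to EC 1.1.1.303 and EC 1.1.1.304)",
  "6.1.1.8 (deleted)"
))
```

Entries with empty parentheses are treated as active here, which is a judgment call rather than documented BRENDA behavior.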

Incorrect compound names in the description of inhibitors and activating compounds

Sorry, this is my first time writing an issue...

Describe the bug

Content inside curly braces is missing from the compound name output.
For example, the original brenda_download file contains 2-Butyl-4-[(2,2-dimethyl-1-methylcarbamoyl-propylamino)-hydroxy-methyl]-6-{4'-[(N-methyl-aminooxy)-methyl]-biphenyl-4-yl}-hexanoic acid,
but the query output of the package is:
2-Butyl-4-[(2 2-dimethyl-1-methylcarbamoyl-propylamino)-hydroxy-methyl]-6--hexanoic acid
The content inside the braces has been deleted.

To Reproduce

Steps to reproduce the behavior:

library(brendaDb)
brenda.filepath <- DownloadBrenda()
df <- ReadBrenda(brenda.filepath)
res <- QueryBrenda(df, EC = "3.4.24.17", organisms = "Mus musculus")
View(res$`3.4.24.17`$interactions$inhibitors)

Expected behavior

One of the compound names in the description column should be:
2-Butyl-4-[(2,2-dimethyl-1-methylcarbamoyl-propylamino)-hydroxy-methyl]-6-{4'-[(N-methyl-aminooxy)-methyl]-biphenyl-4-yl}-hexanoic acid

Session Info

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.10.0
LAPACK: /opt/miniconda3/lib/libmkl_intel_lp64.so.1

locale:
 [1] LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=zh_CN.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=zh_CN.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] brendaDb_1.7.0  stringr_1.4.0   KEGGREST_1.33.0 reticulate_1.20 biomaRt_2.49.2 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7             lattice_0.20-44        tidyr_1.1.3            prettyunits_1.1.1      png_0.1-7              Biostrings_2.61.1     
 [7] assertthat_0.2.1       digest_0.6.27          utf8_1.2.1             BiocFileCache_2.1.1    R6_2.5.0               GenomeInfoDb_1.29.3   
[13] stats4_4.1.0           evaluate_0.14          RSQLite_2.2.7          httr_1.4.2             pillar_1.6.1           zlibbioc_1.39.0       
[19] rlang_0.4.11           progress_1.2.2         curl_4.3.2             rstudioapi_0.13        blob_1.2.1             S4Vectors_0.31.0      
[25] Matrix_1.3-4           rmarkdown_2.9          BiocParallel_1.27.2    RCurl_1.98-1.3         bit_4.0.4              xfun_0.24             
[31] compiler_4.1.0         pkgconfig_2.0.3        BiocGenerics_0.39.1    htmltools_0.5.1.1      tidyselect_1.1.1       tibble_3.1.2          
[37] GenomeInfoDbData_1.2.6 IRanges_2.27.0         XML_3.99-0.6           fansi_0.5.0            crayon_1.4.1           dplyr_1.0.7           
[43] dbplyr_2.1.1           bitops_1.0-7           rappdirs_0.3.3         grid_4.1.0             jsonlite_1.7.2         lifecycle_1.0.0       
[49] DBI_1.1.1              magrittr_2.0.1         cli_3.0.1              stringi_1.7.3          cachem_1.0.5           XVector_0.33.0        
[55] xml2_1.3.2             ellipsis_0.3.2         filelock_1.0.2         generics_0.1.0         vctrs_0.3.8            tools_4.1.0           
[61] bit64_4.0.5            Biobase_2.53.0         glue_1.4.2             purrr_0.3.4            hms_1.1.0              yaml_2.2.1            
[67] parallel_4.1.0         fastmap_1.1.0          AnnotationDbi_1.55.1   memoise_2.0.0          knitr_1.33

Not all UniProt IDs are parsed

Some UniProt IDs in the text file don't follow the standard regex [OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}, and are not detected as of v0.2.3. Some example cases are:

#6# Thauera aromatica o87873 UniProt <6>
#79# Mimosa pudica AB600997 UniProt <125>
#188# Mus musculus Q9wtl4 SwissProt <407>
#59# Candida versatilis A0A14OJW76 UniProt <80>
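
A small sketch of how a case-insensitive match would treat these four lines: each entry is upper-cased before the standard regex is applied to whole whitespace-delimited tokens. This is an assumption about one possible fix, not the package's behavior. It would only recover the lower-case accessions; AB600997 and A0A14OJW76 do not fit the UniProt format at all and would still need separate handling.

```r
uniprot_regex <- "[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"

entries <- c(
  "#6# Thauera aromatica o87873 UniProt <6>",
  "#79# Mimosa pudica AB600997 UniProt <125>",
  "#188# Mus musculus Q9wtl4 SwissProt <407>",
  "#59# Candida versatilis A0A14OJW76 UniProt <80>"
)

# Require whitespace (or string boundaries) around the accession so that
# fragments of longer tokens are not picked up.
pattern <- paste0("(^|[[:space:]])(", uniprot_regex, ")([[:space:]]|$)")

upper <- toupper(entries)
m     <- regexpr(pattern, upper)
ids   <- rep(NA_character_, length(entries))
ids[m > 0] <- trimws(regmatches(upper, m))
```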

Describe alternatives you've considered
At least provide an option to remove these fields to reduce the memory taken.

Additional context
None.
