nanxstats / rcpi
Molecular informatics toolkit with integration of bioinformatics and cheminformatics tools for drug discovery
Home Page: https://nanx.me/Rcpi/
License: Artistic License 2.0
Hello,
I have used the convMolFormat function many times with great success. Thank you again for this useful package.
Right now, I am using it on a large batch of input (10,000 mol files) coming from a commercial in-silico prediction tool from Bruker. I want to generate a table of SMILES strings so that I can match them against other data.
However, at some point (after 520 iterations), I get the message "Too many open files". I tried the common forum advice of calling closeAllConnections(), but that does not seem to be where the problem comes from: showConnections(all = TRUE) shows that only connections 0, 1, and 2, the standard ones, are open.
I would really appreciate any ideas for debugging this.
Below is some dummy code that shows the problem, if needed.
Thank you very much,
Boris
## get the paths of all mol files in the directory
fns <- list.files(fdir[i], pattern = "\\.mol$", full.names = TRUE)
for (j in seq_along(fns)) { # loop over the mol files
  # convert the mol file (or other drawing file) to SMILES
  convMolFormat(infile = fns[j], outfile = "temp.smi", from = "mol", to = "smiles")
  # read the SMILES text back in
  t.smile <- readMolFromSmi(smifile = "temp.smi", type = "text")
  ## t.smile is then put in a data frame to be saved later
}
id = c('P00750', 'P00751', 'P00752')
getFASTAFromUniProt(id)
#gives "" "" ""
This example from the R documentation of getFASTAFromUniProt does not run properly. The second line returns empty character strings, which causes errors in other functions that depend on getFASTAFromUniProt, such as getSeqFromUniProt.
Also,
id = c('P00750', 'P00751', 'P00752')
getSeqFromUniProt(id)
#ERROR
leads to an error:
(Error in FUN(X[[i]], ...) : no line starting with a > character found)
which evidently results from the problem in getFASTAFromUniProt.
It seems that getURLAsynchronous, used in the implementation of getFASTAFromUniProt, is somehow causing the problem. It may be better to replace it with an equivalent base R function.
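The base R replacement suggested above could look roughly like the sketch below. It is not the package's implementation: the helper names uniprot_fasta_url and fetch_fasta are hypothetical, and the URL pattern is an assumption about the UniProt REST service that should be checked against the current UniProt documentation.

```r
# Hypothetical helper: build the FASTA URL for each UniProt accession.
# The URL pattern is an assumption, not taken from Rcpi.
uniprot_fasta_url <- function(id) {
  paste0("https://rest.uniprot.org/uniprotkb/", id, ".fasta")
}

# Hypothetical helper: download each record with base R only.
# A failed download yields NA instead of an empty string, so that
# downstream parsing can detect the failure explicitly.
fetch_fasta <- function(id) {
  vapply(id, function(x) {
    tryCatch(
      paste(readLines(uniprot_fasta_url(x), warn = FALSE), collapse = "\n"),
      error = function(e) NA_character_
    )
  }, character(1))
}

# Usage (requires network access):
# fasta <- fetch_fasta(c("P00750", "P00751", "P00752"))
```

Returning NA on failure, rather than "", makes it easier to tell a failed request apart from a genuinely empty record.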
I want to calculate the volume of molecules, but it is not working: the result is NA for every molecule.
I am trying to get the molecular volume, but I get NA for all values from your package. Is there any way to fix it?
#biomedR/Rcpi
mols <- parse.smiles(pep[1:30,SMILES])
dat = extrDrugVABC(mols)
#head(dat)
head(dat)
VABC
NC@@(C)C(=O)NC@@(C)C(=O)NC@@(C)C(=O)NC@@(CC(=O)N)C(=O)NC@@(CCC(=O)N)C(=O)NC@@(CCSC)C(=O)NC@@(CCCNC(=N)N)C(=O)O NA
NC@@(C)C(=O)NC@@(C)C(=O)NC@@(CC(=O)O)C(=O)NC@@(CC(=O)O)C(=O)NC@@(C@(O)C)C(=O)NC@@(CC(=CN2)C1=C2C=CC=C1)C(=O)NC@@(CCC(=O)O)C(=O)N1C@@(CCC1)C(=O)NC@@(Cc1ccccc1)C(=O)NC@@(C)C(=O)NC@@(CO)C(=O)NCC(=O)NC@@(CCCCN)C(=O)O NA
NC@@(C)C(=O)NC@@(C)C(=O)NC@@(Cc1ccccc1)C(=O)NCC(=O)NC@@(CCC(=O)N)C(=O)NCC(=O)NC@@(CO)C(=O)NCC(=O)N1C@@(CCC1)C(=O)NC@@(C@(CC)C)C(=O)NC@@(CCSC)C(=O)NC@@(CC(C)C)C(=O)NC@@(CC(=O)O)C(=O)NC@@(CCC(=O)O)C(=O)NC@@(C(C)C)C(=O)NC@@(CCC(=O)N)C(=O)NC@@(CS)C(=O)NC@@(C@(O)C)C(=O)NCC(=O)NC@@(C@(O)C)C(=O)NC@@(CCC(=O)O)C(=O)NC@@(C)C(=O)NC@@(CO)C(=O)NC@@(CC(C)C)C(=O)NC@@(C)C(=O)NC@@(CC(=O)O)C(=O)NC@@(CS)C(=O)NC@@(CCCCN)C(=O)O NA
NC@@(C)C(=O)NC@@(C)C(=O)NC@@(Cc1ccccc1)C(=O)NC@@(C@(O)C)C(=O)NC@@(CCC(=O)O)C(=O)NC@@(CS)C(=O)NC@@(CS)C(=O)NC@@(CCC(=O)N)C(=O)NC@@(C)C(=O)NC@@(C)C(=O)NC@@(CC(=O)O)C(=O)NC@@(CCCCN)C(=O)O NA
NC@@(C)C(=O)NC@@(C)C(=O)NC@@(Cc1ccccc1)C(=O)NC@@(C@(O)C)C(=O)NC@@(CCC(=O)O)C(=O)NC@@(CS)C(=O)NC@@(CS)C(=O)NC@@(CCC(=O)N)C(=O)NC@@(C)C(=O)NC@@(C)C(=O)NC@@(CC(=O)O)C(=O)NC@@(CCCCN)C(=O)NC@@(C)C(=O)NC@@(C)C(=O)NC@@(CS)C(=O)NC@@(CC(C)C)C(=O)NC@@(CC(C)C)C(=O)N1C@@(CCC1)C(=O)NC@@(CCCCN)C(=O)O NA
NC@@(C)C(=O)NC@@(C)C(=O)NC@@(C@(CC)C)C(=O)NC@@(CCC(=O)N)C(=O)NC@@(C)C(=O)NC@@(CC(C)C)C(=O)NC@@(CCCNC(=N)N)C(=O)O NA
Hi,
The function getSmiFromPubChem returns an empty string ("") when the id is a single string, but it works when the id is a list of strings.
Hi, I am learning the Rcpi package. I was able to install and run it successfully, and I am attempting to replicate section 3.4 (Structure-Based Chemical Similarity Searching). I have loaded the mol and moldb files, DB00530.sdf and tyrphostin.sdf respectively, but when I run the drug similarity search using the code from the tutorial:
rank1 = searchDrug(
mol, moldb, cores = 4, method = "fp",
fptype = "maccs", fpsim = "tanimoto")
I encounter the error:
Error in order(..., decreasing = decreasing) :
unimplemented type 'list' in 'orderVector1'
I did some searching, but I don't understand why this error occurs. Any help would be appreciated, thanks.
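For readers hitting the same message: base R raises this kind of error when order() is given a list instead of an atomic vector. Whether that is what happens inside searchDrug here is an assumption, not something the report confirms. The sketch below uses made-up similarity scores to show the failure mode and how flattening the list avoids it.

```r
# Made-up similarity scores stored as a list, the way lapply() returns them.
sim_list <- list(a = 0.42, b = 0.91, c = 0.17)

# order(sim_list, decreasing = TRUE) would stop with an
# "unimplemented type 'list'" error, because a list cannot be ordered.

# Flattening to a named numeric vector makes the ranking work.
sim_vec <- unlist(sim_list)
ranked <- sim_vec[order(sim_vec, decreasing = TRUE)]
print(ranked)
```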
Hi there,
I'm trying to calculate fingerprints for ~50,000 molecules. However, I notice that RAM usage only ever increases, to the point of being completely exhausted. I don't understand how this is possible, given that the matrix created by the function extractDrugOBFP4 to store the fingerprints is allocated in advance with the correct dimensions. On inspection, the size of the matrix is constant (~1.6 GB) at every iteration, yet the RAM used by the R session keeps growing as the loop continues. Furthermore, the process is sequential, molecule by molecule, which should not increase RAM usage.
Below is the code of the function, with the section that increases RAM usage marked. I know this is the problematic section because when I replace it with any vector of size 512 (instead of the fingerprint returned by ChemmineOB), the process does not consume more RAM.
I'm not an R expert, so any help would be useful.
Thanks, and sorry about my English.
function (molecules, type = c("smile", "sdf"))
{
check_ob()
if (type == "smile") {
if (length(molecules) == 1L) {
molRefs = eval(parse(text = "ChemmineOB::forEachMol('SMILES', molecules, identity)"))
fp = eval(parse(text = "ChemmineOB::fingerprint_OB(molRefs, 'FP4')"))
}
else if (length(molecules) > 1L) {
fp = matrix(0L, nrow = length(molecules), ncol = 512L)
for (i in 1:length(molecules)) {
molRefs = eval(parse(text = "ChemmineOB::forEachMol('SMILES', molecules[i], identity)"))
###########################################################
####### This is the step which increases RAM usage in each loop step
fp[i, ] = eval(parse(text = "ChemmineOB::fingerprint_OB(molRefs, 'FP4')"))
###########################################################
}
}
}
else if (type == "sdf") {
smi = eval(parse(text = "ChemmineOB::convertFormat(from = 'SDF', to = 'SMILES', source = molecules)"))
smiclean = strsplit(smi, "\\t.*?\\n")[[1]]
if (length(smiclean) == 1L) {
molRefs = eval(parse(text = "ChemmineOB::forEachMol('SMILES', smiclean, identity)"))
fp = eval(parse(text = "ChemmineOB::fingerprint_OB(molRefs, 'FP4')"))
}
else if (length(smiclean) > 1L) {
fp = matrix(0L, nrow = length(smiclean), ncol = 512L)
for (i in 1:length(smiclean)) {
molRefs = eval(parse(text = "ChemmineOB::forEachMol('SMILES', smiclean[i], identity)"))
fp[i, ] = eval(parse(text = "ChemmineOB::fingerprint_OB(molRefs, 'FP4')"))
}
}
}
else {
stop("Molecule type must be \"smile\" or \"sdf\"")
}
return(fp)
}
Hi,
I'm trying to run the whole example script.
Unfortunately, I am stuck at the step where we train three classification models.
After I run the command:
svm.fit1 <- train(
x1.tr, y.tr,
method = "svmRadial", trControl = ctrl,
metric = "ROC", preProc = c("center", "scale")
)
I get this error message:
Error: Please use column names for x
I'm quite new to programming and I don't know how to resolve this problem.
Can you help me? I would be grateful.
Best regards,
Arek
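The error message above suggests that the predictor matrix passed as x has no column names. Assuming that is the case for x1.tr, assigning names before calling train may be enough. The sketch below uses a made-up matrix x as a stand-in for x1.tr; the "V" prefix is an arbitrary choice.

```r
# Made-up stand-in for a descriptor matrix without column names.
x <- matrix(rnorm(20), nrow = 5)
is.null(colnames(x))  # TRUE: no column names yet

# Give every column a unique, syntactically valid name before modelling.
colnames(x) <- paste0("V", seq_len(ncol(x)))
colnames(x)
```

The same names would need to be applied to the test set so that predictions line up with the training columns.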
Some SMILES strings break extractDrugLongestAliphaticChain
library(rcdk)
library(Rcpi)
library(magrittr)
"[H]OC1=C2OC(=O)C34C5=C6C7([H])C8=C(C([H])([H])C([H])(C79C([H])([H])C5([H])C(=C([H])C([H])(C%10([H])C([H])([H])C([H])([H])C([H])([H])C%10([H])[H])C([H])([H])C4([H])C%11(OC(=O)C=%12C%11=C([H])C([H])=C([H])C%12C([H])([H])C([H])([H])C([H])([H])N([H])[H])C23C([H])([H])C6([H])[H])C([H])([H])C9([H])[H])C([H])([H])[H])C([H])([H])C([H])([H])C%13([H])N8C([H])([H])C%14([H])C%15([H])N(C%16([H])C%17(C([H])([H])C%18(C([H])([H])C%17([H])[H])C([H])([H])C([H])([H])C([H])([H])C%18([H])[H])C([H])([H])C([H])([H])C%15([H])C([H])([H])C1%16[H])C([H])([H])C%13([H])C%14([H])[H]" %>%
parse.smiles() %>% .[[1]] %>%
extractDrugLongestAliphaticChain()
#> Error: segfault from C stack overflow
Then, if you skip extractDrugLongestAliphaticChain and call other Rcpi functions instead, the entire R session crashes:
"[H]OC1=C2OC(=O)C34C5=C6C7([H])C8=C(C([H])([H])C([H])(C79C([H])([H])C5([H])C(=C([H])C([H])(C%10([H])C([H])([H])C([H])([H])C([H])([H])C%10([H])[H])C([H])([H])C4([H])C%11(OC(=O)C=%12C%11=C([H])C([H])=C([H])C%12C([H])([H])C([H])([H])C([H])([H])N([H])[H])C23C([H])([H])C6([H])[H])C([H])([H])C9([H])[H])C([H])([H])[H])C([H])([H])C([H])([H])C%13([H])N8C([H])([H])C%14([H])C%15([H])N(C%16([H])C%17(C([H])([H])C%18(C([H])([H])C%17([H])[H])C([H])([H])C([H])([H])C([H])([H])C%18([H])[H])C([H])([H])C([H])([H])C%15([H])C([H])([H])C1%16[H])C([H])([H])C%13([H])C%14([H])[H]" %>%
parse.smiles() %>% .[[1]] %>%
extractDrugXLogP()
*** caught segfault ***
address 0x311000006, cause 'memory not mapped'
Traceback:
1: .jcheck()
2: .jcall(dval, "Lorg/openscience/cdk/qsar/result/IDescriptorResult;", "getValue")
3: FUN(X[[i]], ...)
4: lapply(descvals, .get.desc.values, nexpected = length(dnames))
5: eval.desc(molecules, "org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor", verbose = !silent)
6: extractDrugXLogP(.)
7: "[H]OC1=C2OC(=O)C34C5=C6C7([H])C8=C(C([H])([H])C([H])(C79C([H])([H])C5([H])C(=C([H])C([H])(C%10([H])C([H])([H])C([H])([H])C([H])([H])C%10([H])[H])C([H])([H])C4([H])C%11(OC(=O)C=%12C%11=C([H])C([H])=C([H])C%12C([H])([H])C([H])([H])C([H])([H])N([H])[H])C23C([H])([H])C6([H])[H])C([H])([H])C9([H])[H])C([H])([H])[H])C([H])([H])C([H])([H])C%13([H])N8C([H])([H])C%14([H])C%15([H])N(C%16([H])C%17(C([H])([H])C%18(C([H])([H])C%17([H])[H])C([H])([H])C([H])([H])C([H])([H])C%18([H])[H])C([H])([H])C([H])([H])C%15([H])C([H])([H])C1%16[H])C([H])([H])C%13([H])C%14([H])[H]" %>% parse.smiles() %>% .[[1]] %>% extractDrugXLogP()
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
The first issue may be a CDK Java issue, but can we do something in the second case to prevent the R crash?
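A segfault cannot be caught with tryCatch, so one possible workaround for the crash described above is to run the risky call in a separate R process, where a crash only kills the child. The sketch below uses base R only; run_isolated is a hypothetical helper, not part of Rcpi, and the quoting via shQuote assumes a Unix-like shell.

```r
# Hypothetical helper: evaluate R code (given as text) in a child Rscript
# process, so that a crash there cannot take down the calling session.
run_isolated <- function(expr_text) {
  rscript <- file.path(R.home("bin"), "Rscript")
  out <- suppressWarnings(
    system2(rscript, c("-e", shQuote(expr_text)), stdout = TRUE, stderr = TRUE)
  )
  # system2() attaches a "status" attribute only when the exit code is non-zero.
  status <- attr(out, "status")
  list(ok = is.null(status) || status == 0L, output = as.character(out))
}

# Harmless demonstration; a descriptor call would go in the string instead.
res <- run_isolated("cat(1 + 1)")
res$ok
```

In practice the child would load Rcpi, compute the descriptor, and write the result to a file with saveRDS for the parent to read, at the cost of starting a new R process per call or per batch.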
Hi Nan,
I came across a problem when using the convMolFormat function. The function works well with the example data, but an error occurs when I try to convert from InChI to InChIKey. The file is attached: test.zip
The code I used:
convMolFormat(infile = './test.inchi', outfile = 'test.inchikey', from = 'inchi', to = 'inchikey')
The error:
Error in ChemmineOB::convertFormatFile(from = from, to = to, fromFile = infile, :
failed to set 'from' and 'to' formats: inchi inchikey
I would greatly appreciate your kind help.
Best,
Zhiwei
Dear Nan,
I am trying to use Rcpi to calculate descriptors for my 160 small molecules.
However, R seems to hang. I checked your manual, and I did use 3D structures.
Your test dataset OptAA3d.sdf works without problems.
Should I clean up the structures further? I prepared my SD file in ChemFinder and converted it to 3D using ChemAxon.
Please kindly advise,
Xiannghui
Hi,
I got this error when I tried to rerun your tutorial on Rcpi:
drugseq <- getSmiFromKEGG(drugid, parallel = 5)
java.lang.NullPointerException
at org.guha.rcdk.util.Misc.loadMolecules(Misc.java:169)
Error in load.molecules(tmpfile) :
org.openscience.cdk.exception.CDKException: java.lang.NullPointerException
In the documentation, under
3.1 Regression Modeling in QSRR Study of Retention Indices
There appears to be an error in the following code:
library("Rcpi")
RI.smi = system.file(
"vignettedata/FDAMDD.smi", package = "Rcpi")
RI.csv = system.file(
"vignettedata/RI.csv", package = "Rcpi")
Shouldn't the first file to be loaded be RI.smi instead of FDAMDD.smi? The train step does not work otherwise.
Best wishes.
Hi, thank you for the package. However, I encountered the following issue:
library("Rcpi")
id = c('7847562', '7847563') # Penicillamine
getSmiFromPubChem(id)
and I got this error
Error in FUN(X[[i]], ...) : argument 'x' must be a raw vector
Could you clarify what is going wrong?
I installed Rcpi with the following:
source("http://bioconductor.org/biocLite.R")
biocLite("Rcpi")
and I get the error: there is no package called 'Rcpi'
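The biocLite script used above has been retired by Bioconductor in favour of the BiocManager package, so on current R versions the installation would typically be done as sketched below. Whether this resolves the specific error reported here depends on the R and Bioconductor versions in use, which the report does not state.

```r
# Install BiocManager from CRAN if it is not already available.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install Rcpi from Bioconductor, then load it.
BiocManager::install("Rcpi")
library("Rcpi")
```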
The example returns an empty character string
id = c('P00750', 'P00751', 'P00752')
getFASTAFromUniProt(id)
[1] "" "" ""
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rcpi_1.17.2 bindrcpp_0.2.2 knitr_1.20 gridExtra_2.3 umap_0.2.0.0 Rtsne_0.15
[7] forcats_0.3.0 stringr_1.3.1 dplyr_0.7.8 purrr_0.2.5 readr_1.1.1 tidyr_0.8.2
[13] tibble_1.4.2 ggplot2_3.1.0 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] nlme_3.1-137 bitops_1.0-6 lubridate_1.7.4 bit64_0.9-7
[5] doParallel_1.0.14 httr_1.3.1 rprojroot_1.3-2 tools_3.5.1
[9] backports_1.1.2 R6_2.3.0 DT_0.5 DBI_1.0.0
[13] lazyeval_0.2.1 BiocGenerics_0.28.0 colorspace_1.3-2 withr_2.1.2
[17] tidyselect_0.2.5 bit_1.1-14 compiler_3.5.1 cli_1.0.1
[21] rvest_0.3.2 Biobase_2.42.0 xml2_1.2.0 labeling_0.3
[25] scales_1.0.0 digest_0.6.18 rmarkdown_1.10 XVector_0.22.0
[29] base64enc_0.1-3 pkgconfig_2.0.2 htmltools_0.3.6 itertools_0.1-3
[33] highr_0.7 htmlwidgets_1.3 rlang_0.3.0.1 readxl_1.1.0
[37] rstudioapi_0.8 RSQLite_2.1.1 bindr_0.1.1 jsonlite_1.5
[41] GOSemSim_2.8.0 RCurl_1.95-4.11 magrittr_1.5 GO.db_3.7.0
[45] Matrix_1.2-14 Rcpp_1.0.0 munsell_0.5.0 S4Vectors_0.20.1
[49] reticulate_1.10 stringi_1.2.4 yaml_2.2.0 zlibbioc_1.28.0
[53] plyr_1.8.4 blob_1.1.1 parallel_3.5.1 crayon_1.3.4
[57] rcdklibs_2.0 lattice_0.20-35 Biostrings_2.50.1 haven_1.1.2
[61] hms_0.4.2 pillar_1.3.0 rjson_0.2.20 codetools_0.2-15
[65] reshape2_1.4.3 stats4_3.5.1 ChemmineR_3.34.1 rcdk_3.4.7.1
[69] glue_1.3.0 evaluate_0.12 modelr_0.1.2 foreach_1.4.4
[73] png_0.1-7 cellranger_1.1.0 gtable_0.2.0 assertthat_0.2.0
[77] broom_0.5.0 rsvg_1.3 rJava_0.9-10 fingerprint_3.5.7
[81] iterators_1.0.10 AnnotationDbi_1.44.0 memoise_1.1.0 IRanges_2.16.0
[85] fmcsR_1.24.0
RI2.smi contains only one SMILES string (the first line of vignettedata/RI.smi):
line 1: CCCCCCCCCCCCCCCCCCCCCCC
We got:
smi = system.file('vignettedata/RI2.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugKR(mol)
Error in get.fingerprint(molecules, type = "kr", verbose = !silent) :
Must supply an IAtomContainer or something coercable to it
Env:
R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
Platform: x86_64-pc-linux-gnu (64-bit)