iferres / mlstar
An easy way of MLSTyping your genomes in R.
License: MIT License
Hi,
I noticed an issue in the blast processing function that occurs when the query sequence contains only a subsequence of a target that matches equally well to multiple allele sequences in the blast database. Then MLSTar will pick the first line of the blast result as the allele.
In the example below, the 5'-end of the query contig contains 401 nt that match identically to the 3'-end of multiple allele sequences, and MLSTar returns allele number 98 for this locus.
sample locus qseqid sseqid nident pident mismatch gaps length qstart qend sstart send slen bitscore qseq
1 MySample EFAU004… Contig_… 98 401 100 0 0 401 1 401 725 1125 1125 741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
2 MySample EFAU004… Contig_… 96 401 100 0 0 401 1 401 725 1125 1125 741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
3 MySample EFAU004… Contig_… 95 401 100 0 0 401 1 401 725 1125 1125 741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
4 MySample EFAU004… Contig_… 94 401 100 0 0 401 1 401 725 1125 1125 741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
5 MySample EFAU004… Contig_… 93 401 100 0 0 401 1 401 725 1125 1125 741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
I think it is related to this calculation of blastRes$scov, which uses query coordinates and should probably just use the subject length:
Line 155 in ef8dcf4
It is an edge case and probably does not happen often, but I suggest it be handled as an NA allele instead of just using the first allele sequence.
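A minimal sketch of the suggestion, assuming the column names shown in the table above (sstart, send, slen, pident, gaps). This is not MLSTar's actual code, and `call_allele` is a hypothetical helper used only for illustration. Coverage is computed from the subject coordinates, and an allele is reported only when exactly one hit is a full-length, identical match.

```r
# Hypothetical BLAST hits mirroring the table above (sseqid = allele number).
hits <- data.frame(
  sseqid = c(98, 96, 95, 94, 93),
  pident = 100,
  gaps   = 0,
  sstart = 725,
  send   = 1125,
  slen   = 1125
)

# Subject coverage: fraction of the allele sequence covered by the alignment.
hits$scov <- (abs(hits$send - hits$sstart) + 1) / hits$slen

# Hypothetical helper: return an allele only if exactly one hit is a
# full-length, identical, gap-free match; otherwise return NA.
call_allele <- function(h) {
  exact <- h[h$pident == 100 & h$gaps == 0 & h$scov == 1, , drop = FALSE]
  if (nrow(exact) == 1) exact$sseqid else NA
}

call_allele(hits)
```

With these hits the alignment covers only 401 of 1125 subject positions, so none of the five alleles qualifies and the locus is reported as NA rather than as allele 98.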
Dear Ignacio,
Great work solving issue #8 so quickly. I guess it will work for everybody.
I am writing here because I am having a problem when downloading cgMLST profiles.
I tried for a simple MLST (7 genes from Staphylococcus aureus):
> downloadPubmlst_profile(org="saureus", scheme=1, dir="DATA/test/PubMLST_scheme1/prf/")
[1] "/home/jsanchez/DATA/test/PubMLST_scheme1/prf/profile_scheme1.tab"
and it worked for me.
But if I try:
> downloadPubmlst_profile(org="saureus", scheme=2, dir="DATA/test/PubMLST_scheme2/prf/")
Error in downloadPubmlst_profile(org = "saureus", scheme = 2, dir = "DATA/test/PubMLST_scheme2/prf/"):
Could not download profile - Invalid input.
I tried to download the sequences using:
downloadPubmlst_seq(org="saureus", scheme=1, dir="DATA/test/PubMLST_scheme1/seq/")
downloadPubmlst_seq(org="saureus", scheme=2, dir="DATA/test/PubMLST_scheme2/seq/")
And both of them worked just perfect.
Finally, I tried with another organism: Salmonella. This one has 4 schemes: 1,3 and 4 are cgMLST and scheme 2 is MLST. I tried:
downloadPubmlst_profile(org="salmonella", scheme=1, dir="DATA/test/PubMLST_scheme2/prf_salmo1/")
downloadPubmlst_profile(org="salmonella", scheme=3, dir="DATA/test/PubMLST_scheme2/prf_salmo2/")
downloadPubmlst_profile(org="salmonella", scheme=4, dir="DATA/test//PubMLST_scheme2/prf_salmo3/")
It only worked for scheme 3, which downloaded the whole cgMLST profile in just a few minutes. Schemes 1 and 4 failed with the same error output as in the saureus example.
Do you have any idea what is going on? Do you have any suggestion?
Thanks,
Jose
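To narrow down which schemes fail, the calls above could be wrapped in `tryCatch` so that one failure does not stop the rest. The following is only a sketch: `probe_schemes` and `fake_download` are hypothetical names, and `fake_download` is a stand-in that imitates the reported behaviour so the example runs offline. The real function would be passed in its place, with the same `org`, `scheme` and `dir` arguments used above.

```r
# Hypothetical helper: try each scheme and record "ok" or the error message.
probe_schemes <- function(download_fun, org, schemes, base_dir) {
  status <- vapply(schemes, function(s) {
    tryCatch({
      download_fun(org = org, scheme = s,
                   dir = file.path(base_dir, paste0("scheme", s)))
      "ok"
    }, error = function(e) conditionMessage(e))
  }, character(1))
  setNames(status, paste0("scheme", schemes))
}

# Stand-in for the real downloader so the sketch runs offline: it fails for
# every scheme except 3, imitating the behaviour reported above.
fake_download <- function(org, scheme, dir) {
  if (scheme != 3) stop("Could not download profile - Invalid input.")
  file.path(dir, paste0("profile_scheme", scheme, ".tab"))
}

res <- probe_schemes(fake_download, "salmonella", c(1, 3, 4), tempdir())
res
```

Replacing `fake_download` with `downloadPubmlst_profile` would give one status per scheme, which may be easier to include in a bug report than separate console sessions.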
Dear developers,
I would be interested in implementing MLSTar within a pipeline for the identification/genotyping of bacterial isolates from clinical samples, and I have a couple of questions that I hope you can answer.
Where can I find further information on the options/arguments that can be passed to the different functions, such as doMLST? I haven't found a manual or a description for them.
I would be interested in downloading the pubmlst profile for a given bacterium at the beginning, then populating my own profile with samples that we had previously identified, in order to assess relations between our isolates and the ones in the database. Is there a way to do this with MLSTar, or should I work it out myself?
Thank you very much in advance
Jose F. Sanchez
Hello!
Is there any intention of expanding this package to have Windows compatibility?
Thanks!
Nimalka
Hi,
I couldn't find "klebsiella pneumoniae" among the orgs in listPubmlst().
Thank you
Hi Ignacio,
First of all, thanks for this package! It's really helpful! (bioinformatics rookie here...)
I don't know if you're aware that the new allele outputs are sometimes just the forward and reverse versions of the same allele. As a result, the MLSTar outputs per se are not 100% useful when new alleles are identified, as additional work is needed to be sure each one is a unique allele. When there are numerous new alleles, as in my case (>300), it's not really straightforward to "pair" those alleles. I've had to check and align them in Geneious to do so.
So an extra step that orients the sequences would probably be needed.
I take advantage of this message to also give suggestions:
The code of a newbie like me is probably far from useful for you, but just in case it might help:
`# Load libraries used for FASTA file manipulation
library(Biostrings)
library(DECIPHER)
library(seqinr)
# Base directory for the output (assumption: work_dir was not defined in the original snippet)
work_dir <- getwd()
# List the FASTA files of new alleles written by doMLST
seqs <- list.files(path = "/media/sf_Marie/MLST/", pattern = "\\.fasta$", recursive = TRUE, full.names = TRUE)
nseqs <- grep(pattern = "/MyGenomesMLST/results", seqs, value = TRUE) # or whatever name was given as fdir in doMLST
output_folder <- "/mygen_new_alleles"
dir.create(paste0(work_dir, output_folder))
for (x in seq_along(nseqs)) {
  # Read the FASTA file as a DNAStringSet
  tmp <- readDNAStringSet(filepath = nseqs[x], format = "fasta")
  tmp <- OrientNucleotides(tmp) # put all sequences in the same orientation
  tmp <- unique(tmp)            # drop duplicated sequences
  locus <- gsub("\\.fasta$", "", basename(nseqs[x]))
  allele_names <- paste0(locus, "NEW", seq_along(tmp))
  write.fasta(sequences = as.list(paste(tmp)), names = as.list(allele_names), file.out = paste0(work_dir, output_folder, "/", locus, "_new_alleles.fas")) # as.list is necessary for some programs
}
`
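For readers without DECIPHER installed, the same idea can be sketched in base R. This is a different, simpler technique than `OrientNucleotides`: each sequence is replaced by whichever of its two strands sorts first, so forward and reverse copies of one allele collapse to the same string. The helper names `revcomp` and `canonical` are hypothetical, the toy sequences are made up, and the sketch assumes plain A/C/G/T input with no ambiguity codes.

```r
# Reverse complement of a single DNA string (A/C/G/T only).
revcomp <- function(s) {
  comp <- chartr("ACGT", "TGCA", toupper(s))
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Canonical form: the alphabetically smaller of the two strands.
canonical <- function(s) {
  s <- toupper(s)
  min(s, revcomp(s))
}

# Toy "new alleles": the second is the reverse complement of the first.
alleles <- c("ATGCCA", "TGGCAT", "ATGAAA")

unique_alleles <- unique(vapply(alleles, canonical, character(1),
                                USE.NAMES = FALSE))
unique_alleles
```

Here three input sequences reduce to two unique alleles. Unlike the DECIPHER approach, the resulting strand is arbitrary rather than matched to a reference, so it is suitable for de-duplication but not for reporting alleles in the scheme's orientation.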
Hi @iferres
I am a beginner in genome analysis. MLSTar is a fantastic piece of software for in silico MLST, and I love the plot function. When I was going to do MLST for my E. coli samples, I listed all the profiles but couldn't find one for E. coli. So I wonder whether it supports MLST for E. coli or not. Thank you.
Dear @iferres,
I have seen that if I provide a full path for fdir within the doMLST function, I get a folder and subfolders generated in the directory where I call the function.
Instead of getting the folder I provide:
fdir = "/home/jsanchez/DATA/MLSTar/example2_test"
the folder is generated in another path:
/home/jsanchez/DATA/MLSTar/home/jsanchez/DATA/MLSTar/example2_test
I have checked your code and it seems OK, so I wonder if you have any suggestion.
I guess the problem is that the path is created iteratively by dir.create, which generates a folder for each '/' identified. I wonder if you are using this option (recursive=TRUE) for some reason.
I also noticed that this does not occur if I use RStudio, but it does occur if I run the same code using Rscript. I have checked with sessionInfo(), and in both cases I am using the same modules, R version, etc.
I wonder if you have any thoughts or a solution. As a possible fix, perhaps you could avoid stopping the function if fdir already exists, or set recursive to FALSE.
Thank you very much
Jose F Sanchez
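One way to check the hypothesis above: in base R, `dir.create()` with `recursive = TRUE` and an absolute path creates exactly that path, so a nested result like the one reported is more likely to come from the working directory being prepended to `fdir` somewhere. Whether MLSTar does that is an assumption I have not verified. The sketch below uses a hypothetical helper, `resolve_fdir`, that prepends the working directory only when the path is relative.

```r
# Hypothetical helper, not part of MLSTar: resolve an output directory
# without prepending the working directory to an absolute path.
resolve_fdir <- function(fdir, wd = getwd()) {
  # Absolute if it starts with "/", "~", or a Windows drive such as "C:/".
  is_absolute <- grepl("^(/|~|[A-Za-z]:[/\\\\])", fdir)
  if (is_absolute) fdir else file.path(wd, fdir)
}

# An absolute target inside the session's temporary directory.
out <- resolve_fdir(file.path(tempdir(), "MLSTar", "example2_test"))

# recursive = TRUE creates the missing parent folders of this one path;
# showWarnings = FALSE keeps the call quiet if the folder already exists.
dir.create(out, recursive = TRUE, showWarnings = FALSE)
dir.exists(out)
```

Using `showWarnings = FALSE` instead of stopping when the folder exists also covers the second suggestion, although silently reusing a folder may not be what every user wants.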
In R version 3.3.3 on Mac OS X Mavericks 10.9.5, pasting a doMLST() call into the console printed this error:
Running BLASTN...Error in strsplit(db, "/") : non-character argument
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
It ran without any problem in R version 3.5.0 on macOS High Sierra 10.13.4.