iferres / mlstar

An easy way of MLSTyping your genomes in R.

License: MIT License

R 100.00%

microbial-genomics r mlst pubmlst genome

mlstar's Issues

Choice of target allele with truncated query seq

Hi,

I noticed an issue in the BLAST processing function. It occurs when the query sequence contains only a subsequence of a target, and that subsequence matches equally well to multiple allele sequences in the BLAST database. MLSTar then picks the first line of the BLAST result as the allele.

In the example below, the 5'-end of the query contig contains 401 nt that are identical to the 3'-end of multiple allele sequences, and MLSTar returns allele number 98 for this locus.

  sample   locus    qseqid   sseqid nident pident mismatch  gaps length qstart  qend sstart  send  slen bitscore qseq
1 MySample EFAU004… Contig_…     98    401    100        0     0    401      1   401    725  1125  1125      741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
2 MySample EFAU004… Contig_…     96    401    100        0     0    401      1   401    725  1125  1125      741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
3 MySample EFAU004… Contig_…     95    401    100        0     0    401      1   401    725  1125  1125      741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
4 MySample EFAU004… Contig_…     94    401    100        0     0    401      1   401    725  1125  1125      741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
5 MySample EFAU004… Contig_…     93    401    100        0     0    401      1   401    725  1125  1125      741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…

I think it is related to this calculation of blastRes$scov, which uses query coordinates and should probably just use the subject length:

MLSTar/R/blastInternals.R

Line 155 in ef8dcf4

(blastRes$qend - blastRes$qstart + 1)

It is an edge case and probably does not happen often, but I suggest handling it as an NA allele instead of just using the first allele sequence.
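
The suggestion above could be sketched roughly as follows. This is a minimal illustration in base R, not MLSTar's actual code: `pick_allele`, `min_scov` and `min_pident` are hypothetical names, and the input is assumed to be a data frame of BLAST hits for one locus with the columns shown in the table (`sseqid`, `pident`, `sstart`, `send`, `slen`). Subject coverage is computed from subject coordinates and the allele length, and the call is NA unless exactly one allele is matched over its full length.

```r
# Hypothetical helper, not MLSTar's actual code: choose one allele from the
# BLAST hits of a single locus, or NA when the call is ambiguous.
pick_allele <- function(hits, min_scov = 1, min_pident = 100) {
  # Subject coverage: aligned span on the allele divided by the allele length.
  hits$scov <- (abs(hits$send - hits$sstart) + 1) / hits$slen
  full <- hits[hits$scov >= min_scov & hits$pident >= min_pident, , drop = FALSE]
  # No full-length exact match, or several equally good ones: report NA.
  if (nrow(full) != 1) return(NA_character_)
  as.character(full$sseqid)
}

# The five tied partial hits from the example above (401 of 1125 nt covered).
hits <- data.frame(
  sseqid = c(98, 96, 95, 94, 93),
  pident = 100, sstart = 725, send = 1125, slen = 1125
)
pick_allele(hits)
```

With these hits the subject coverage is 401/1125 for every allele, so none qualifies and the sketch reports NA rather than the first row.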

Downloading cgMLST profiles

Dear Ignacio,

Great work solving issue #8 so quickly. I guess the fix will work for everybody.

I am writing here because I have a problem downloading cgMLST profiles.

I first tried a simple MLST scheme (7 genes from Staphylococcus aureus):

> downloadPubmlst_profile(org="saureus", scheme=1, dir="DATA/test/PubMLST_scheme1/prf/")
[1] "/home/jsanchez/DATA/test/PubMLST_scheme1/prf/profile_scheme1.tab"

and it worked for me.

But if I try:

> downloadPubmlst_profile(org="saureus", scheme=2, dir="DATA/test/PubMLST_scheme2/prf/")
Error in downloadPubmlst_profile(org = "saureus", scheme = 2, dir = "DATA/test/PubMLST_scheme2/prf/"): 
  Could not download profile - Invalid input.

I tried to download the sequences using:

downloadPubmlst_seq(org="saureus", scheme=1, dir="DATA/test/PubMLST_scheme1/seq/")
downloadPubmlst_seq(org="saureus", scheme=2, dir="DATA/test/PubMLST_scheme2/seq/")

Both worked perfectly.

Finally, I tried another organism, Salmonella. It has 4 schemes: schemes 1, 3 and 4 are cgMLST and scheme 2 is MLST. I tried:

downloadPubmlst_profile(org="salmonella", scheme=1, dir="DATA/test/PubMLST_scheme2/prf_salmo1/")
downloadPubmlst_profile(org="salmonella", scheme=3, dir="DATA/test/PubMLST_scheme2/prf_salmo2/")
downloadPubmlst_profile(org="salmonella", scheme=4, dir="DATA/test//PubMLST_scheme2/prf_salmo3/")

Only scheme 3 worked for me; it downloaded the whole cgMLST profile in just a few minutes. Schemes 1 and 4 failed with the same error as in the S. aureus example.

Do you have any idea what is going on, or any suggestions?

Thanks,
Jose
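
When checking several schemes like this, it can help to record each outcome instead of stopping at the first error. The following is a minimal sketch in base R; `try_schemes` and `fake_fetch` are hypothetical names, and `fake_fetch` is only a stand-in that mimics the behaviour reported above so the example runs offline.

```r
# Hypothetical helper: call `fetch` once per scheme and record the outcome
# instead of stopping at the first error.
try_schemes <- function(schemes, fetch) {
  status <- vapply(schemes, function(s) {
    tryCatch({ fetch(s); "ok" }, error = function(e) conditionMessage(e))
  }, character(1))
  data.frame(scheme = schemes, status = status, stringsAsFactors = FALSE)
}

# Stand-in for the real download so the sketch runs offline. With MLSTar
# loaded, one could instead pass something like
#   function(s) downloadPubmlst_profile(org = "salmonella", scheme = s, dir = tempdir())
fake_fetch <- function(s) {
  if (s != 3) stop("Could not download profile - Invalid input.")
  invisible(TRUE)
}

try_schemes(c(1, 3, 4), fake_fetch)
```

The resulting table makes it easy to report exactly which schemes fail and with which message.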

MLSTar further details

Dear developers,
I am interested in using MLSTar within a pipeline for the identification/genotyping of bacterial isolates from clinical samples, and I have a couple of questions that I hope you can answer.

Where can I find further information on the options/arguments that can be passed to functions such as doMLST? I haven't found a manual or descriptions for them.
I would like to download the PubMLST profile for a given bacterium at the start, then populate my own profile with samples we have previously identified, in order to assess relationships between our isolates and those in the database. Is there a way to do this, or should I work it out myself?

Thank you very much in advance
Jose F. Sanchez
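
For the first question, R's built-in introspection and help tools are one place to start. The sketch below uses only base R; `list_args` is a hypothetical helper name, and it is demonstrated on `read.table` so that it runs without MLSTar installed. Whether MLSTar ships help pages for every function is not stated here, so the commented lines are suggestions to try rather than guaranteed results.

```r
# Hypothetical helper: return the argument names a function accepts.
list_args <- function(fn) names(formals(fn))

list_args(read.table)  # demonstrated on a base R function

# With MLSTar installed, the same lookups would apply to its functions:
#   list_args(MLSTar::doMLST)
#   ?doMLST                    # help page, if the package documents it
#   help(package = "MLSTar")   # index of documented functions
```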

Windows compatibility

Hello!

Are there any plans to make this package compatible with Windows?

Thanks!
Nimalka

Can't find Klebsiella pneumoniae in listPubmlst_orgs()

Hi,
I couldn't find "Klebsiella pneumoniae" among the organisms returned by listPubmlst_orgs().
Thank you
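
The examples elsewhere on this page use abbreviated organism names such as "saureus", so searching the returned list with a partial, case-insensitive pattern may be more reliable than looking for the full species name. A minimal sketch in base R follows; `find_org` is a hypothetical helper, and the `orgs` vector is a made-up stand-in for the output of listPubmlst_orgs(), whose real contents come from PubMLST and will differ.

```r
# Hypothetical helper: case-insensitive search in a vector of organism names.
find_org <- function(orgs, pattern) {
  grep(pattern, orgs, ignore.case = TRUE, value = TRUE)
}

# Made-up stand-in for the vector returned by listPubmlst_orgs(); the real
# list comes from PubMLST and will differ.
orgs <- c("saureus", "salmonella", "spneumoniae")

find_org(orgs, "pneumo")  # partial match on the species name
find_org(orgs, "kleb")    # empty result: nothing in this stand-in matches
```

An empty result would suggest that the organism is not offered under any similar name by the queried database.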

New alleles can be forward/reverse duplicates of a single allele

Hi Ignacio,
First of all, thanks for this package! It's really helpful! (bioinformatics rookie here...)

I don't know if you're aware that the new-allele outputs are sometimes just the forward and reverse versions of the same allele. As a result, the MLSTar outputs are not directly usable when new alleles are identified, because additional work is needed to confirm that each allele is unique. When there are numerous new alleles, as in my case (>300), it's not straightforward to "pair" them. I had to check and align them in Geneious to do so.
So an extra step that orients the sequences would probably be needed.
I'd also like to take this opportunity to make two suggestions:

An extra function that outputs a list of unique alleles (ready to be submitted to PubMLST) would be great.
Another useful function would list all new STs in a form that is easy to submit to PubMLST.
Thanks
Marie

The code of a newbie like me is probably far from useful for you, but just in case it might help:
```r
# Load the libraries used for FASTA file manipulation
library(Biostrings)  # readDNAStringSet()
library(DECIPHER)    # OrientNucleotides()
library(seqinr)      # write.fasta()

# Locate the FASTA files of new alleles written by doMLST
work_dir <- "/media/sf_Marie/MLST"  # assumed: work_dir was not defined in the original snippet
seqs <- list.files(path = work_dir, pattern = "\\.fasta$", recursive = TRUE, full.names = TRUE)
nseqs <- grep(pattern = "/MyGenomesMLST/results", seqs, value = TRUE)  # or whatever name was given as fdir in doMLST
output_folder <- file.path(work_dir, "mygen_new_alleles")
dir.create(output_folder, showWarnings = FALSE)

for (f in nseqs) {
  tmp <- readDNAStringSet(filepath = f, format = "fasta")
  tmp <- OrientNucleotides(tmp)  # put all sequences in the same orientation
  tmp <- unique(tmp)             # drop exact duplicates
  locus <- sub("\\.fasta$", "", basename(f))
  allele_names <- paste0(locus, "NEW", seq_along(tmp))
  write.fasta(sequences = as.list(as.character(tmp)),  # as.list is necessary for some programs
              names = as.list(allele_names),
              file.out = file.path(output_folder, paste0(locus, "_new_alleles.fas")))
}
```
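
The core idea of pairing forward and reverse versions can also be illustrated without any Bioconductor packages. The sketch below is a simplified alternative in base R, not what MLSTar or DECIPHER does: `revcomp` and `dedup_orientations` are hypothetical helpers that map each sequence to the alphabetically smaller of its two orientations and then remove duplicates. It assumes plain uppercase or lowercase A/C/G/T strings, and it does not orient sequences relative to a reference allele.

```r
# Reverse-complement DNA strings using base R only.
revcomp <- function(x) {
  vapply(x, function(s) {
    paste(rev(strsplit(chartr("ACGTacgt", "TGCAtgca", s), "")[[1]]), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: keep one representative per forward/reverse pair by
# mapping each sequence to the alphabetically smaller of its two orientations.
dedup_orientations <- function(seqs) {
  rc <- revcomp(seqs)
  canonical <- ifelse(seqs <= rc, seqs, rc)
  unique(canonical)
}

# "ACGTT" is the reverse complement of "AACGT", so only one of them is kept.
dedup_orientations(c("AACGT", "ACGTT", "GGGA"))
```

Because the representative is chosen alphabetically, it may be the reverse strand of the database allele; orienting against known alleles before submission would still be needed.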

MLST for E. coli?

Hi @iferres
I am a beginner in genome analysis. MLSTar is fantastic software for in silico MLST, and I love the plot function. When I tried to do MLST for my E. coli samples, I listed all the profiles but couldn't find one for E. coli. Does MLSTar support MLST for E. coli? Thank you.

fdir doMLST bug

Dear @iferres,

I have seen that if I provide a full path for fdir in the doMLST function, the folder and its subfolders are created inside the directory from which I call the function.
Instead of getting the folder I provide:

fdir = "/home/jsanchez/DATA/MLSTar/example2_test"

the folder is created at a different path:
/home/jsanchez/DATA/MLSTar/home/jsanchez/DATA/MLSTar/example2_test

I have checked your code and it seems OK, so I wonder if you have any suggestions.

I guess the problem arises because the path is created iteratively by dir.create, which generates a folder for each '/' identified. Are you using the option recursive=TRUE for some reason?

I also noticed that this does not occur in RStudio, but it does occur when I run the same code with Rscript. I checked with sessionInfo(), and in both cases I am using the same packages, R version, etc.

Do you have any thoughts or a solution? As a possible fix, perhaps the function could avoid stopping when fdir already exists, or recursive could be set to FALSE.

Thank you very much
Jose F Sanchez
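
The doubled path reported above is what one would get if an absolute fdir were appended to the working directory. That is only one possible cause and is not confirmed against the MLSTar source. The sketch below, in base R, shows the pattern and a guard against it; `resolve_dir` is a hypothetical helper that treats paths starting with "/" or "~" as absolute.

```r
# Hypothetical helper: resolve an output directory against a base directory,
# leaving absolute paths (starting with "/" or "~") untouched.
resolve_dir <- function(fdir, base = getwd()) {
  if (grepl("^[/~]", fdir)) fdir else file.path(base, fdir)
}

base <- "/home/jsanchez/DATA/MLSTar"
fdir <- "/home/jsanchez/DATA/MLSTar/example2_test"

paste0(base, fdir)                  # naive concatenation reproduces the doubled path
resolve_dir(fdir, base)             # absolute path is returned unchanged
resolve_dir("example2_test", base)  # relative path is placed under base
```

Passing the resolved path to dir.create with recursive = TRUE would then create only the intended directory tree.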

doMLST() Error on R Console

In R version 3.3.3 on Mac OS X Mavericks 10.9.5, pasting a doMLST() call into the console printed this error:

Running BLASTN...Error in strsplit(db, "/") : non-character argument
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.

It ran without any problem in R version 3.5.0 on macOS High Sierra 10.13.4.
