mlstar's People

Contributors

antunderwood, giraola, iferres

mlstar's Issues

Choice of target allele with truncated query seq

Hi,

I noticed an issue in the BLAST processing function that occurs when the query sequence contains only a subsequence of a target, and that subsequence matches equally well to multiple allele sequences in the BLAST database. In that case, MLSTar picks the first line of the BLAST result as the allele.

In the example below, the 5'-end of the query contig contains 401 nt that match identically to the 3'-end of multiple allele sequences, and MLSTar returns allele number 98 for this locus.

 sample  locus    qseqid   sseqid nident pident mismatch  gaps length qstart  qend sstart  send  slen bitscore qseq                                                                                   
1 MySample EFAU004… Contig_…     98    401    100        0     0    401      1   401    725  1125  1125      741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
2 MySample EFAU004… Contig_…     96    401    100        0     0    401      1   401    725  1125  1125      741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
3 MySample EFAU004… Contig_…     95    401    100        0     0    401      1   401    725  1125  1125      741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
4 MySample EFAU004… Contig_…     94    401    100        0     0    401      1   401    725  1125  1125      741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…
5 MySample EFAU004… Contig_…     93    401    100        0     0    401      1   401    725  1125  1125      741 AACAATTTCTCGCACTTTTGAACCAAAATAAAAAGAAAGACCTTTTT…

I think it is related to this calculation of blastRes$scov, which uses query coordinates and should probably just use the subject length:

(blastRes$qend - blastRes$qstart + 1)

It is an edge case and probably does not happen often, but I suggest handling it as an NA allele instead of just using the first allele sequence.
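The suggestion above can be sketched as follows. This is a minimal illustration in base R, not MLSTar's actual code: `call_allele` is a hypothetical helper, and computing coverage from the subject coordinates (`sstart`, `send`, `slen`) is one assumed reading of the proposal. An allele is reported only when exactly one hit is identical and covers the whole allele; partial or tied hits yield NA.

```r
# Hypothetical helper (not part of MLSTar); column names follow the BLAST
# table shown in this issue.
call_allele <- function(blastRes) {
  # Subject coverage: fraction of the allele sequence covered by the alignment
  scov <- (abs(blastRes$send - blastRes$sstart) + 1) / blastRes$slen
  full <- blastRes[scov == 1 & blastRes$pident == 100, , drop = FALSE]
  # Report an allele only if exactly one full-length identical hit exists
  if (nrow(full) == 1) return(as.character(full$sseqid))
  NA_character_
}

# The five tied partial hits from the example (401 of 1125 nt covered)
blastRes <- data.frame(
  sseqid = c("98", "96", "95", "94", "93"),
  pident = 100, length = 401,
  qstart = 1, qend = 401,
  sstart = 725, send = 1125, slen = 1125,
  stringsAsFactors = FALSE
)
partial_call <- call_allele(blastRes)

# An invented full-length hit, for contrast
full_hit <- data.frame(sseqid = "12", pident = 100,
                       sstart = 1, send = 1125, slen = 1125,
                       stringsAsFactors = FALSE)
full_call <- call_allele(full_hit)
```

With the example rows, every hit covers only about 36% of the allele, so no allele number is assigned.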

Downloading cgMLST profiles

Dear Ignacio,

Great work solving issue #8 so quickly. I guess it will work for everybody.

I am writing here because I am having a problem when downloading cgMLST profiles.

I tried for a simple MLST (7 genes from Staphylococcus aureus):

> downloadPubmlst_profile(org="saureus", scheme=1, dir="DATA/test/PubMLST_scheme1/prf/")
[1] "/home/jsanchez/DATA/test/PubMLST_scheme1/prf/profile_scheme1.tab"

and it worked for me.

But if I try:

> downloadPubmlst_profile(org="saureus", scheme=2, dir="DATA/test/PubMLST_scheme2/prf/")
Error in downloadPubmlst_profile(org = "saureus", scheme = 2, dir = "DATA/test/PubMLST_scheme2/prf/"): 
  Could not download profile - Invalid input.

I tried to download the sequences using:

downloadPubmlst_seq(org="saureus", scheme=1, dir="DATA/test/PubMLST_scheme1/seq/")
downloadPubmlst_seq(org="saureus", scheme=2, dir="DATA/test/PubMLST_scheme2/seq/")

Both of them worked perfectly.

Finally, I tried with another organism: Salmonella. This one has 4 schemes: 1, 3 and 4 are cgMLST, and scheme 2 is MLST. I tried:

downloadPubmlst_profile(org="salmonella", scheme=1, dir="DATA/test/PubMLST_scheme2/prf_salmo1/")
downloadPubmlst_profile(org="salmonella", scheme=3, dir="DATA/test/PubMLST_scheme2/prf_salmo2/")
downloadPubmlst_profile(org="salmonella", scheme=4, dir="DATA/test//PubMLST_scheme2/prf_salmo3/")

Only scheme 3 worked for me, and it downloaded the whole cgMLST profile in just a few minutes. Schemes 1 and 4 failed with the same error output as in the S. aureus example.

Do you have any idea what is going on? Do you have any suggestion?

Thanks,
Jose
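To see at a glance which schemes fail, the calls above can be wrapped in `tryCatch` so that one failure does not interrupt the rest. This is a minimal sketch: `try_schemes` and `fake_download` are hypothetical names, and the stub only imitates the behaviour reported here. With MLSTar loaded, `downloadPubmlst_profile` would be passed in place of the stub.

```r
# Hypothetical helper: call a download function for several schemes and
# record "ok" or the error message for each one.
try_schemes <- function(download_fun, org, schemes, base_dir) {
  res <- vapply(schemes, function(s) {
    out_dir <- file.path(base_dir, paste0("scheme", s))
    tryCatch({
      download_fun(org = org, scheme = s, dir = out_dir)
      "ok"
    }, error = function(e) conditionMessage(e))
  }, character(1))
  names(res) <- as.character(schemes)
  res
}

# Stub that imitates the reported behaviour (only scheme 3 succeeds)
fake_download <- function(org, scheme, dir) {
  if (scheme != 3) stop("Could not download profile - Invalid input.")
  file.path(dir, paste0("profile_scheme", scheme, ".tab"))
}

res <- try_schemes(fake_download, "salmonella", c(1, 3, 4), "DATA/test")
print(res)
```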

MLSTar further details

Dear developers,
I am interested in implementing MLSTar in a pipeline for the identification and genotyping of bacterial isolates from clinical samples, and I have a couple of questions that I hope you can answer.

  1. Where can I find further information on the options/arguments that can be passed to the different functions, such as doMLST? I haven't found a manual or a description for them.

  2. I would like to download the PubMLST profile for a given bacterium at the start, then populate my own profile with samples that we had previously identified, in order to assess relationships between the different isolates obtained and those in the database. Is there a way to do this, or should I work it out myself?
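For the second question, one possible approach is to treat the downloaded profile as a plain table and join previously typed isolates to it by their allele numbers. This is a minimal sketch in base R, not an MLSTar feature: the tables are invented stand-ins, and the column names (`ST`, `arcC`, `aroE`, `isolate`) are illustrative assumptions. A real profile file could be read with `read.delim()` first.

```r
# Loci used for the join (illustrative subset of a scheme)
loci <- c("arcC", "aroE")

# Stand-in for a downloaded profile table: one ST per allele combination
pubmlst <- data.frame(ST = c(1L, 2L), arcC = c(1L, 2L), aroE = c(1L, 3L))

# Stand-in for isolates typed earlier in the lab
isolates <- data.frame(isolate = c("iso_A", "iso_B"),
                       arcC = c(2L, 5L), aroE = c(3L, 1L))

# Left join on the loci: isolates with a known allele combination get its ST,
# unknown combinations get NA
annotated <- merge(isolates, pubmlst, by = loci, all.x = TRUE)
print(annotated)
```

Isolates that end up with an NA in `ST` carry an allele combination that is absent from the downloaded table.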

Thank you very much in advance
Jose F. Sanchez

Windows compatibility

Hello!

Is there any intention of expanding this package to have Windows compatibility?

Thanks!
Nimalka

New alleles can be forward/reverse duplicates of a single allele

Hi Ignacio,
First of all, thanks for this package! It's really helpful! (bioinformatics rookie here...)

I don't know if you're aware that the new allele outputs are sometimes just the forward and reverse versions of the same allele. As a result, the MLSTar outputs are not fully usable as they are when new alleles are identified, since additional work is needed to confirm that each allele is unique. When there are numerous new alleles, as in my case (>300), it's not straightforward to "pair" those alleles. I had to check and align them in Geneious to do so.
So an extra step that orients the sequences would probably be needed.
I'll also take this opportunity to make some suggestions:

  • An extra function that outputs a list of unique alleles (ready to be submitted to PubMLST) would be great.
  • Another function that lists all new STs in a format that is easy to submit to PubMLST would also be useful.

Thanks
Marie

The code of a newbie like me is probably far from useful for you, but just in case it might help:

```r
# Load libraries used for FASTA file manipulation
library(Biostrings)
library(DECIPHER)
library(seqinr)

# Working directory holding the doMLST results. work_dir was not defined in
# the original snippet; this value is an assumption, set it to your own path.
work_dir <- "/media/sf_Marie/MLST"

# List the FASTA files of new alleles written by doMLST
seqs <- list.files(path = work_dir, pattern = "\\.fasta$", recursive = TRUE, full.names = TRUE)
nseqs <- grep(pattern = "/MyGenomesMLST/results", seqs, value = TRUE) # or whatever name was given as fdir in doMLST
output_folder <- "/mygen_new_alleles"
dir.create(paste0(work_dir, output_folder))

for (x in seq_along(nseqs)) {
  # Read the FASTA file as a DNAStringSet
  tmp <- readDNAStringSet(filepath = nseqs[x], format = "fasta")
  tmp <- OrientNucleotides(tmp) # put all sequences in the same orientation
  tmp <- unique(tmp)            # drop duplicated sequences
  locus <- gsub("\\.fasta$", "", basename(nseqs[x]))
  new_names <- paste0(locus, "NEW", seq_along(tmp))
  write.fasta(sequences = as.list(paste(tmp)), names = as.list(new_names),
              file.out = paste0(work_dir, output_folder, "/", locus, "_new_alleles.fas")) # as.list is necessary for some programs
}
```
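The forward/reverse duplication described in this issue can also be illustrated without extra packages. This is a minimal base R sketch, not MLSTar's code: `revcomp` and `canonical` are hypothetical helpers, and the sequences are invented. Each sequence is mapped to whichever of its two orientations sorts first, so an allele and its reverse complement collapse to the same key.

```r
# Reverse complement of each DNA string in a character vector
revcomp <- function(x) {
  comp <- chartr("ACGT", "TGCA", toupper(x))
  vapply(strsplit(comp, ""), function(ch) paste(rev(ch), collapse = ""),
         character(1))
}

# Orientation-independent key: the alphabetically smaller of the sequence
# and its reverse complement
canonical <- function(x) {
  x <- toupper(x)
  rc <- revcomp(x)
  ifelse(x <= rc, x, rc)
}

# Invented example: new2 is the reverse complement of new1
alleles <- c(new1 = "AACGT", new2 = "ACGTT", new3 = "GGGTA")
unique_alleles <- alleles[!duplicated(canonical(alleles))]
print(unique_alleles)
```

Exact matching of this kind only catches alleles that are identical over their full length once oriented. Sequences that differ in length or content still need an alignment-based check.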

Mlst for E.coli?

Hi @iferres
I am a beginner in genome analysis. MLSTar is fantastic software for in silico MLST, and I love the plot function. When I went to run MLST on my E. coli samples, I listed all the profiles but couldn't find one for E. coli. So I wonder whether it supports MLST for E. coli or not. Thank you.

fdir doMLST bug

Dear @iferres,

I have seen that if I provide a full path for fdir in the doMLST function, the folder and its subfolders are generated inside the directory from which I call the function, instead of at the path I provide:

fdir = "/home/jsanchez/DATA/MLSTar/example2_test"

The folder is generated at another path:
/home/jsanchez/DATA/MLSTar/home/jsanchez/DATA/MLSTar/example2_test

I have checked your code and it seems OK, so I wonder if you have any suggestion.

I guess the problem arises because the folder is generated iteratively by dir.create, which creates a folder for each '/' identified. I wonder if you are using this option (recursive=TRUE) for some reason.

I also noticed that this does not occur in RStudio, but it does occur when I run the same code with Rscript. I have checked with sessionInfo(), and in both cases I am using the same packages, R version, etc.

I wonder if you have any thoughts or a solution. As a possible fix, perhaps you could drop the check that stops the function when fdir already exists, or set recursive to FALSE.
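The nested path reported here is what one would get if an absolute fdir were appended to the calling directory. That is only one possible explanation, an assumption rather than a diagnosis of MLSTar's code. The base R sketch below uses hypothetical helpers (`is_absolute`, `resolve_out_dir`, `ensure_dir`) to show the symptom and a way to avoid it.

```r
# Hypothetical helpers, not part of MLSTar.

# TRUE for paths starting with "/", "~" or a Windows drive letter
is_absolute <- function(path) grepl("^(/|~|[A-Za-z]:)", path)

# Join fdir to the base directory only when fdir is relative
resolve_out_dir <- function(fdir, base = getwd()) {
  if (is_absolute(fdir)) fdir else file.path(base, fdir)
}

# Create the directory (and any missing parents) only if it does not exist yet
ensure_dir <- function(path) {
  if (!dir.exists(path)) dir.create(path, recursive = TRUE)
  invisible(path)
}

fdir <- "/home/jsanchez/DATA/MLSTar/example2_test"
base <- "/home/jsanchez/DATA/MLSTar"

naive    <- paste0(base, fdir)         # reproduces the nested path from the report
resolved <- resolve_out_dir(fdir, base) # the absolute input is returned unchanged
print(naive)
print(resolved)
```

`ensure_dir(resolved)` would then create the folder at the requested location, and calling it again on an existing folder does nothing.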

Thank you very much
Jose F Sanchez

doMLST() Error on R Console

In R version 3.3.3 on Mac OS X Mavericks 10.9.5,
calling the doMLST() function by pasting it into the console printed this error:

Running BLASTN...Error in strsplit(db, "/") : non-character argument
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.

It ran without any problem in R version 3.5.0 on macOS High Sierra 10.13.4.
