
Comments (4)

ramirma commented on July 3, 2024

Dear @ramadatta,

Thank you for your inquiry and for your kind words. The current version of chewBBACA has no tool that does specifically what you ask, although this is definitely in our future plans. If you want a core gene alignment, you would need to align the alleles at every locus (the allele files can be found in the chewBBACA directory) and then concatenate the aligned alleles for each isolate.

I hope this answers your question.

from chewbbaca_tutorial.

ramadatta commented on July 3, 2024

Thanks @ramirma for the swift reply. I see your point.

A) Can I just clarify: the approach you explained seems to be a standalone alignment approach, that is, using an alignment tool such as BLAST to align the alleles in each locus FASTA file separately against the genomes, extract the matched genes, and form an alignment?

B) However, I am more inclined to generate the core gene alignment from an allele call table like the one below (which, as I understand it, is available in cgMLST_completegenomes/cgMLST.tsv). This is because my standalone BLAST allele assignments and chewBBACA's allele assignments for a sample may differ due to differences in parameters.

FILE    G1      G2      G3      G4      G5      G6
S1      1       INF-2   3       2       1       5
S2      1       1       1       1       NIPH    5
S3      1       2       3       4       1       3
S4      1       LNF     2       4       1       3
S5      1       2       ASM     2       1       3
S6      2       INF-8   3       PLOT3   PLOT5   3 

Therefore, in this case, I would prefer to generate a core gene alignment based on chewBBACA's allele assignment for each sample rather than on the results obtained from a standalone tool. Please confirm whether my interpretation is correct. Thank you very much!

rfm-targa commented on July 3, 2024

Hello @ramadatta,

In your first comment you said that the schema you are using is an external schema. To use an external schema, you should start by running the PrepExternalSchema process to create a version of that schema that is compatible with chewBBACA. This step is necessary to ensure that the schema does not include invalid alleles. The chewBBACA allele calling algorithm requires complete coding sequences: sequences whose length is not a multiple of 3, that have invalid start or stop codons, that contain ambiguous bases, or that have internal stop codons are considered invalid alleles.
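To make the validity conditions above concrete, here is a minimal Python sketch of such a check. It is an illustration only, not chewBBACA's actual implementation: the function name `invalid_reason` is hypothetical, and the start and stop codon sets are an assumption (a common bacterial choice) that may not match what PrepExternalSchema uses for your translation table.

```python
# Assumed codon sets for illustration; adjust to your translation table.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def invalid_reason(seq):
    """Return why `seq` is not a complete CDS, or None if it passes all checks."""
    seq = seq.upper()
    if not seq:
        return "empty sequence"
    if len(seq) % 3 != 0:
        return "length not a multiple of 3"
    if set(seq) - set("ACGT"):
        return "ambiguous bases"
    # Split into codons; safe because the length is a multiple of 3.
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in START_CODONS:
        return "invalid start codon"
    if codons[-1] not in STOP_CODONS:
        return "invalid stop codon"
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return "internal stop codon"
    return None
```

Running a check like this over an external schema before adaptation can give a rough idea of how many alleles are likely to be discarded, but the output of PrepExternalSchema remains the reference.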

After the schema adaptation process, you can perform allele calling with the AlleleCall process to determine the allelic profiles of the samples of interest. If the external schema that you have adapted is a wgMLST schema, you will need to run the ExtractCgMLST process to determine the set of loci/genes that constitute the core genome based on the set of samples that you have classified. If the schema you have adapted was already a cgMLST schema, you can skip the cgMLST determination step.

The approach you suggest in point B) is the correct approach to generate the core-genome alignment. chewBBACA does not provide functions to generate MSAs for the core genes, but it includes MAFFT and Clustal as dependencies (used for MSA in the SchemaEvaluator process). You can use one of those dependencies to compute the MSAs. You should start by creating new FASTA files with the alleles identified in your samples. For each column/gene in the cgMLST.tsv file, you will have to get the alleles' DNA sequences from the schema's files and write those sequences to a new FASTA file. The FASTA files that contain the alleles can be found in the schema's directory. Each FASTA file in the schema's directory corresponds to one gene and contains all alleles for that gene.

If your cgMLST.tsv file has the following column:

FILE    G1
S1      1
S2      2
S3      1

And the G1.fasta file in the schema's directory has the following structure:

>G1_1
ATGAAA...
>G1_2
ATGTTT...
>G1_3
ATGGGG...

You should get the DNA sequence identified for each sample and generate a FASTA file with the following structure:

>S1_G1_1
ATGAAA...
>S2_G1_2
ATGTTT...
>S3_G1_1
ATGAAA...
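The step illustrated above could be sketched in Python roughly as follows. This is a minimal sketch, not part of chewBBACA: the helper names `parse_fasta` and `gene_fasta` are hypothetical, and it assumes that schema headers follow the `<gene>_<allele id>` pattern shown in the example, which may not hold for every schema.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header (first word) to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def gene_fasta(gene, calls, alleles):
    """Build per-sample FASTA text for one gene.

    calls: dict mapping sample name to allele identifier (one cgMLST.tsv column).
    alleles: dict from parse_fasta, keyed as '<gene>_<allele id>' (assumed format).
    """
    lines = []
    for sample, allele in calls.items():
        key = f"{gene}_{allele}"
        lines.append(f">{sample}_{key}")
        lines.append(alleles[key])
    return "\n".join(lines) + "\n"
```

For the G1 column above, `gene_fasta("G1", {"S1": "1", "S2": "2", "S3": "1"}, parse_fasta(schema_text))` would produce the three records in the same order as the samples, where `schema_text` stands for the contents of G1.fasta.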

After performing this step for all columns/genes in the cgMLST.tsv file, you can use MAFFT to compute an MSA for the sequences in each FASTA file. You can then concatenate all MSAs to get the core-genome alignment. When you create the FASTA files with the alleles in all samples, you can include a header followed by an empty line for the LNF, ASM, PLOT, and NIPH classifications. For classifications like INF-2, you should get the allele with identifier 2 (that is, remove the INF- prefix).
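The classification handling and the final concatenation might look like the sketch below. It is a minimal illustration under stated assumptions: `allele_id` and `concatenate_msas` are hypothetical helper names, each per-gene MSA is assumed to be already loaded as a dict from sample name to aligned sequence, and filling a sample with no called allele with gap characters is a choice made here for illustration rather than something chewBBACA or MAFFT prescribes.

```python
def allele_id(call):
    """Return the allele identifier for a table cell, or None if no allele was called.

    'INF-2' becomes '2'; non-numeric classifications such as LNF, ASM,
    PLOT3, PLOT5 and NIPH return None.
    """
    call = call.strip()
    if call.startswith("INF-"):
        call = call[len("INF-"):]
    return call if call.isdigit() else None


def concatenate_msas(msas, samples):
    """Concatenate per-gene alignments into one alignment per sample.

    msas: list of dicts mapping sample name to aligned sequence, one per gene.
    Samples that are missing or empty in a gene are padded with '-' (assumption).
    """
    core = {sample: [] for sample in samples}
    for msa in msas:
        lengths = {len(seq) for seq in msa.values() if seq}
        if len(lengths) != 1:
            raise ValueError("aligned sequences in one MSA must share one length")
        width = lengths.pop()
        for sample in samples:
            core[sample].append(msa.get(sample) or "-" * width)
    return {sample: "".join(parts) for sample, parts in core.items()}
```

Keeping the sample order fixed across genes is what makes the concatenation meaningful: every row of the final alignment must refer to the same sample in every gene block.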

Let us know if this clarifies your doubts and is enough to help you start the analysis to get a core-genome alignment.

Best regards,

rfm

ramadatta commented on July 3, 2024

Hi @rfm-targa. Thank you so much for the comprehensive answer. Your post clarifies pretty much all of my questions. Let me read a bit more about the LNF, ASM, PLOT, and NIPH classifications and start working towards generating a core genome alignment. If there is nothing else, you can close this issue. Thanks so much again.
