Comments (4)
Dear @ramadatta,
Thank you for your inquiry and for your kind words. The current version of chewBBACA has no tool that does specifically what you ask, although this is definitely in our future plans. To obtain a core gene alignment, you would need to align the alleles at every locus (the allele files can be found in the chewBBACA directory) and then concatenate the aligned alleles for each isolate.
Hope this clarified your question.
from chewbbaca_tutorial.
Thanks @ramirma for the swift reply. I see your point.
A) Can I just clarify: the approach you explained seems to be a standalone alignment approach, i.e. using an alignment tool such as BLAST to align the alleles in each locus FASTA file separately against the genomes, extracting the matched genes, and forming an alignment from them?
B) However, I am more inclined to generate the core gene alignment from the allele call table (which, as I understand it, is available in cgMLST_completegenomes/cgMLST.tsv), like the table below. This is because my standalone BLAST allele assignments and chewBBACA's allele assignments for a sample may differ due to differences in parameters.
FILE  G1  G2     G3   G4     G5     G6
S1    1   INF-2  3    2      1      5
S2    1   1      1    1      NIPH   5
S3    1   2      3    4      1      3
S4    1   LNF    2    4      1      3
S5    1   2      ASM  2      1      3
S6    2   INF-8  3    PLOT3  PLOT5  3
Therefore, in this case, I would prefer to generate a core gene alignment based on chewBBACA's allele assignment for each sample rather than on the results obtained from a standalone tool. Please clarify if my interpretation is correct. Thank you very much!
Hello @ramadatta,
In your first comment you said that the schema you are using is an external schema. To use an external schema, you should start by running the PrepExternalSchema process to create a version of that schema that is compatible with chewBBACA. This step is necessary to ensure that the schema does not include invalid alleles. chewBBACA's allele calling algorithm requires complete coding sequences: sequences whose length is not a multiple of 3, that have invalid start or stop codons, that contain ambiguous bases, or that have internal stop codons are considered invalid alleles.
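To make the validity criteria above concrete, here is a minimal sketch of such a check. It is not chewBBACA's own code: the function name `invalid_reason` is hypothetical, and the start and stop codon sets are an assumption (the common bacterial start codons ATG/GTG/TTG and the standard stop codons TAA/TAG/TGA).

```python
# Hypothetical helper for illustration; not part of chewBBACA.
# Assumption: common bacterial start codons and the standard stop codons.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def invalid_reason(seq):
    """Return why `seq` is not a complete coding sequence, or None if it passes."""
    seq = seq.upper()
    # Anything other than A, C, G, T counts as an ambiguous base here.
    if set(seq) - set("ACGT"):
        return "ambiguous bases"
    if len(seq) == 0 or len(seq) % 3 != 0:
        return "length not a multiple of 3"
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in START_CODONS:
        return "invalid start codon"
    if codons[-1] not in STOP_CODONS:
        return "invalid stop codon"
    # Stop codons anywhere before the last codon truncate the protein.
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return "internal stop codon"
    return None
```

Running this over the alleles of an external schema before adaptation would give a rough idea of how many alleles are likely to be rejected, although the actual PrepExternalSchema process may apply additional or different criteria.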
After the schema adaptation process you can perform allele calling with the AlleleCall process to determine the allelic profiles of the samples of interest. If the external schema that you have adapted is a wgMLST schema, you will need to run the ExtractCgMLST process to determine the set of loci/genes that constitute the core genome based on the set of samples that you have classified. If the schema you have adapted was already a cgMLST schema, you can skip the cgMLST determination step.
The approach you suggest in point B) is the correct approach to generate the core-genome alignment. chewBBACA does not provide functions to generate MSAs for the core genes, but it includes MAFFT and Clustal as dependencies (used for MSA in the SchemaEvaluator process). You can use one of those dependencies to compute MSAs. You should start by creating new FASTA files with the alleles identified in your samples. For each column/gene in the cgMLST.tsv file, you will have to get the alleles' DNA sequences from the schema's files and write those sequences to a new FASTA file. The FASTA files that contain the alleles can be found in the schema's directory. Each FASTA file in the schema's directory corresponds to a gene and has all alleles for that gene.
If your cgMLST.tsv file has the following column:
FILE G1
S1 1
S2 2
S3 1
And the G1.fasta file in the schema's directory has the following structure:
>G1_1
ATGAAA...
>G1_2
ATGTTT...
>G1_3
ATGGGG...
You should get the DNA sequence identified for each sample and generate a FASTA file with the following structure:
>S1_G1_1
ATGAAA...
>S2_G1_2
ATGTTT...
>S3_G1_1
ATGAAA...
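The extraction step illustrated above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the function names `read_fasta` and `write_locus_fastas` are hypothetical, the profile table is tab-separated with sample identifiers in the first column, column names match the schema FASTA file names (with or without a .fasta suffix), and allele headers follow the locus_id pattern shown in the example. Calls that are not numeric after removing the INF- prefix are written as a header with an empty sequence.

```python
import csv
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a {header: sequence} dictionary."""
    records, header = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]
                records[header] = ""
            elif header is not None:
                records[header] += line
    return records


def write_locus_fastas(profiles_tsv, schema_dir, out_dir):
    """Write one FASTA per locus holding the allele called in each sample."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(profiles_tsv, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    loci = rows[0][1:]
    for col, locus in enumerate(loci, start=1):
        # Assumption: column names may or may not carry a ".fasta" suffix.
        stem = locus[:-len(".fasta")] if locus.endswith(".fasta") else locus
        alleles = read_fasta(Path(schema_dir) / f"{stem}.fasta")
        with open(out_dir / f"{stem}.fasta", "w") as out:
            for row in rows[1:]:
                sample, call = row[0], row[col]
                allele_id = call.replace("INF-", "")
                # Non-numeric calls (LNF, ASM, ...) get an empty sequence.
                seq = alleles.get(f"{stem}_{allele_id}", "") if allele_id.isdigit() else ""
                label = allele_id if seq else call
                out.write(f">{sample}_{stem}_{label}\n{seq}\n")
```

Calling `write_locus_fastas("cgMLST.tsv", "path/to/schema", "locus_fastas")` (all three paths are placeholders) would then produce one file per core locus, ready to be aligned.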
After performing this step for all columns/genes in the cgMLST.tsv file, you can use MAFFT to compute an MSA for the sequences in each FASTA file. You can then concatenate all MSAs to get the core-genome alignment. When you create the FASTA files with the alleles in all samples, you can include a header followed by an empty line for the LNF, ASM, PLOT and NIPH classifications. For classifications like INF-2, you should remove the INF- prefix and get the allele with identifier 2.
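The concatenation step could look something like the sketch below. It assumes the per-locus MSAs have already been computed and parsed into dictionaries keyed by sample identifier; the function name `concatenate_alignments` is hypothetical. Samples with no allele at a locus (missing from the dictionary or present with an empty sequence) are padded with gap characters so that every row of the final alignment has the same length, which is one possible way to handle LNF, ASM, PLOT and NIPH classifications.

```python
def concatenate_alignments(alignments, samples):
    """Join per-locus alignments into one sequence per sample.

    alignments: list of {sample: aligned_sequence}, one dict per locus.
    samples: sample identifiers in the desired output order.
    """
    core = {sample: [] for sample in samples}
    for aligned in alignments:
        # All non-empty rows of one MSA must share the same width.
        lengths = {len(seq) for seq in aligned.values() if seq}
        if len(lengths) > 1:
            raise ValueError("sequences in one alignment differ in length")
        width = lengths.pop() if lengths else 0
        for sample in samples:
            # Missing or empty entries become a run of gaps.
            core[sample].append(aligned.get(sample) or "-" * width)
    return {sample: "".join(parts) for sample, parts in core.items()}
```

Each dictionary could be built from the output of a command such as `mafft --auto G1.fasta > G1_aligned.fasta` (file names are placeholders), and the returned sequences written out as a single FASTA file with one record per sample.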
Let us know if this is enough to clarify any doubts and help you start performing the analysis to get a core-genome alignment.
Best regards,
rfm
Hi @rfm-targa. Thank you so much for the comprehensive answer. Your post clarifies pretty much all my questions. Let me read a bit more about the LNF, ASM, PLOT and NIPH classifications and start working towards generating a core genome alignment. If there is nothing else, you can close this issue. Thanks so much again.