Code Monkey home page Code Monkey logo

chewbbaca_tutorial's Introduction

Objective

The objective of this tutorial is to illustrate the complete workflow of a chewBBACA pipeline for creating a wgMLST and a cgMLST schema for a colection of 714 Streptococcus agalactiae genomes (32 complete genomes and 682 draft genome assemblies deposited on the NCBI databases) by providing step-by-step instructions and displaying the obtained outputs.

All information about the NCBI genomes used in this example is on the .tsv file inside the genomes folder.

Please start by going through the following steps:

  1. Install chewBBACA. Check Installing chewBBACA for instructions on how to install chewBBACA. chewBBACA includes Prodigal training files for several species, including for Streptococcus agalactiae. You can check the list of available training files here. We have included the training file for Streptococcus agalactiae in this repository.
  2. Clone this repository to the local folder of your choice. To clone, run the following command:
    git clone https://github.com/B-UMMI/chewBBACA_tutorial
  3. Go to the top-level directory of the cloned repository, .../chewBBACA_tutorial/, and run unzip genomes/complete_genomes.zip to extract all the complete genomes.

The execution times reported in this tutorial were obtained for a DELL XPS13 (10th Generation Intel® Core™ i7-10710U Processor - 12MB Cache, up to 4.7 GHz, using 6 cores). Using a computer with less powerful specifications can greatly increase the duration of the analyses.

The commands used in this tutorial assume that the working directory is the top-level directory of the cloned repository, .../chewBBACA_tutorial/. The commands should be modified if they are executed from a different working directory. We have included the expected results for each section in the expected_results folder for reference (each subfolder has the name of one of the sections).

Schema creation

We will start by creating a wgMLST schema based on 32 Streptococcus agalactiae complete genomes (32 genomes with a level of assembly classified as complete genome or chromossome) available at NCBI. The sequences are present in the complete_genomes/ directory. To create the wgMLST schema, run the following command:

chewBBACA.py CreateSchema -i complete_genomes/ -o tutorial_schema --ptf Streptococcus_agalactiae.trn --cpu 6

The schema seed will be available at tutorial_schema/schema_seed. We passed the value 6 to the --cpu parameter to use 6 CPU cores, but you should pass a value based on the specifications of your machine. In our system, the process took 56 seconds to complete resulting on a wgMLST schema with 3128 loci. At this point the schema is defined as a set of loci each with a single representative allele.

Allele calling

The next step is to perform allele calling with the wgMLST schema created in the previous step for the 32 complete genomes. The allele call step determines the allelic profiles of the analyzed strains, identifying known and novel alleles in the analyzed genomes. Novel alleles are assigned an allele identifier and added to the schema. To perform allele call, run the following command:

chewBBACA.py AlleleCall -i complete_genomes/ -g tutorial_schema/schema_seed -o results32_wgMLST --cpu 6

The allele call used the default BSR threshold of 0.6 (more information on the threshold here) and took approximately 17 minutes to complete (an average of 32 seconds per genome). The allele call identified 14,720 novel alleles and added those alleles to the schema, increasing the number of alleles in the schema from 3,128 to 17,848.

Paralog detection

The next step in the analysis is to determine if some of the loci can be considered paralogs, based on the result of the wgMLST allele calling. The Allele call returns a list of Paralogous genes in the RepeatedLoci.txt file that can be found on the results32_wgMLST/results_<datestamp> folder. The RepeatedLoci.txt file contains a set of 20 loci that were identified as possible paralogs. These loci should be removed from the schema due to the potential uncertainty in allele assignment (for a more detailed description see the Alelle Calling entry on the wiki). To remove the set of 20 paralogous loci from the allele calling results, run the following command:

chewBBACA.py RemoveGenes -i results32_wgMLST/results_<datestamp>/results_alleles.tsv -g results32_wgMLST/results_<datestamp>/RepeatedLoci.txt -o results32_wgMLST/results_<datestamp>/results_alleles_NoParalogs.tsv

This will remove the columns matching the 20 paralogous loci from the allele calling results and save the allelic profiles into the results_alleles_NoParalogs.tsv file (the new file contains allelic profiles with 3108 loci).

cgMLST schema determination

We can now determine the set of loci in the core genome based on the allele calling results. The set of loci in the core genome is determined based on a threshold of loci presence in the analysed genomes. We can run the TestGenomeQuality module to determine the impact of several threshold values on the number of loci in the core genome.

chewBBACA.py TestGenomeQuality -i results32_wgMLST/results_<datestamp>/results_alleles_NoParalogs.tsv -n 13 -t 200 -s 5 -o results32_wgMLST/results_<datestamp>/genome_quality_32

The process will automatically open a HTML file with the following plot:

Genome quality testing of complete genomes larger image fig 1 or see interactive plot online

A set of 1136 loci were found to be present in all the analyzed complete genomes, while 1267 loci were present in at least 95%. For further analysis only the 1267 loci present in at least 95% of the complete genomes will be used. We selected that threshold value to account for loci that may not be identified due to sequencing coverage and assembly problems.

We can run the ExtraCgMLST module to quickly determine the set of loci in the core genome at 95%.

chewBBACA.py ExtractCgMLST -i results32_wgMLST/results_<datestamp>/results_alleles_NoParalogs.tsv -o results32_wgMLST/results_<datestamp>/cgMLST_95 --t 0.95

The list with the 1267 loci in the core genome at 95% is in the results32_wgMLST/results_<datestamp>/cgMLST_95/cgMLSTschema.txt file. This file can be passed to the --gl parameter of the AlleleCall process to perform allele calling only for the set of genes that constitute the core genome.

Allele call for 682 Streptococcus agalactiae assemblies

682 assemblies of Streptococcus agalactiae available on NCBI were downloaded (03-08-2016, downloadable zip file here, run unzip GBS_Aug2016.zip to extract genome files into a folder named GBS_Aug2016) and analyzed with MLST in order to exclude possibly mislabeled samples as Streptococcus agalactiae. Out of the 682 genomes, 2 (GCA_000323065.2_ASM32306v2 and GCA_001017915.1_ASM101791v1) were detected as being of a different species/contamination and were removed from the analysis.

Allele call was performed on the bona fide Streptococcus agalactiae 680 genomes using the 1267 loci that constitute the core genome at 95%. Paralog detection found no paralog loci.

chewBBACA.py AlleleCall -i path/to/GBS_Aug2016/ -g tutorial_schema/schema_seed --gl results32_wgMLST/results_<datestamp>/cgMLST_95/cgMLSTschema.txt -o results680_cgMLST --cpu 6

The process took approximately 39 minutes to complete (an average of 3.4 secs per genome).

Evaluate genome quality

We can now concatenate the cgMLST results for the 32 complete genomes with the cgMLST results for the 680 genomes to have all the results in a single file. To concatenate the allelic profiles of both analyses run the following command:

chewBBACA.py JoinProfiles -p1 results32_wgMLST/results_<datestamp>/cgMLST_95/cgMLST.tsv -p2 results680_cgMLST/results_<datestamp>/results_alleles.tsv -o cgMLST_all.tsv

The concatenated file was analyzed in order to assess the cgMLST allele quality attribution for all the genomes.

chewBBACA.py TestGenomeQuality -i cgMLST_all.tsv -n 13 -t 300 -s 5

Genome quality testing of all genomes larger image here fig 2 or see interactive plot online

While the number of loci present in 95% of genomes remains virtually constant at around 1200 loci, considering all or most of the genomes (90%<x≤100%) the number of loci present is lower and presents some variation when specific genomes are removed from the analysis.

We selected the results at the threshold of 25 for further analysis. Although this selection is somewhat arbitrary, when moving to a lower threshold there is step increase in the number of loci present in 95% and 99% of genomes that could represent the exclusion of a more divergent clade from the analysis. Furthermore, at the threshold of 25 there is an acceptable number of loci present in all considered genomes (650 genomes/440 loci), which we felt would afford a good discriminatory power.

The genomes that were removed at each threshold are indicated in the file analysis_all/removedGenomes.txt and a file analysis_all/removedGenomes_25.txt was created with only the genomes removed at the 25 threshold.

The following command creates a directory analysis_all/cgMLST_25/ and saves the cgMLST schema selected at the chosen threshold to the file cgMLST.tsv.

chewBBACA.py ExtractCgMLST -i cgMLST_all.tsv -o cgMLST_25 --g removedGenomes_25.txt

Minimum Spanning Tree

analysis_all/cgMLST_25/cgMLST.tsv was uploaded to Phyloviz online and can be accessed here

Genome Quality analysis

Since the quality of the used assemblies was not confirmed, it is possible that some of the assemblies included were of low quality. A general analysis of the assemblies show a N50 variation that ranges from 8055 to over 2.2M, while the number of contigs ranges between 1 and 553. These results made us suspect that the quality of the genomes could have affected the allele call results and consequently caused a significant drop in the number of loci detected as present in all genomes.

As stated previously, to obtain the cgMLST schema, some genomes (n=62) had to be removed since they were extremes cases of missing data. In order to assess the possible reason for their poor allele call performance, two plots were built. The removed genomes were then highlighted and dashed lines were drawn linking the values for the same genomes.

The first plot represents the total number of bp in contigs with a size >10 kbp and the N50 of the assemblies, sorted by decreasing values.

Genome Analysis larger image fig 3

The second plot represents the total number of contigs and the number of contigs >10kbp.

Genome Analysis 2 larger image fig 4

See interactive plot online

At first sight, most of the removed genomes (56/62) were located on the lower range of N50 and bp in contigs >10 kbp (fig.3) and the higher number of contigs (fig.4).

The 5 genomes that were outside this pattern were individually checked:

  1. GCA_000186445.1 here - 21 contigs but only 1 is above 10k (Scaffold with lot of Ns, 134 real contigs)
  2. GCA_000221325.2 here- NCBI curated it out of RefSeq because it had a genome length too large
  3. GCA_000427055.1 here- NCBI curated it out of RefSeq because it had many frameshifted proteins
  4. GCA_000289455.1 here- No ST found. We concluded the assembly has a problem but we have not yet identified it.
  5. GCA_000288835.1 here- NCBI curated it out of RefSeq because it had many frameshifted proteins

Schema Evaluation

Schema Evaluator was run on the cgMLST schema:

chewBBACA.py SchemaEvaluator -i tutorial_schema/schema_seed/ -o schema_evaluation --cpu 6

See the schema evaluator page here

chewbbaca_tutorial's People

Contributors

rfm-targa avatar mickaelsilva avatar ramirma avatar pedrorvc avatar

Stargazers

 avatar Julian Zaugg avatar  avatar  avatar Stefan Rooke avatar  avatar

Watchers

João André Carriço avatar James Cloos avatar  avatar Bruno Gonçalves avatar  avatar Miguel Machado avatar Inês Mendes avatar  avatar  avatar  avatar

Forkers

wangmz0617

chewbbaca_tutorial's Issues

Mistake in the tutorial at Evaluate genome quality / missing step?

Hi,

Am doing the tutorial with my own dataset (Staphylococci) and it is great. However, at the "Evaluate genome quality" step I struggled with two things:

  1. it is unclear how the removedGenomes_25.txt is created; I created it manually, but there is no clear instruction on how to generate it. A comment may be helpful.
  2. the command line "chewBBACA.py ExtractCgMLST -i cgMLST_all.tsv -o cgMLST_25 -g removedGenomes_25.txt" is incorrect, the "-g" should be "--g" (double hyphen)

Just the icing on the cake! :)

external scheme creation

I tried to creat salmonella scheme by myself and entered the command line as follows"chewBBACA.py CreateSchema -i cgMLST/ -o Salmonellascheme --cpu 8 --ptf training/Salmonella_enterica.trn", but it prompted error message "chewbbaca scheme creation error:BLAST Database error: No alias or index file found for protein database [/home/qifeng/temp/protogenome1/blastdbs/protogenome1_proteins_db] in search path [/home/qifeng::" halfway. How can I fix it.

How to get the cgMLST_all.tsv

In the tutorial, "Now the file cgMLST_completegenomes/cgMLST.tsv can be concatenated with the allele call result from the 680 genomes results_all/results_20180202T112710/results_alleles.tsv", I don't know how to concatenate these two files to get the cgMLST_all.tsv

Error in AlleleCall tutorial

Heisann guys.

I am doing running the chewbacca in a conda environment (tried to downloaded chewbbaca both by conda and pip), and have basically copy pasted all your commands etc. I encounter error msg when running allelcall:

^MProcessing GCA-000007265-protein10.fasta. Start 22:19:33-22/07/2019 Locus 2 of 3130. Done 0%.some error occu$
%i format: a number is required, not NoneType
Error on line 466

It is numerous times for most loci and results in non-valid error allele assignments. I went into the schema_seed and had a look at the different loci too, they just look normal.
what am I doing wrong?

AK

Missing training file for tutorial

The file required in one of the first steps in the tutorial is missing: Streptococcus_agalactiae.trn

Could you please supply it as part of the tutorial

NotADirectoryError: [Errno 20] Not a directory: 'listgenes_core.txt/.schema_config'

Hi,

Can request to know how to solve this error? Many thanks in advance.

chewBBACA.py AlleleCall -i .genomes/ -g listgenes_core.txt -o results --cpu 6 --ptf Streptococcus_agalactiae.trn

chewBBACA version: 2.5.5
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Wiki: https://github.com/B-UMMI/chewBBACA/wiki
Tutorial: https://github.com/B-UMMI/chewBBACA_tutorial
Contacts: [email protected]

==========================
  chewBBACA - AlleleCall
==========================
It seems that your schema was created with chewBBACA 2.1.0 or lower.
It is highly recommended that you run the PrepExternalSchema process to guarantee full compatibility with the new chewBBACA version.
If you wish to continue, the AlleleCall process will convert the schema to v2.5.5, but will not determine if schema structure respects configuration values.
Do you wish to proceed?
yes

Adding Prodigal training file to schema...
Created listgenes_core.txt/Streptococcus_agalactiae.trn

Creating file with schema configs...
Traceback (most recent call last):
  File "/storage/apps/anaconda3/bin/chewBBACA.py", line 10, in <module>
    sys.exit(main())
  File "/storage/apps/anaconda3/lib/python3.6/site-packages/CHEWBBACA/chewBBACA.py", line 1549, in main
    functions_info[process][1]()
  File "/storage/apps/anaconda3/lib/python3.6/site-packages/CHEWBBACA/chewBBACA.py", line 414, in allele_call
    schema_directory)
  File "/storage/apps/anaconda3/lib/python3.6/site-packages/CHEWBBACA/utils/auxiliary_functions.py", line 224, in write_schema_config
    pickle_dumper(config_file, params)
  File "/storage/apps/anaconda3/lib/python3.6/site-packages/CHEWBBACA/utils/auxiliary_functions.py", line 77, in pickle_dumper
    with open(pickle_out, 'wb') as po:
NotADirectoryError: [Errno 20] Not a directory: 'listgenes_core.txt/.schema_config'

Extracting core gene alignment

Hi @rfm-targa and authors,

Thank you very much for writing an amazing tool.

I have an external schema for stenotrophomonas through which I want to do cgMLST analysis and then generate a core gene alignment. May I know if the alignment could be created using cgMLST_completegenomes/Presence_Absence.tsv? May I know, if there is there any easy way to do with chewBBACA? Can request your advice. Thanks very much in advance!

Edit1: Hi. With a bit of reading, I think what I need is: To generate core gene alignment using cgMLST_completegenomes/cgMLST.tsv. Can I kindly request advice if my approach is correct and help in generating alignment so that I do not need to reinvent the wheel.

About the prodigal

Hi, I am new to chewBBACA. Through my practice and I noticed in the process of CreateSchema we should input a file suffixed with .trn, and I also was not familiar with prodigal. The codes are blow:

chewBBACA.py CreateSchema -i complete_genomes/ --cpu 6 -o schema_seed --ptf Streptococcus_agalactiae.trn

So, could you mind to tell me how to use prodigal to make a training set? Thanks.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.