This repository provides a molecular epidemiology analysis of SARS-CoV-2 data from GISAID.
The molecular-epi-ncov package requires a standard server or computer with enough RAM to support in-memory operations.
All code has been tested on Linux (Ubuntu 20.04.1) and macOS.
The R script is compatible with Windows, Mac, and Linux operating systems.
You will require the following software installed on your server or computer before starting:
From a terminal, use pip to install the following Python dependencies before running the script:
pip install pandas seaborn matplotlib
From an R session, type:
install.packages(c("reshape2", "ggplot2", "htmlwidgets", "webshot"))
devtools::install_github("hrbrmstr/streamgraph")
- ancestral_sequence: Mutation table of ancestral sequences reconstructed by TreeTime.
- clade-distribution: R package code.
- exampledata: Example data randomly extracted from the GISAID database and China: Guangdong sequences.
- accession_number.txt: Accession numbers of China: Guangdong sequences.
- participant_characteristics.png: SARS-CoV-2 seroprevalence survey and participant characteristics
To obtain the most recent GISAID SARS-CoV-2 data in a single file, use the batch download feature on the GISAID website (https://gisaid.org/). The files you need are metadata_tsv_2024_01_27.tar.xz and sequences_fasta_2024_01_27.tar.xz. Please note that GISAID membership is required to access this feature.
To download the SARS-CoV-2 reference dataset, run:
nextclade dataset get \
--name 'nextstrain/sars-cov-2/wuhan-hu-1' \
--tag '2024-01-28T00:00:00Z' \
--output-dir ~/sars-cov-2-2024-01-28update
To filter out sequences with suspiciously clustered single-nucleotide polymorphisms (SNPs) (quality control (QC) SNP clusters status metric not “good”; ≥ 6 mutations in 100 bases), too many private mutations (QC private mutations status metric not “good”; ≥ 10 mutations from the nearest tree node), or overall bad quality (Nextclade QC overall status “bad”), run seq_filter.ipynb (global sequences) or gd_seq.ipynb (China: Guangdong sequences) in Jupyter Notebook or VS Code.
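The three QC criteria can be illustrated with a short, standard-library-only Python sketch. It is not the code in seq_filter.ipynb or gd_seq.ipynb, which may apply the criteria differently. It assumes the input is a Nextclade TSV result file with the columns `seqName`, `qc.overallStatus`, `qc.snpClusters.status`, and `qc.privateMutations.status`; check these names against the output of your Nextclade version. The function names are hypothetical.

```python
import csv
import io

# Column names assumed to match Nextclade's TSV output; verify for your version.
SNP_CLUSTERS = "qc.snpClusters.status"
PRIVATE_MUTATIONS = "qc.privateMutations.status"
OVERALL = "qc.overallStatus"


def passes_qc(row):
    """Return True if a row fails none of the three exclusion criteria."""
    return (
        row[SNP_CLUSTERS] == "good"          # no suspicious SNP clusters
        and row[PRIVATE_MUTATIONS] == "good"  # not too many private mutations
        and row[OVERALL] != "bad"             # overall quality not "bad"
    )


def passing_sequence_names(tsv_handle):
    """Read a tab-separated Nextclade result and list the sequences to keep."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row["seqName"] for row in reader if passes_qc(row)]
```

To use it on a file, open the file and pass the handle, for example `passing_sequence_names(open("nextclade.tsv", newline=""))`, where `nextclade.tsv` is a placeholder file name.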
Run clade_distribution.ipynb to obtain weekly_clade_distribution.csv, then run clade-distribution/scripts/clade.R to obtain genotype distribution plots.
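As a rough illustration of what a weekly clade distribution involves, the sketch below counts sequences per week and clade using only the Python standard library. It is not the code in clade_distribution.ipynb. The input keys `date` and `clade`, the Monday-based week start, the output column names, and the function names are all assumptions made for the example. The actual layout of weekly_clade_distribution.csv may differ.

```python
import csv
import io
from collections import Counter
from datetime import date, timedelta


def weekly_clade_counts(rows):
    """Count sequences per (week start, clade).

    `rows` is an iterable of dicts with the hypothetical keys "date"
    (YYYY-MM-DD) and "clade". Rows whose date cannot be parsed are skipped.
    """
    counts = Counter()
    for row in rows:
        try:
            collected = date.fromisoformat(row["date"])
        except ValueError:
            continue  # unparseable or incomplete collection date
        # Shift back to the Monday of the same week.
        week_start = collected - timedelta(days=collected.weekday())
        counts[(week_start.isoformat(), row["clade"])] += 1
    return counts


def counts_to_csv(counts):
    """Serialize the counts as long-format CSV text (one row per week and clade)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["week_start", "clade", "count"])
    for (week_start, clade), n in sorted(counts.items()):
        writer.writerow([week_start, clade, n])
    return buffer.getvalue()
```

A long format (one row per week and clade) is used here because it is easy to reshape in R afterwards. Whether clade.R expects this shape is not stated in this README.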
Note: Before starting, run ancestral_sequence.sh to obtain the ancestral sequences.
To generate the mutation heatmap of SARS-CoV-2, run mutation_heatmap.ipynb. Ancestral sequences were reconstructed using TreeTime. Taking BA.5.2.48 as an example, run sh ancestral_sequence.sh to retrieve its ancestral node. The command treetime ancestral is computationally intensive and requires a significant amount of memory.
This project is licensed under the MIT License.