mgse's Introduction

Mapping-based Genome Size Estimation (MGSE)

MGSE uses files generated in genome sequencing projects to predict the genome size. It requires a FASTA file containing a highly contiguous assembly and a BAM file with all available reads mapped to this assembly. The script construct_cov_file.py (https://doi.org/10.1186/s12864-018-5360-z) generates a COV file from the (sorted) BAM file; MGSE can also do this directly. MGSE then uses this COV file to calculate the coverage in the provided reference regions and the total number of mapped bases. The genome size estimate is calculated from these two values. Providing accurate reference regions is crucial for this estimate. Different alternatives were evaluated, and actual single copy BUSCOs (https://busco.ezlab.org/) appear to be the best choice. Running BUSCO prior to MGSE will generate all necessary files.
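
The relation between the two values can be sketched as follows. This is a minimal illustration and not MGSE's actual code: it assumes that the estimate is the total number of mapped bases divided by the average coverage in the reference regions, and that a COV file holds one TAB-separated line per position (sequence ID, position, coverage). The function names are hypothetical.

```python
import statistics
from collections import defaultdict


def load_cov(lines):
    """Collect per-position coverage values for each sequence.

    Assumed line format: sequence ID, 1-based position, coverage (TAB-separated),
    with every position of a sequence listed in ascending order.
    """
    cov = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seq_id, _position, depth = line.split("\t")
        cov[seq_id].append(float(depth))
    return dict(cov)


def estimate_genome_size(cov, regions):
    """Return genome size estimates (in bp) based on mean and median coverage.

    regions: iterable of (seq_id, start, end), 1-based and inclusive.
    """
    ref_values = []
    for seq_id, start, end in regions:
        ref_values.extend(cov.get(seq_id, [])[start - 1:end])
    if not ref_values:
        raise ValueError("no coverage values found in the reference regions")

    # Sum of all per-position coverage values = total number of mapped bases.
    total_mapped_bases = sum(sum(values) for values in cov.values())
    mean_cov = statistics.mean(ref_values)
    median_cov = statistics.median(ref_values)
    return {
        "mean": total_mapped_bases / mean_cov if mean_cov else 0,
        "median": total_mapped_bases / median_cov if median_cov else 0,
    }


# Toy example: a 10 bp contig in which positions 6-10 carry twice the coverage,
# as expected for a region that is collapsed in the assembly.
toy_lines = [f"chr1\t{pos}\t{10 if pos <= 5 else 20}" for pos in range(1, 11)]
toy_cov = load_cov(toy_lines)
print(estimate_genome_size(toy_cov, [("chr1", 1, 5)]))
```

In this toy case the assembly is 10 bp long, but the estimate is 15 bp because the collapsed region contributes twice as many mapped bases as a single copy region of the same length.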

MGSE workflow (Pucker, 2021; doi:10.1101/607390)
Usage:
  python3 MGSE3.py [--cov <COV_FILE_OR_DIR> | --bam <BAM_FILE_OR_DIR>] --out <DIR>
                 [--ref <TSV> | --gff <GFF> | --busco <FULL_TABLE.TSV> | --all]

Mandatory:
  Coverage data (choose one)
  --cov STR          Coverage file (COV) created by construct_cov_file.py or directory containing
                     multiple coverage files
  --bam STR          BAM file to automatically create the coverage file
  
  Output directory
  --out STR          Output directory

  Reference regions to calculate average coverage (choose one)
  --ref STR          File containing TAB-separated chromosome, start, and end
  --gff STR          GFF3 file containing genes given by BUSCO
  --busco STR        BUSCO annotation file (full_table_busco_run.tsv)
  --all              Use all positions of the assembly
		
Optional:
  --black STR       Sequence ID list for exclusion
  --gzip            Search for files "*cov.gz" in --cov if this is a directory
  --bam_is_sorted   Do not sort BAM file
  --samtools STR    Full path to samtools (if not in your $PATH)
  --bedtools STR    Full path to bedtools (if not in your $PATH)
  --name STR        Prefix for output files []
  --m INT           Samtools sort memory [5000000000]
  --threads INT     Samtools sort threads [4]
  --plot TRUE|FALSE      Activate or deactivate generation of figures via matplotlib [FALSE]
  --blackoff TRUE|FALSE  Deactivate the black listing of contigs with high coverage values [FALSE]

WARNING:

  • Use absolute paths for all files and directories (MGSE may require them; at a minimum they are recommended).
  • By default, contigs with very high coverage values are put on a black list to prevent inflation of the genome size prediction by plastome contigs (in plants). This function can be disabled via --blackoff to estimate genome sizes with more fragmented assemblies.
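
The idea behind the black list can be illustrated with a short sketch. It is not taken from MGSE: the cutoff rule (mean contig coverage above a multiple of the median across contigs), the `factor` value, and the function name are hypothetical and only show how high-coverage contigs such as plastome sequences could be flagged.

```python
import statistics


def flag_high_coverage_contigs(mean_cov_per_contig, factor=1.5):
    """Return contig IDs whose mean coverage exceeds factor x the median contig coverage.

    The factor is a hypothetical cutoff chosen for illustration only.
    """
    typical = statistics.median(mean_cov_per_contig.values())
    return sorted(
        contig
        for contig, value in mean_cov_per_contig.items()
        if value > factor * typical
    )


# Toy example: three nuclear contigs and one plastome contig with very high coverage.
toy = {"chr1": 30.0, "chr2": 32.0, "chr3": 28.0, "plastome": 900.0}
print(flag_high_coverage_contigs(toy))
```

With few contigs or a highly fragmented assembly, a rule of this kind can misclassify sequences, which is one reason why disabling the black list via --blackoff can be useful.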

Possible reference regions:

  1. --ref A very simple TAB-separated text file with information about chromosome, start, and end of regions which should be used as a reference set for the coverage calculation.

  2. --gff A GFF3 file with genes which should serve as reference regions. Only the GFF3 file produced by BUSCO can be used here.

  3. --busco This will extract the single copy BUSCOs from the provided TSV file.

  4. --all All positions of the assembly will be included in the average coverage calculation.
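
As an illustration of options 1 and 3, the sketch below extracts single copy BUSCOs from a full table and formats them as TAB-separated chromosome, start, and end lines. It is not MGSE's own parser. It assumes that the first five columns of the BUSCO table are BUSCO ID, status, sequence, start, and end, and that the status "Complete" marks single copy BUSCOs; the column layout can differ between BUSCO versions and modes, so check your own file. The function names are hypothetical.

```python
def single_copy_busco_regions(lines):
    """Return (sequence, start, end) tuples for BUSCOs with status 'Complete'."""
    regions = []
    for line in lines:
        # Skip comment/header lines and empty lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # 'Missing' entries have no coordinates; 'Duplicated' and 'Fragmented' are skipped.
        if len(fields) < 5 or fields[1] != "Complete":
            continue
        start, end = sorted((int(fields[3]), int(fields[4])))
        regions.append((fields[2], start, end))
    return regions


def to_ref_lines(regions):
    """Format regions as TAB-separated chromosome, start, end lines."""
    return [f"{seq}\t{start}\t{end}" for seq, start, end in regions]


# Toy table with invented BUSCO IDs and coordinates.
toy_table = [
    "# Busco id\tStatus\tContig\tStart\tEnd\tScore\tLength",
    "EOG0001\tComplete\tchr1\t100\t500\t300.0\t400",
    "EOG0002\tDuplicated\tchr1\t900\t1200\t250.0\t300",
    "EOG0003\tMissing",
    "EOG0004\tFragmented\tchr2\t5\t50\t20.0\t40",
    "EOG0005\tComplete\tchr2\t700\t900\t280.0\t200",
]
for ref_line in to_ref_lines(single_copy_busco_regions(toy_table)):
    print(ref_line)
```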

Usage:
  python construct_cov_file.py --in <BAM_FILE> --out <COV_FILE>

Mandatory:
  --in STR          BAM file
  --out STR         Output file

Optional:
  --bam_is_sorted   Do not sort BAM file
  --m INT           Samtools sort memory [5000000000]
  --threads INT     Samtools sort threads [4]
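
For orientation, the following sketch shows how a per-position coverage file could be produced from a BAM file with samtools and bedtools. It is an assumption about the general approach, not a copy of construct_cov_file.py: the exact commands and options used by the script may differ, and the function names and the name of the intermediate sorted BAM file are hypothetical.

```python
import subprocess


def build_cov_commands(bam, cov_out, is_sorted=False, memory=5000000000,
                       threads=4, samtools="samtools", bedtools="bedtools"):
    """Build the command lists for an optional sort step and the coverage step."""
    commands = []
    bam_for_cov = bam
    if not is_sorted:
        # Hypothetical name for the intermediate sorted BAM file.
        bam_for_cov = cov_out + ".sorted.bam"
        commands.append([samtools, "sort", "-@", str(threads), "-m", str(memory),
                         "-o", bam_for_cov, bam])
    # 'genomecov -d' reports the depth at every position; the result is written to stdout.
    commands.append([bedtools, "genomecov", "-d", "-split", "-ibam", bam_for_cov])
    return commands


def run_commands(commands, cov_out):
    """Run the sort step (if present) and redirect the genomecov output into cov_out."""
    *preparation, genomecov = commands
    for command in preparation:
        subprocess.run(command, check=True)
    with open(cov_out, "w") as handle:
        subprocess.run(genomecov, stdout=handle, check=True)


# Only print the commands here; run_commands() requires samtools and bedtools.
for command in build_cov_commands("/data/reads.bam", "/data/reads.cov"):
    print(" ".join(command))
```

Note that the samtools sort memory limit (-m) applies per thread, so the total memory demand grows with the number of threads.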

Reference:

Pucker B. Mapping-based genome size estimation. bioRxiv 607390; doi: https://doi.org/10.1101/607390

mgse's People

Contributors

bpucker, schellt, shaknat

mgse's Issues

small Genome size

Hi,
I was using MGSE to estimate genome size as follows:

python MGSE.py --bam taxon.sort.bam (Illumina reads mapped to genome) --out test --busco /run_busco/full_table_busco.tsv

However, this yielded a very small genome size estimate (half the size obtained with other approaches such as flow cytometry and GenomeScope). I also tried python MGSE.py --bam taxon.sort.bam (Illumina reads mapped to genome) --out test --all but the estimated genome size was even smaller.

Do you have any explanation for this observation?

Best,

Jacqueline

estimated genome size is not being displayed

Hi,
I've been trying to use MGSE to estimate the genome size of my species. Although I followed the instructions for the program, it did not calculate a genome size. This is the content of the report.txt file:
average coverage in reference regions (median): 0
average coverage in reference regions (mean): 0
total coverage (combined length of all mapped reads): 21971396415.0
total sequence length: 255888427
estimated genome size based on mean [Mbp]: 0
estimated genome size based on median [Mbp]: 0

This is the command I've run:
python2.7 MGSE.py --bam /home/buitracn/Genomes/corals/Pocillopora.verrucosa/01_Pver_lib4_DISCOVAR_genome_assembly_final/Pver_lib4_DISCOVAR_Assembly_Filtering/09_MGSE_GenomeSizeEstimation/Pver_alignment2haploidGenome_sorted_local.bam --bam_is_sorted --out /home/buitracn/Genomes/corals/Pocillopora.verrucosa/01_Pver_lib4_DISCOVAR_genome_assembly_final/Pver_lib4_DISCOVAR_Assembly_Filtering/09_MGSE_GenomeSizeEstimation --busco /home/buitracn/Genomes/corals/Pocillopora.verrucosa/01_Pver_lib4_DISCOVAR_genome_assembly_final/Pver_lib4_DISCOVAR_Assembly_Filtering/09_MGSE_GenomeSizeEstimation/full_table_6_Pver_lib4_filtered_ScaffCSAR_Gapfillep_A_ref_D_augustus_fly.tsv --name Pver

If you could please help me figure out what I've done wrong, I would greatly appreciate it.

Sincerely,

Carol
