Code Monkey home page Code Monkey logo

3d-dna's Introduction

3D de novo assembly (3D DNA) pipeline

This code is designed to enable anyone to reproduce the Hs2-HiC and the AaegL4 genomes reported in: Dudchenko et al., De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science, 2017.

Unless otherwise noted, all terminology below is consistent with this paper, and all references to figures and tables in this readme refer to this paper. Specifically, some of the terminology used below is outlined in Figure S2. The assembly procedure is described in detail in the Supporting Online Materials, specifically in the section labelled “Pipeline description”.

In addition, the pipeline uses tools and methods from Juicer (Durand & Shamim et al., Cell Systems, 2016) and Juicebox (Durand & Robinson et al., Cell Systems, 2016), as well as additional dependencies noted below.

Feel free to post your questions and comments at: http://www.aidenlab.org/forum.html

As input, the pipeline requires: [1] a fasta file describing the draft assembly, and [2] a file called “merged_nodups.txt” which is generated by the Juicer pipeline, and contains a duplicate-free list of paired alignments from an in situ Hi-C experiment.

The output of the pipeline includes a fasta file containing the output scaffolds, as well as hic files corresponding to various stages of the assembly.

The content of this README is as follows:

1. Overview of the pipeline
2. Individual modules
3. Output files
4. Guide to reproducing the Hs2-HiC and AaegL4 assemblies
5. Citations and licensing

1. Overview of the pipeline

Software version: 170123

An overview of the detailed workflow of the 3D DNA pipeline is schematically given in Fig. S1 (see Dudchenko et al., 2017).

We begin with a series of iterative steps whose goal is to eliminate misjoins in the input scaffolds. Each step begins with a scaffold pool (initially, this pool is the set of input scaffolds themselves). The scaffolding algorithm is used to order and orient these scaffolds. Next, the misjoin correction algorithm is applied to detect errors in the scaffold pool, thus creating an edited scaffold pool. Finally, the edited scaffold pool is used as an input for the next iteration of the misjoin correction algorithm. The ultimate effect of these iterations is to reliably detect misjoins in the input scaffolds without removing correctly assembled sequence. After this process is complete, the scaffolding algorithm is applied to the revised input scaffolds, and the output – a single “megascaffold” which concatenates all the chromosomes – is retained for post-processing.

This post-processing includes four steps: (i) a polishing algorithm, which is required for genomes in the Rabl configuration; (ii) a chromosome splitting algorithm, which is used to extract the chromosome-length scaffolds from the megascaffold; (iii) a sealing algorithm, which detects false positives in the misjoin correction process, and restores the erroneously removed sequence from the original scaffold; and (iv) a merge algorithm, which corrects misassembly errors due to undercollapsed heterozygosity in the input scaffolds. Steps (i) and (iv) are omitted for Hs2-HiC, which is not in the Rabl configuration and lacks substantial undercollapsed heterozygosity.

Prerequisites

  • LastZ (version 1.03.73 released 20150708) – for AaegL4 only
  • Python >=2.7
  • scipy numpy matplotlib
  • Java version >=1.7
  • Bash >=4
  • GNU Awk >=4.0.2
  • GNU coreutils sort >=8.11

Recommended

  • GNU Parallel >=20150322 – highly recommended to increase performance

Installation

There is no need for installation.

Testing

The pipeline was run on the following operating systems:

  • Red Hat Enterprise Linux Server release 7.3
  • CentOS Linux release 7.2.1511
  • Ubuntu 14.04.5
  • OS X Yosemite 10.10.5

Overview

The pipeline consists of one bash wrapper script (run-pipeline.sh) that calls individual modules to assemble a genome. Run

./run-pipeline.sh –h

to list arguments and parameters. The latter include:

-m haploid/diploid - specifies whether to run the pipeline in the haploid (Hs2-HiC) or diploid (AaegL4) mode. The default is the haploid mode

-t int - specifies the minimal draft contig/scaffold size to be considered as input

-s int - specifies the number of rounds of misjoin correction

-c int - specifies the expected number of chromosomes

2. Individual modules

The wrapper script calls individual modules of the pipeline. Code related to any particular module is organized into individual folders. The modules can also be run as separate scripts. The list of individual modules with their main wrapper scripts is given below:

a) scaffold (./run-liger-scaffolder.sh) – scaffold the draft assembly

b) visualize (./run-asm-visualizer.sh) – make Juicebox-compatible contact map

c) edit (./run-misassembly-detector.sh) – annotate regions consistent with a misassembly

d) polish (./run-asm-pplisher.sh) – polish assembly (AaegL4 only)

e) split (./run-asm-splitter.sh) – split scaffolder output into raw chromosomal scaffolds

f) seal (./seal-asm.sh) – reintroduce some fragments back into the assembly

g) merge (./run-asm-merger.sh) – merge fragments of the assembly that represent alternative haplotypes (AaegL4 only)

h) finalize (./finalize-output.sh) – generates an output fasta

Additional modules include:

i) utils – several core scripts that are used across modules

j) lift – several core scripts to liftover coordinates from the draft to assembly and vice versa

k) supp – several additional scripts to match available map data and generate initial conditions

l) data – mapping data tables used for validation

3. Output files

The pipeline generates a number of files. The types of files are listed below. The main outputs are the fasta files annotated as “FINAL” which contains the output scaffolds.

a) .fasta files

  • “FINAL” – chromosome-length scaffolds, small and tiny scaffolds
  • “rawchrom” – Scaffold populations analyzed in Table S2.

b) .hic files

  • “polished” – after polishing stage (for diploid mode only)
  • “resolved” – after editing and scaffolding
  • [0123…] – correspond to the assembly at individual editing iterations

c) .scaffold_track.txt & .superscaf_track.txt

  • Scaffold boundary files (Juicebox 2D annotation format)

d) .bed & .wig files

  • Tracks illustrating putative misjoins

e) .cprops and .asm files

  • Custom files that tracks modifications to the input contigs at various stages in the assembly.

f) supplementary files:

  • edits.for.step.*.txt; mismatches.at.step.*.txt; suspect_2D.at.step.*.txt - list of problematic regions (Juicebox 2D annotation format)
  • alignments.txt (for diploid mode only) – pairwise alignment data for alternative haplotype candidates, parsed from LASTZ output

4. How to reproduce the Hs2-HiC and AaegL4 assemblies

a) Where to obtain the input files:

  • Draft fasta files:

    • For Hs2-HiC:

      Download the Hs1.fasta draft assembly from GEO (series accession number GSE95797).

    • For AaegL4:

      Run ./supp/get-AaegL2.sh

      The script will download the scaffold sequences from the AaegL2 assembly (NCBI accession number GCA_000004015.2) and order the fasta in accordance with the NCBI assembly_report file there.

  • Merged_nodups.txt files produced using Juicer and the draft assembly:

    • Download the Hs1.mnd.txt and AaegL2.mnd.txt from GEO (series accession number GSE95797).

b) How to run the pipeline:

  • Hs2-HiC: ./run-pipeline.sh –m haploid –t 15000 –s 2 –c 23 Hs1.fasta Hs1.mnd.txt

  • AaegL4: ./run-pipeline.sh –m diploid –t 10000 –s 9 –c 3 AaegL2.fasta AaegL2.mnd.txt

Note that the order and orientation of chromosome-length scaffolds in the final assemblies that we uploaded to GEO was changed to match the conventional order and orientation (i.e. hic_scaffold_1 in the GEO upload is a homolog to chromosome 1, hic_scaffold_2 is a homolog to chromosome 2 etc.).

5. Citations and licensing

If you use this code or the resulting assemblies, please cite the following paper:

Dudchenko, O., Batra, S.S., Omer, A.D., Nyquist, S.K., Hoeger, M., Durand, N.C., Shamim, M.S., Machol, I., Lander, E.S., Aiden, A.P., et al. (2017). De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science eaal3327. Published online ahead of print on March 23, 2017.

Please also remember to cite the paper that generated the draft Aedes aegypti assembly:

Nene, V., Wortman, J.R., Lawson, D., Haas, B., Kodira, C., Tu, Z.J., Loftus, B., Xi, Z., Megy, K., Grabherr, M., et al. (2007). Genome sequence of Aedes aegypti, a major arbovirus vector. Science 316, 1718–1723.

This software is distributed under The MIT License (MIT).

3d-dna's People

Contributors

dudcha avatar sanjitsbatra avatar theaidenlab avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.