
dipasm's Introduction

DipAsm: Efficient chromosome-scale haplotype-resolved assembly of human genomes

Haplotype-resolved or phased sequence assembly provides a complete picture of genomes and complex genetic variations. However, current phased assembly algorithms either fail to generate chromosome-scale phasing or require pedigree information, which limits their application. We present a method that leverages long accurate reads and long-range conformation data for single individuals to generate chromosome-scale phased assembly within a day. Applied to three public human genomes, PGP1, HG002, and NA12878, our method produced haplotype-resolved assemblies with contig NG50 up to 25 Mb and phased ∼99.5% of heterozygous sites to 98–99% accuracy, outperforming the trio-based approach in terms of both contiguity and phasing completeness. We demonstrate the importance of chromosome-scale phased assemblies for discovering structural variants, including thousands of new transposon insertions, and for resolving highly polymorphic and medically important regions such as HLA and KIR. Our improved method will enable high-quality precision medicine and facilitate new studies of individual haplotype variation and population diversity.

See our preprint here: https://doi.org/10.1101/810341.

Installation

DipAsm requires Docker because DeepVariant is run through it. Make sure that Docker is installed and that the Docker service is running.

mkdir -p dipasm
cd dipasm
git clone https://github.com/shilpagarg/DipAsm.git
cd DipAsm/docker
docker build -t dipasm .
cd ../../..
docker run -it --rm -v $PWD/dipasm/DipAsm:/wd/dipasm/DipAsm/ -e HOSTWD=$PWD/dipasm/DipAsm -v /var/run/docker.sock:/var/run/docker.sock dipasm:latest /bin/bash

The docker run -it command starts an interactive Docker container session. You will be inside the container environment, which has DipAsm and the test data preinstalled.
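
Before building the image, it can help to confirm that the docker client is on the PATH and that the daemon answers. The following is a minimal sketch and not part of DipAsm; the function names have_cmd and docker_ready are made up for illustration.

```shell
# Hypothetical helper (not part of DipAsm): succeed if a command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: succeed only if the docker client exists and the
# daemon answers `docker info`.
docker_ready() {
    if ! have_cmd docker; then
        echo "docker not found on PATH" >&2
        return 1
    fi
    if ! docker info >/dev/null 2>&1; then
        echo "docker is installed but the daemon is not reachable" >&2
        return 1
    fi
    echo "docker is ready"
}
```

Calling docker_ready before docker build -t dipasm . separates a missing installation from a daemon that is stopped or not accessible to the current user.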

Test example with docker

You can run the test for DipAsm within the docker container environment by:

cd /wd/dipasm/DipAsm
bash test.sh | bash
ls test_output/out/assemble/test-H?.fasta  # final assembly

Run

Once you enter the virtual Docker environment with the docker command line shown above, go into the /wd/dipasm/DipAsm directory and run python pipeline.py. You will see:

Usage: pipeline.py [-h] --hic-path PATH --pb-path PATH --sample NAME [--female]
                   --prefix STR

Optional arguments:
  -h, --help         show this help message and exit
  --hic-path PATH    Use Hi-C data from this path. Should be named by *1.fastq
                     and *2.fastq.
  --pb-path PATH     Use PacBioCCS data from this path. All fastq will be
                     used.
  --sample NAME      Sample name to put for Read Group of BAM and Sample of
                     VCF.
  --prefix STR       Prefix name for the experiment, for example "refBased",
                     "ragooBased".

Example:

python pipeline.py --hic-path data/hic --pb-path data/pacbiocss --sample PGP1 --prefix asm
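
Since --hic-path expects files named *1.fastq and *2.fastq, a quick look at the directory before launching the pipeline may save a failed run. The sketch below is not part of DipAsm: check_hic_pairs is a hypothetical name, and the rule that both patterns must match the same number of files is an assumption about paired Hi-C input rather than something the pipeline documents.

```shell
# Hypothetical pre-flight check (not part of DipAsm).
# Counts the regular files directly inside a directory that match the
# *1.fastq and *2.fastq patterns from the usage text. Fails if no file
# matches *1.fastq or if the two counts differ (an assumption, see above).
check_hic_pairs() {
    hic_dir=$1
    n_r1=$(find "$hic_dir" -maxdepth 1 -type f -name '*1.fastq' | wc -l | tr -d ' ')
    n_r2=$(find "$hic_dir" -maxdepth 1 -type f -name '*2.fastq' | wc -l | tr -d ' ')
    if [ "$n_r1" -eq 0 ] || [ "$n_r1" -ne "$n_r2" ]; then
        echo "Hi-C check failed in $hic_dir: $n_r1 file(s) match *1.fastq, $n_r2 match *2.fastq" >&2
        return 1
    fi
    echo "found $n_r1 Hi-C pair(s) in $hic_dir"
}
```

For the example above this would be called as check_hic_pairs data/hic. Compressed files such as *.fastq.gz do not match these patterns and are therefore not counted.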

Results

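This pipeline produces phased assemblies as sample_output/prefix/assemble/sample-H?.fasta, where sample and prefix appear to be the values given to --sample and --prefix (the test run above writes test_output/out/assemble/test-H?.fasta).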
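
To get a first impression of the two haplotype assemblies, one can count sequences and bases per FASTA file. This is a minimal sketch using awk; fasta_stats is a hypothetical helper name, not a DipAsm command.

```shell
# Hypothetical helper (not part of DipAsm).
# Prints "<number of sequences> <total bases>" for one FASTA file.
# Header lines start with ">"; every other line is counted as sequence.
fasta_stats() {
    awk '/^>/ { n++; next }
         { bases += length($0) }
         END { printf "%d %.0f\n", n, bases }' "$1"
}
```

For the test run, fasta_stats could be applied to each file matching test_output/out/assemble/test-H?.fasta. Two haplotypes with very different totals may deserve a closer look.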

Acknowledgements

DipAsm depends on Peregrine, 3d-dna, minimap2, DeepVariant, whatshap and hapcut2.

dipasm's People

Contributors

cschin, lh3, mvfki, shilpagarg

dipasm's Issues

gzip: stdout: Broken pipe

hi,

When I run the following command,

/public/home/jenny/software/pstools hic_mapping_unitig -t 60 JT_hifiasm.asm.r_utg.fa <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1-fastped_R1.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1-fastped_R1.fq.gz) <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1-fastped_R2.fq.gz /public/home/jennyu/01data/03hic/clean_data/02small_clean/A1-fastped_R2.fq.gz)

the following errors occurred. I'm working on the first one, but I don't understand the second. The second error looks like a problem that occurred while reading from the pipe, but I'm not sure.

/public/home/jenny/software/pstools: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /public/home/jenny/software/pstools)
/public/home/jenny/software/pstools: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /public/home/jenny/software/pstools)

gzip: stdout: Broken pipe

gzip: stdout: Broken pipe

I then tried modifying the command as follows, but it still reported the same second error.

/public/home/jenny/software/pstools hic_mapping_unitig -t 60 JT_hifiasm.asm.r_utg.fa <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R1.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R1.fq.gz) <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R2.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R2.fq.gz)

and

/public/home/jenny/software/pstools hic_mapping_unitig -t 60 JT_hifiasm.asm.r_utg.fa <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R1.fq.gz) <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R2.fq.gz)

It works when I run zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R1.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R1.fq.gz directly in the terminal.

When I run the commands directly in the terminal (on both the management node and the subnodes), it works and I only get the first error.

/public/home/jenny/software/pstools hic_mapping_unitig -t 1 JT_hifiasm.asm.r_utg.fa <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R1.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R1.fq.gz)  <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R2.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R2.fq.gz)

The error is as follows (just the first one):

/public/home/jenny/software/pstools: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /public/home/jenny/software/pstools)
/public/home/jenny/software/pstools: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /public/home/jenny/software/pstools)

But when I put the command above in a script and ran it, I got an error.

$ sh pstools4.sh
pstools4.sh: line 1: syntax error near unexpected token `('
pstools4.sh: line 1: `/public/home/jenny/software/pstools hic_mapping_unitig -t 1 JT_hifiasm.asm.r_utg.fa <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R1.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R1.fq.gz)  <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R2.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R2.fq.gz)'

I'm at my wit's end.

GZIP version information

gzip 1.5
Copyright (C) 2007, 2010, 2011 Free Software Foundation, Inc.
Copyright (C) 1993 Jean-loup Gailly.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Thanks for any advice,
Best regards,
Jenny
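
The syntax error reported for pstools4.sh above is what one would expect when a script containing <(...) is started with sh. Process substitution is a feature of bash (and of ksh and zsh), not of POSIX sh, so sh may reject it even where an interactive bash session accepts the same line. The sketch below illustrates the difference; run_with is a hypothetical helper, and the snippet is passed as a string so that the example itself stays valid under any POSIX shell.

```shell
# Hypothetical demo helper: run a one-line snippet that uses process
# substitution under the interpreter given as the first argument.
run_with() {
    "$1" -c 'cat <(printf "%s\n" "process substitution works")'
}

# run_with bash   # prints: process substitution works
# run_with sh     # may stop with a syntax error near `(' if sh is not
#                 # bash, or is bash running in POSIX mode
```

For the script in question this suggests starting it as bash pstools4.sh, or putting #!/bin/bash on its first line and executing it directly. This only addresses the syntax error, not the libstdc++ version messages.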

The result of the run is returned, but "scaffold_connection.txt" is 0 in the result file

Dear Shilpagarg,

Thanks for the instructions. The run finally finished, and there is no error in the log file.

  1. One of the output files, "scaffold_connection.txt", has a size of 0 and contains nothing. I'm not sure whether this is a problem.
  2. I'm not sure what each file means. Are "pred_hap1" and "pred_hap2" the two sets of haplotypes? If so, why are they so different? Is that normal? Is it because the heterozygosity (approximately 1.3%) is not high enough?
  3. Most importantly, I'm not sure how to validate, evaluate or improve these two haplotypes.

8 files were generated by the run, as shown below (I added the "1pst_" tag when I renamed the files):

-rw-r--r-- 1 jenny JJ  20G Jan 29 23:27 1pst_hic_name_connection.output
-rw-r--r-- 1 jenny JJ 877M Jan 30 01:30 1pst_pred_broken_nodes.fa
-rw-r--r-- 1 jenny JJ 805M Jan 30 06:15 1pst_pred_hap1.fa
-rw-r--r-- 1 jenny JJ  81M Jan 30 06:15 1pst_pred_hap2.fa
-rw-r--r-- 1 jenny JJ 1.6G Jan 30 01:30 1pst_pred_haplotypes.fa
-rw-r--r-- 1 jenny JJ  58M Jan 30 06:15 1pst_scaff_connections.txt
-rw-r--r-- 1 jenny JJ   0 Jan 30 01:30 1pst_scaffold_connection.txt
-rw-r--r-- 1 jenny JJ  515 Jan 30 06:15 1pst_scaffold_result_reordered.txt

Thanks for any advice,
Best regards,
Jenny

Pstools pre_hap1 or pre_hap2

I ran into a problem using pstools. In the pre_hap1.fa file, scaffold0l_hap1 and scaffold1l_hap2 are very long, longer than the expected true chromosome length (<200 Mb). In the pre_hap2.fa file, however, scaffold0l_hap1/scaffold1l_hap2 are split into several smaller scaffolds. The results are shown in the two attached images. I don't know why this happens. Could you help me? Thanks.

Fast and accurate chromosome-scale phased assembly

Dear users,

Sorry for the late reply. DipAsm is a proof-of-concept tool. With user-friendliness in mind, I recommend the following instructions for fast and accurate chromosome-scale phased assembly using HiFi and Hi-C:

  1. Produce hifiasm r_utg graph. Binary: https://pstools.s3.us-east-2.amazonaws.com/hifiasm
  2. Pstools binary: https://pstools.s3.us-east-2.amazonaws.com/pstools_1.
./hifiasm -o pgp1.asm -t60 ../data/pgp1/hifi/*.fastq.gz
awk '/^S/{print ">"$2;print $3}' pgp1.asm.r_utg.gfa > pgp1.asm.r_utg.fa
./pstools hic_mapping_unitig -t64 pgp1.asm.r_utg.fa <(zcat ../data/PGP1/hic/SRR8310069_1.fastq.gz ../data/PGP1/hic/SRR8310062_1.fastq.gz) <(zcat ../data/PGP1/hic/SRR8310069_2.fastq.gz ../data/PGP1/hic/SRR8310062_2.fastq.gz)
./pstools resolve_haplotypes -t64 hic_name_connection.output pgp1.asm.r_utg.gfa ./
./pstools hic_mapping_haplo -t64 pred_haplotypes.fa <(zcat ../data/PGP1/hic/SRR8310069_1.fastq.gz ../data/PGP1/hic/SRR8310062_1.fastq.gz) <(zcat ../data/PGP1/hic/SRR8310069_2.fastq.gz ../data/PGP1/hic/SRR8310062_2.fastq.gz) -o scaff_connections.txt
./pstools haplotype_scaffold -t64 scaff_connections.txt pred_haplotypes.fa ./

Please update data files as per your use case.

In theory, it should work for plants. Please let me know if you have any questions.

Look forward to working with you.
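
The awk step in the instructions above turns every segment line of the GFA file (lines starting with S, followed by a name and a sequence) into a FASTA record. A small sketch of the same awk program wrapped in a function; gfa_to_fasta is a hypothetical name introduced here, not a pstools or hifiasm command.

```shell
# Hypothetical wrapper around the awk program from the instructions.
# For each GFA segment line "S <name> <sequence> ...", print a FASTA
# header with the name, then the sequence. Other line types are skipped.
gfa_to_fasta() {
    awk '/^S/ { print ">" $2; print $3 }' "$1"
}
```

With the file names from the instructions this would be gfa_to_fasta pgp1.asm.r_utg.gfa > pgp1.asm.r_utg.fa. Comparing the number of S lines in the GFA with the number of > lines in the FASTA is a cheap way to confirm that the conversion kept every unitig.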

memory: core dumped at haplotype_scaffold

Hi Shilpa,

I'm running hifiasm/pstools as in #16 on an ~100Mb genome, expected to be mostly haploid. I'm assuming this shouldn't be a major issue? I don't really trust the base-level results of short-read HiC assembly/scaffolding on HiFi tigs, and I'm hoping DipAsm will do a better job of it.

I get through with some minor (I assume) complaints (a few ERROR: key not in position table during hic_mapping_haplo, and various rm errors during resolve_haplotypes), but then a core dump during the haplotype_scaffold stage. There are 56 utgs for each of hap1 and hap2 in pred_haplotypes.fa, each ~250Mb. Any thoughts?

Here's that log:
start main
All above 5M: 13
All above 1.5M: 44
Update best buddy score.
Get potential connections 4.
Insert connections.
Save graphs and scores.
Nodes in graph: 2.
Left edges: 376.
Update best buddy score.
Get potential connections 4.
Insert connections.
Save graphs and scores.
Nodes in graph: 4.
Left edges: 184.
Update best buddy score.
Update best buddy score.
Get potential connections 4.
Insert connections.
Save graphs and scores.
Nodes in graph: 5.
Left edges: 304.
Update best buddy score.
Update best buddy score.
Finish get first scaffolds.
free(): invalid pointer

CommandNotFoundError

Hi,

Thanks for this useful tool. I ran into a problem when running "bash test.sh | bash"; the error is "CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'". Could you please tell me how to solve this issue?

Scaffolded by 3D-DNA or HiRise in unphased assembly

Dear author,
The paper "Accurate chromosome-scale haplotype-resolved assembly of human genomes" describes a step that groups and orders contigs into scaffolds with Hi-C data using HiRise/3D-DNA, between "Assemble HiFi reads into unphased contigs using Peregrine" and "map HiFi reads to scaffolds and call heterozygous SNPs using DeepVariant". But I noticed that pipeline.sh (https://github.com/shilpagarg/DipAsm/blob/master/pipeline.sh) skips this step. Could you tell me whether something is wrong or whether I have misunderstood? Looking forward to your reply.

Applying to contigs or chromosome-level scaffolds

Hi,

Thank you for sharing this tool with the community. After reading the manual and issues, I would like to use DipAsm for my genome projects.

FYI,
Input data: PacBio (Sequel) with ~200X coverage (raw read N50 = ~14Kb)
Assembled contigs: ~1,000 contigs (N50 = ~3Mb)
Chromosome-level scaffolds: 20 Chromosomes with Pore-C data (PromethION)
Trio data: Not available

Given the circumstances, I would like to ask for your professional opinion on which stage is the better input for DipAsm: 1) "Assembled contigs" or 2) "Chromosome-level scaffolds"?

Thank you in advance!

what's the recommended memory to run DipAsm?

The server has 256 GB of memory, but when I run this pipeline, some threads report an out-of-memory error at the 2-overlap step (while shmr_overlap is running). So I have two questions:
1. How much memory is required?
2. When some threads report an out-of-memory error but other threads are still running, are the final results correct?
Thanks.

install problem

I used these commands to install DipAsm:

mkdir -p dipasm
cd dipasm
git clone https://github.com/shilpagarg/DipAsm.git
cd DipAsm/docker
docker build -t dipasm .
cd ../../..
docker run -it --rm -v $PWD/dipasm/DipAsm:/wd/dipasm/DipAsm/ -e HOSTWD=$PWD/dipasm/DipAsm -v /var/run/docker.sock:/var/run/docker.sock dipasm:latest /bin/bash

But this was the output:

$ docker build -t dipasm .
Sending build context to Docker daemon   7.68kB
Step 1/22 : FROM continuumio/miniconda3
Head "https://registry-1.docker.io/v2/continuumio/miniconda3/manifests/latest": Get "https://auth.docker.io/token?scope=repository%3Acontinuumio%2Fminiconda3%3Apull&service=registry.docker.io": net/http: TLS handshake timeout

$docker build -t dipasm .
Sending build context to Docker daemon   7.68kB
Step 1/22 : FROM continuumio/miniconda3
Head "https://registry-1.docker.io/v2/continuumio/miniconda3/manifests/latest": Get "https://auth.docker.io/token?scope=repository%3Acontinuumio%2Fminiconda3%3Apull&service=registry.docker.io": net/http: TLS handshake timeout

I am quite unfamiliar with Docker and don't know how to solve this.

permission denied of installment

Hi, I am trying to install DipAsm, but I got stuck at 'docker build -t dipasm .'. The error message is: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.40/build?buildargs=%7B%7D&cachefrom=%5B%5D&cgroupparent=&cpuperiod=0&cpuquota=0&cpusetcpus=&cpusetmems=&cpushares=0&dockerfile=Dockerfile&labels=%7B%7D&memory=0&memswap=0&networkmode=default&rm=1&session=zkolg6f11j9mkz4i03w0kefmb&shmsize=0&t=dipasm&target=&ulimits=null&version=1: dial unix /var/run/docker.sock: connect: permission denied.
Thanks for the help!
Best wishes,
Haidong

error: ".FALSE" ?

Hi,
The following was printed to the screen when I ran the command python pipeline.py --hic-path /public/home/jenny/data/hic/ --pb-path /public/home/jenny/data/ccs/ --sample nl --prefix dip_asm

./pipeline.sh /public/home/jenny/data/hic/ /public/home/jenny/data/ccs/ nl FALSE dip_asm

I'm not sure if this is an error message. I have checked the Hi-C and HiFi data: both are FASTQ files and have been decompressed, but there are other files under these paths, such as the files from before decompression. I can't think of what causes this.

Thanks for any advice,
Best regards,
Jenny

Quality control for hifi and hic data

For the HiFi data, I only ran the CCS software with a parameter keeping reads with a quality value greater than 0.99; the Hi-C data is not filtered. Will this processing of the raw data affect the assembly result? What do you recommend for quality control of Hi-C and HiFi data? Thanks.

Installing using singularity

I would like to install DipAsm; however, I cannot use Docker directly on our servers, so we have to use the Dockerfile through Singularity. I would appreciate any instructions that make it possible to run the software with Singularity.

DipAsm pipeline

Dear Shilpa Garg,
The paper "Chromosome-scale, haplotype-resolved assembly of human genomes" describes a step that groups and orders contigs into scaffolds with Hi-C data using 3D-DNA:
“We mapped Hi-C reads to contigs with BWA-MEM v.0.7.17 and scaffolded the Peregrine contigs with juicer v.1.5 and 3D-DNA v.180922. We preprocessed data with ‘juicer.sh -d juicer -p chrom.sizes -y cut-sites.txt -z contigs.fa -D’, where file ‘cut-sites.txt’ was generated using the generate_site_positions_Arima.py script, which outputs merged_nodups.txt. The scaffolds were produced with ‘run-asm-pipeline.sh -m haploid contigs.fa merged_nodups.txt’. We then called small variants using DeepVariant v.0.8.0 with the pretrained PacBio model”
But I noticed that pipeline.sh
(https://github.com/shilpagarg/DipAsm/pipeline.sh) skips this step.
Could you tell me whether something is wrong or whether I have misunderstood? Looking forward to your reply.

DipAsm

Dear author,
On the server, I installed DipAsm through Docker. I found that it needs three programs for this process, including 3D-DNA and hapsCut. However, one of them, DeepVariant, could not be installed (I installed it through Docker). The error is shown in the attached image.

What should I do to solve the problem?
Looking forward to your reply.

divergent genome size between two haplotypes after using pstools

Hi, Shilpagarg

My case is similar to #18, but even after I concatenate broken_nodes.fa onto hap2, there is still a big difference in genome size between the two haplotypes.

 62M -rw-r--r-- 1 root     root   62M Feb  2 09:52 pred_hap2.fa
756M -rw-r--r-- 1 root     root  756M Feb  2 09:52 pred_hap1.fa
4.0K -rw-r--r-- 1 root     root  1.4K Feb  2 09:52 scaffold_result_reordered.txt
2.1M -rw-r--r-- 1 root     root  2.1M Feb  2 01:56 scaff_connections.txt
   0 -rw-r--r-- 1 root     root     0 Feb  1 18:37 scaffold_connection.txt
137M -rw-r--r-- 1 root     root  137M Feb  1 18:37 pred_broken_nodes.fa
1.5G -rw-r--r-- 1 root     root  1.5G Feb  1 18:37 pred_haplotypes.fa
 41G -rw-r--r-- 1 root     root   41G Feb  1 15:19 hic_name_connection.output

I concatenated pred_broken_nodes.fa and pred_hap2.fa. The resulting hap2 is only 199 Mb, but hap1 is 756 Mb. My expected genome size is about 500 Mb.

pstools_1 "Segmentation fault (core dumped)"

Hi, when I ran pstools_1 as in #16 (./pstools_1 haplotype_scaffold -t 15 scaff_connections.txt pred_haplotypes.fa /public/home/pxxiao/practice_self/mjw), this error occurred:

start main
All above 5M: 0
All above 1.5M: 0
Segmentation fault (core dumped)

How can I solve this problem?
Thanks.

Can Dipasm be used for plant genomes?

I have read a paper about this tool, "Accurate chromosome-scale haplotype-resolved assembly of human genomes", but I'm not sure whether the tool can be used for a plant genome that I assembled with hifiasm. In addition, I only want to use this tool to phase my genome. How can I do this? Thanks a million!

can DipAsm use on singularity?

Hello,
Can DipAsm be run with Singularity?
I can't use sudo to run Docker, so could you add a workflow describing how to run DipAsm with Singularity?
Also, can the Hi-C input be .fq.gz?
