shilpagarg / DipAsm
License: MIT License
Dear Shilpa Garg:
The paper "Chromosome-scale, haplotype-resolved assembly of human genomes" describes a step that groups and orders contigs into scaffolds with Hi-C data using 3D-DNA:
“We mapped Hi-C reads to contigs with BWA-MEM v.0.7.17 and scaffolded the Peregrine contigs with juicer v.1.5 and 3D-DNA v.180922. We preprocessed data with ‘juicer.sh -d juicer -p chrom.sizes -y cut-sites.txt -z contigs.fa -D’, where file ‘cut-sites.txt’ was generated using the generate_site_positions_Arima.py script, which outputs merged_nodups.txt. The scaffolds were produced with ‘run-asm-pipeline.sh -m haploid contigs.fa merged_nodups.txt’. We then called small variants using DeepVariant v.0.8.0 with the pretrained PacBio model”
But I noticed that pipeline.sh (https://github.com/shilpagarg/DipAsm/pipeline.sh) skips this step. Could you tell me whether something is wrong, or whether I have misunderstood? Looking forward to your reply.
I have read a paper about this tool, "Accurate chromosome-scale haplotype-resolved assembly of human genomes". But I'm not sure whether this tool can be used for a plant genome, if I assembled the genome with hifiasm. In addition, I just want to use this tool to phase my genome; how do I do that? Thanks a million!
I would like to install DipAsm; however, it is not possible for me to use Docker on our servers directly, so we have to run the Docker image through Singularity. I would appreciate it if you could provide any instructions that make it possible to run the software with Singularity.
Dear Shilpagarg,
Thanks for the instructions. The run finally went through, and there is no error report in the log file.
Eight files were generated by the run, as shown below (the "1pst_" prefix was added when I renamed the files):
-rw-r--r-- 1 jenny JJ 20G Jan 29 23:27 1pst_hic_name_connection.output
-rw-r--r-- 1 jenny JJ 877M Jan 30 01:30 1pst_pred_broken_nodes.fa
-rw-r--r-- 1 jenny JJ 805M Jan 30 06:15 1pst_pred_hap1.fa
-rw-r--r-- 1 jenny JJ 81M Jan 30 06:15 1pst_pred_hap2.fa
-rw-r--r-- 1 jenny JJ 1.6G Jan 30 01:30 1pst_pred_haplotypes.fa
-rw-r--r-- 1 jenny JJ 58M Jan 30 06:15 1pst_scaff_connections.txt
-rw-r--r-- 1 jenny JJ 0 Jan 30 01:30 1pst_scaffold_connection.txt
-rw-r--r-- 1 jenny JJ 515 Jan 30 06:15 1pst_scaffold_result_reordered.txt
Thanks for any advice,
Best regards,
Jenny
Hi,
Thanks for this useful tool. I ran into a problem when I ran "bash test.sh | bash"; the error is "CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'". Could you please tell me how to solve this issue?
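For background, `conda activate` is a shell function defined by conda's init script rather than a standalone executable, so a fresh non-interactive shell (such as the one spawned by `bash test.sh | bash`) cannot find it until that script is sourced — typically via `conda init bash` once, or `source "$(conda info --base)/etc/profile.d/conda.sh"` at the top of the script. A minimal sketch of the mechanism, using a stand-in init script (`fake_conda_init.sh` is illustrative, not part of conda):

```shell
# 'conda activate' is a shell *function* defined by conda's init script, not a
# binary on PATH, so a non-interactive shell must source that script first.
# We simulate the init script with a stand-in that defines a conda() function:
cat > fake_conda_init.sh <<'EOF'
conda() { echo "activated: $2"; }
EOF

source fake_conda_init.sh    # the role conda.sh plays for the real command
conda activate myenv         # resolves to the function only after sourcing
```

The same pattern explains why the real fix is sourcing conda's profile script inside the pipeline script rather than relying on the interactive shell's configuration.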
Hello,
Can DipAsm be run with Singularity? I can't use sudo to run Docker, so perhaps you could add a workflow describing how to run DipAsm with Singularity?
Also, can the Hi-C input be .fq.gz?
The server has 256 GB of memory, but when I run this pipeline at the 2-overlap step (while shmr_overlap is running), some threads report out-of-memory errors. So I have two questions:
1. How much memory is required?
2. When some threads report out-of-memory errors but other threads are still running, are the final results correct?
Thanks.
Dear author
The paper "Accurate chromosome-scale haplotype-resolved assembly of human genomes" describes a step that groups and orders contigs into scaffolds with Hi-C data using HiRise/3D, between "Assemble HiFi reads into unphased contigs using Peregrine" and "map HiFi reads to scaffolds and call heterozygous SNPs using DeepVariant". But I noticed that pipeline.sh (https://github.com/shilpagarg/DipAsm/blob/master/pipeline.sh) skips this step. Could you tell me whether something is wrong, or whether I have misunderstood? Looking forward to your reply.
Hi Shilpa,
I'm running hifiasm/pstools as in #16 on an ~100 Mb genome that is expected to be mostly haploid. I'm assuming this shouldn't be a major issue? I don't really trust the base-level results of short-read Hi-C assembly/scaffolding on HiFi tigs, and I'm hoping DipAsm will do a better job of it.
I get through with some minor (I assume) complaints (a few "ERROR: key not in position table" during hic_mapping_haplo, and various rm errors during resolve_haplotypes), but then a core dump during the haplotype_scaffold stage. There are 56 utgs for each of hap1 and hap2 in pred_haplotypes.fa, each ~250 Mb. Any thoughts?
Here's that log:
start main
All above 5M: 13
All above 1.5M: 44
Update best buddy score.
Get potential connections 4.
Insert connections.
Save graphs and scores.
Nodes in graph: 2.
Left edges: 376.
Update best buddy score.
Get potential connections 4.
Insert connections.
Save graphs and scores.
Nodes in graph: 4.
Left edges: 184.
Update best buddy score.
Update best buddy score.
Get potential connections 4.
Insert connections.
Save graphs and scores.
Nodes in graph: 5.
Left edges: 304.
Update best buddy score.
Update best buddy score.
Finish get first scaffolds.
free(): invalid pointer
Could I use this command to install the Docker environment?
docker pull luhancheng/dipasm
Is there a difference between these two ways? It is hard for me to install through your instructions.
Thank you!
Dear author
On the server, I installed DipAsm through Docker. I found that this process needs three pieces of software, among them 3D-DNA and hapsCut. However, one piece of software, DeepVariant, could not be installed (I installed it through Docker); its error is as follows.
What should I do to solve the problem?
Looking forward to your reply.
Hi,
Thank you for sharing this tool with the community. After reading the manual and issues, I would like to use DipAsm for my genome projects.
FYI,
Input data: PacBio (Sequel) with ~200X coverage (raw read N50 = ~14Kb)
Assembled contigs: ~1,000 contigs (N50 = ~3Mb)
Chromosome-level scaffolds: 20 Chromosomes with Pore-C data (PromethION)
Trio data: Not available
Given the circumstances, I would like to ask your professional opinion: at which stage would it be better to apply DipAsm, 1) "Assembled contigs" or 2) "Chromosome-level scaffolds"?
Thank you in advance!
I used these commands to install DipAsm:
mkdir -p dipasm
cd dipasm
git clone https://github.com/shilpagarg/DipAsm.git
cd DipAsm/docker
docker build -t dipasm .
cd ../../..
docker run -it --rm -v $PWD/dipasm/DipAsm:/wd/dipasm/DipAsm/ -e HOSTWD=$PWD/dipasm/DipAsm -v /var/run/docker.sock:/var/run/docker.sock dipasm:latest /bin/bash
But the output:
$ docker build -t dipasm .
Sending build context to Docker daemon 7.68kB
Step 1/22 : FROM continuumio/miniconda3
Head "https://registry-1.docker.io/v2/continuumio/miniconda3/manifests/latest": Get "https://auth.docker.io/token?scope=repository%3Acontinuumio%2Fminiconda3%3Apull&service=registry.docker.io": net/http: TLS handshake timeout
$docker build -t dipasm .
Sending build context to Docker daemon 7.68kB
Step 1/22 : FROM continuumio/miniconda3
Head "https://registry-1.docker.io/v2/continuumio/miniconda3/manifests/latest": Get "https://auth.docker.io/token?scope=repository%3Acontinuumio%2Fminiconda3%3Apull&service=registry.docker.io": net/http: TLS handshake timeout
I am quite unfamiliar with Docker and don't know how to solve this.
Hi, I am trying to install DipAsm, but I got stuck when running 'docker build -t dipasm .'. The error message is: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.40/build?buildargs=%7B%7D&cachefrom=%5B%5D&cgroupparent=&cpuperiod=0&cpuquota=0&cpusetcpus=&cpusetmems=&cpushares=0&dockerfile=Dockerfile&labels=%7B%7D&memory=0&memswap=0&networkmode=default&rm=1&session=zkolg6f11j9mkz4i03w0kefmb&shmsize=0&t=dipasm&target=&ulimits=null&version=1: dial unix /var/run/docker.sock: connect: permission denied.
Thanks for the help!!
Best wishes,
Haidong
For the HiFi data, I only ran the CCS software with the read QV > 0.99 parameter, but the Hi-C data were not filtered. I don't know whether this processing of the raw data will affect the assembly result. What do you recommend for Hi-C and HiFi data quality control? Thanks~
Hi, when I ran pstools_1 as in #16:
./pstools_1 haplotype_scaffold -t 15 scaff_connections.txt pred_haplotypes.fa /public/home/pxxiao/practice_self/mjw
this error occurred:
start main
All above 5M: 0
All above 1.5M: 0
Segmentation fault (core dumped)
How can I solve this problem?
Thanks.
hi,
When I run the following command,
/public/home/jenny/software/pstools hic_mapping_unitig -t 60 JT_hifiasm.asm.r_utg.fa <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1-fastped_R1.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1-fastped_R1.fq.gz) <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1-fastped_R2.fq.gz /public/home/jennyu/01data/03hic/clean_data/02small_clean/A1-fastped_R2.fq.gz)
the following errors occurred. I'm working on the first one, but I don't understand the second. The second error looks like a problem that occurred when reading from the pipe, but I'm not sure.
/public/home/jenny/software/pstools: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /public/home/jenny/software/pstools)
/public/home/jenny/software/pstools: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /public/home/jenny/software/pstools)
gzip: stdout: Broken pipe
gzip: stdout: Broken pipe
I then tried to modify the command (as follows), but it still reported the same second error.
/public/home/jenny/software/pstools hic_mapping_unitig -t 60 JT_hifiasm.asm.r_utg.fa <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R1.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R1.fq.gz) <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R2.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R2.fq.gz)
and
/public/home/jenny/software/pstools hic_mapping_unitig -t 60 JT_hifiasm.asm.r_utg.fa <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R1.fq.gz) <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R2.fq.gz)
It is OK when I run
zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R1.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R1.fq.gz
directly in the terminal. When I run the command directly in the terminal (on the management node and subnodes), it's OK and I only get the first error.
/public/home/jenny/software/pstools hic_mapping_unitig -t 1 JT_hifiasm.asm.r_utg.fa <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R1.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R1.fq.gz) <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R2.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R2.fq.gz)
The error is as follows (just the first one):
/public/home/jenny/software/pstools: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /public/home/jenny/software/pstools)
/public/home/jenny/software/pstools: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /public/home/jenny/software/pstools)
But when I put the command above in a script and ran it, I got an error:
$ sh pstools4.sh
pstools4.sh: line 1: syntax error near unexpected token `('
pstools4.sh: line 1: `/public/home/jenny/software/pstools hic_mapping_unitig -t 1 JT_hifiasm.asm.r_utg.fa <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R1.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R1.fq.gz) <(zcat /public/home/jenny/01data/03hic/clean_data/01big_clean/A1_fastped.R2.fq.gz /public/home/jenny/01data/03hic/clean_data/02small_clean/A1_fastped.R2.fq.gz)'
I'm at my wit's end.
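An editorial note on the syntax error above: process substitution `<(...)` is a bash/zsh/ksh feature, not POSIX sh, so `sh pstools4.sh` parses the script with a shell that rejects the syntax, while the same line works in an interactive bash session. Running the script with `bash pstools4.sh` should avoid it. A minimal demonstration:

```shell
# Process substitution <(...) only exists in bash-like shells; a POSIX sh
# rejects it with "syntax error near unexpected token `('".
# Invoking the command through bash explicitly works:
bash -c 'cat <(printf "hi\n")'   # prints: hi
```

The same applies to cluster job scripts: if the scheduler runs them with sh by default, add a `#!/bin/bash` shebang or submit them to bash explicitly.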
gzip version information:
gzip 1.5
Copyright (C) 2007, 2010, 2011 Free Software Foundation, Inc.
Copyright (C) 1993 Jean-loup Gailly.
This is free software. You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.
Thanks for any advice,
Best regards,
Jenny
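An editorial aside on the second error reported above: `gzip: stdout: Broken pipe` typically just means the downstream reader exited before consuming all of zcat's output — here, likely because pstools died immediately on the libstdc++ errors, closing the pipe. A minimal reproduction (`big.gz` is a throwaway test file, not from the pipeline):

```shell
# When the reader side of a pipe exits early, the writer gets SIGPIPE and
# gzip/zcat may report "stdout: Broken pipe". Reproduced by making `head`
# quit after one line while zcat still has data buffered:
seq 1000000 | gzip -c > big.gz
zcat big.gz | head -n 1   # prints: 1   (zcat may also warn: stdout: Broken pipe)
```

In other words, the broken-pipe messages are a symptom of the first error, not an independent problem.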
Hi, Shilpagarg
My case is quite similar to #18, but after I concatenated broken_nodes.fa into hap2, there is still a big difference in genome size between the two haplotypes.
62M -rw-r--r-- 1 root root 62M Feb 2 09:52 pred_hap2.fa
756M -rw-r--r-- 1 root root 756M Feb 2 09:52 pred_hap1.fa
4.0K -rw-r--r-- 1 root root 1.4K Feb 2 09:52 scaffold_result_reordered.txt
2.1M -rw-r--r-- 1 root root 2.1M Feb 2 01:56 scaff_connections.txt
0 -rw-r--r-- 1 root root 0 Feb 1 18:37 scaffold_connection.txt
137M -rw-r--r-- 1 root root 137M Feb 1 18:37 pred_broken_nodes.fa
1.5G -rw-r--r-- 1 root root 1.5G Feb 1 18:37 pred_haplotypes.fa
41G -rw-r--r-- 1 root root 41G Feb 1 15:19 hic_name_connection.output
I concatenated pred_broken_nodes.fa and pred_hap2.fa together. hap2 is only 199 Mb but hap1 is 756 Mb. Besides, my expected genome size is about 500 Mb.
hi,
The following was printed when I ran the command python pipeline.py --hic-path /public/home/jenny/data/hic/ --pb-path /public/home/jenny/data/ccs/ --sample nl --prefix dip_asm:
./pipeline.sh /public/home/jenny/data/hic/ /public/home/jenny/data/ccs/ nl FALSE dip_asm
I'm not sure whether this is an error message. I have checked the Hi-C and HiFi data; both are FASTQ files and have been decompressed, but there are other files under the path, such as the files from before decompression. I can't think of the cause of this problem.
Thanks for any advice,
Best regards,
Jenny
Dear users,
Sorry for the late reply. DipAsm is a proof-of-concept tool. With user-friendliness in mind, I recommend the following instructions for fast and accurate chromosome-scale phased assembly using HiFi and Hi-C:
Build the r_utg graph with hifiasm. Binaries:
hifiasm: https://pstools.s3.us-east-2.amazonaws.com/hifiasm
pstools: https://pstools.s3.us-east-2.amazonaws.com/pstools_1
../hifiasm -o pgp1.asm -t60 ../data/pgp1/hifi/*.fastq.gz
awk '/^S/{print ">"$2;print $3}' pgp1.asm.r_utg.gfa > pgp1.asm.r_utg.fa;
./pstools hic_mapping_unitig -t64 pgp1.asm.r_utg.fa <(zcat ../data/PGP1/hic/SRR8310069_1.fastq.gz ../data/PGP1/hic/SRR8310062_1.fastq.gz) <(zcat ../data/PGP1/hic/SRR8310069_2.fastq.gz ../data/PGP1/hic/SRR8310062_2.fastq.gz)
./pstools resolve_haplotypes -t64 hic_name_connection.output pgp1.asm.r_utg.gfa ./
./pstools hic_mapping_haplo -t64 pred_haplotypes.fa <(zcat ../data/PGP1/hic/SRR8310069_1.fastq.gz ../data/PGP1/hic/SRR8310062_1.fastq.gz) <(zcat ../data/PGP1/hic/SRR8310069_2.fastq.gz ../data/PGP1/hic/SRR8310062_2.fastq.gz) -o scaff_connections.txt
./pstools haplotype_scaffold -t64 scaff_connections.txt pred_haplotypes.fa ./
Please update data files as per your use case.
In theory, it should work for plants. Please let me know if you have any questions.
Looking forward to working with you.
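As a sanity check on the GFA-to-FASTA step in the instructions above, the `awk` one-liner simply turns GFA segment (`S`) records into FASTA records: field 2 is the segment name and field 3 its sequence, while header (`H`) and link (`L`) lines are ignored. A toy example (`toy.gfa` and the utg name are made up):

```shell
# toy GFA with one header (H), one segment (S), and one link (L) record
printf 'H\tVN:Z:1.0\nS\tutg000001l\tACGTACGT\nL\tutg000001l\t+\tutg000002l\t-\t0M\n' > toy.gfa

# keep only S records: ">" + segment name, then the sequence
awk '/^S/{print ">"$2; print $3}' toy.gfa
# prints:
# >utg000001l
# ACGTACGT
```

Note that real hifiasm S-lines carry extra tags (read counts, lengths) after the sequence field, but awk's positional fields make those harmless here.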
I ran into a problem using pstools: in the pred_hap1.fa file, the lengths of scaffold0l_hap1 and scaffold1l_hap2 are very long, longer than the expected true chromosome length (<200 Mb). However, the pred_hap2.fa file splits scaffold0l_hap1/scaffold1l_hap2 into smaller scaffolds. Those results are shown below. I don't know why; could you help me? Thanks~