dfguan / asset Goto Github PK

assembly evaluation tool

Python 8.39% C 90.83% Makefile 0.78%

asset's Introduction

Asset

Assembly evaluation tool

Workflow

Asset can use sequencing data from four platforms (Pacbio, 10X, Bionano, HiC) to accumulate the support evidence for a de novo assembly. You can use Pipeline Guide to build your own pipeline.

Dependencies

C Library

zlib

Third Party Tools

minimap2
bwa
samtools
RefAligner
runner (optional, only necessary when you run my pipeline script run_asset.py)

Installation

git clone https://github.com/dfguan/asset.git
cd asset/src && make

There will be bin directory keeping all asset executables under asset after compiling successfully.

Pipeline Guide

Preprocessing

Given an assembly asm, use the following command to preprocess.

bin/detgaps $asm > $output_dir/gaps.bed

Pacbio Processing

Given a pacbio files list pblist and the assembly asm, apply the following command to get Pacbio support regions.

for fl in $pblist
do
	minimap2 -xmap-pb -t 12 $asm $fl > $fl.paf
done

bin/ast_pb $fl1.paf $fl2.paf $fl3.paf ... >$output_dir/pb.bed 2>ast_pb.log

10X Processing

Given a 10x read files list 10xlist (suppose in fastq.gz format) and the assembly asm, use the following command to get 10X support regions.

bwa index $asm
while read -r r1 r2
do
	prefix=`basename $r1 .fastq.gz`
	bin/10x -c -p $prefix $r1 $r2 # generate trimmed read files $prefix_{1,2}.fq.gz
	bwa mem -t12 $asm $prefix_1.fq.gz $prefix_2.fq.gz | samtools view -b - > $prefix.bam
done < $10xlist

bin/ast_10x $output_dir/gaps.bed $bam1 $bam2 $bam3 ... >$output_dir/10x.bed 2>ast_10x.log

10x_trim is available at dfguan/utls.

Bionano Processing

Consensus map (.cmap)

Given a bionano consensus map files list bnlist (suppose in .cmap format) and the assembly asm, use the following command to get Bionano support regions.


solve_dir=/nfs/users/nfs_d/dg30/luster_dg30/dg30/projects/vgp/tools/Solve3.2.1_04122018/
for fl in $bnlist
do
	fn=`basename $fl`
	fn_pref=`echo $fn | cut -d_ -f1`
	tech=`echo $fn | cut -d_ -f2`
	enzyme=`echo $fn | cut -d_ -f3`
	perl $solve_dir/HybridScaffold/04122018/scripts/fa2cmap.pl -n ${enzyme:0:4} -i ref -o $output_dir
	cp $fl $output_dir
	ref_prefix=${asm%.*}
	ref_cmap=$output_dir/fa2cmap/"$ref_prefix"_"$enzyme"_0Kb_0labels.cmap
	key_fn=$output_dir/fa2cmap/"$ref_prefix"_"$enzyme"_0Kb_0labels_key.txt
	query_cmap=$output_dir/$fn
	optn=${tech,,}
	if [ "$enzyme" = "DLE1" ]
	then
		optn="DLE1_"$optn
	fi
   python2 $solve_dir/Pipeline/04122018/runCharacterize.py -t   $solve_dir/RefAligner/7437.7523rel/RefAligner -q $query_cmap -r  $ref_cmap -p $solve_dir/Pipeline/04122018/ -a $solve_dir/RefAligner/7437.7523rel/optArguments_nonhaplotype_"$optn".xml -n 4
	map_path=$output_dir/alignref/${fn%.*} 
	rmap_fn="$map_path"_r.cmap
	qmap_fn="$map_path"_q.cmap
	xmap_fn="$map_path".xmap
	bin/ast_bion $rmap_fn $qmap_fn $xmap_fn $key_fn > $output_dir/bionano_"$tech"_"$enzyme".bed 2>ast_bion_"$tech"_"$enzyme".log
done
fln=`wc -l $bnlist | awk {print $1}`
# only consider one or two bionano files
if [ "$fln" -eq 2 ]
then
	bin/union bionano_*.bed > bn.bed
else
	cp bionano_*.bed bn.bed
fi

Molecular map (.bnx)

Update soon

HiC Processing

Given a HiC files list hiclist (suppose in fastq.gz format) and the assembly asm, use the following command to get Bionano support regions.

bin/split_fa $asm > split.fa
samtools faidx split.fa 
bwa index split.fa
while read -r r1 r2
do
	prefix=`basename $r1 .fq.gz`
	dirn=`dirname $r1`
	bwa mem -SP -B10 -t12 split.fa $r1 $r2 | samtools view -b - > $dirn/$prefix.bam
done < $hiclist
bin/col_conts *.bam > $output_dir/links.mat
bin/ast_hic2 split.fa.fai $output_dir/links.mat >$output_dir/hic2.bed 2>ast_hic.log

Evidence Accumulation

once you got the bed files from Pacbio (pb.bed), 10X (10x.bed), Bionano (bn.bed), HiC (hic.bed), run the following command to get accumulation bed file and detect break points.

bin/acc $output_dir/gaps.bed $output_dir/{pb,bn}.bed $output_dir/bn.bed > $output_dir/pb_bn.bed 
bin/acc $output_dir/gaps.bed $output_dir/{10x,hic2,bn}.bed > $output_dir/10x_hic2_bn.bed

Mis-assemblies Detection

bin/pchlst -c $output_dir/gaps.bed $output_dir/pb_bn.bed > $output_dir/pchlst_ctg.bed
bin/pchlst $output_dir/gaps.bed $output_dir/10x_hic2_bn.bed > $output_dir/pchlst_scaf.bed 
bin/union_brks $output_dir/gaps.bed $output_dir/pchlst_{ctg,scaf}.bed > $output_dir/pchlst_final.bed

Contact

Welcome to use, you can use github webpage to report an issue or email me [email protected] with any advice.

asset's People

Contributors

Stargazers

Watchers

Forkers

kant arangrhie yuzhenpeng pythseq ethering china-acadamy-ioz

asset's Issues

10X Chromium pipeline

Hi Dengfeng,

I am trying to evaluate a 10X Chromium assembly done with SUPERNOVA.

I can get the gap.bed file, but the following never finished:
#####################################################

bwa index Andean12nov19_SUPERNOVA_output_pseudohap2.1.fasta
while read -r r1 r2
do
prefix=basename $r1 .fastq.gz
/home/sgod/GenomicsPrograms/asset/bin/10x -c -p $prefix $r1 $r2 # generate trimmed read files $prefix_{1,2}.fq.gz
bwa mem -t12 Andean12nov19_SUPERNOVA_output_pseudohap2.1.fasta $prefix_1.fq.gz $prefix_2.fq.gz | samtools view -b - > $prefix.bam
done < $10xlist

/home/sgod/GenomicsPrograms/asset/bin/ast_10x gaps.bed $bam1 $bam2 $bam3 ... >10x.bed 2>ast_10x.log

#####################################################
My 10Xlist is a file with the directory and 4 fastq files from the original 10X sequencing (R1 and R2 X 2 lanes)

Is there something simple I am missing in the syntax?

Thank you for your input!
Sher

Having only two sources of data

Hi there,
I generated an assembly using hifi reads along with hi-c data but there were no additional data e.g. 10x or bn. The assembly is of high quality with N50 of 220 Mb and is free of gaps. I tried the following codes to check mis-joins and mis-assemblies but the pchlst_scaf.bed is empty. Accordingly, the output is not reasonable and there are many C in the final results.

./bin/acc gaps.bed pb.bed > pb_acc.bed
./bin/pchlst -c gaps.bed pb_acc.bed > pchlst_ctg.bed

./bin/acc gaps.bed pb.bed hic2.bed > hic_acc.bed
./bin/pchlst -c gaps.bed hic_acc.bed > pchlst_scaf.bed

./bin/union_brks gaps.bed pchlst_{ctg,scaf}.bed > pchlst_final.bed

Can you please see the codes and let me know if there is any mistake.
I have 10x reads from another individual. This individual is related to main sample that was used for assembly. I wonder if it is possible to use sequences from another sample to asses the genome assembly.
Thank you

zlib.h: No such file or directory

I got a error when I have runned make，below this error.

gcc -g -Wall -D VERBOSE -O2 -D PRINT_COVERAGE -c -o paf.o paf.c
paf.c:1:10: fatal error: zlib.h: No such file or directory
#include <zlib.h>
^~~~~~~~
compilation terminated.
make: *** [paf.o] Error 1

my zlib.h path is /usr/include/zlib.h，but i dont know how to fix this problem，i hope you can help me ,thank you!!

bin/col_conts don't find the script

Dear @dfguan,

I don't find the col_conts script in the asset webpage, do you have any clue where should I look at?

Thanks a lot!

Marc

Add descriptions of expected output in .bed files

Hello,

I'm using your tool to evaluate assemblies produced with PacBio HiFi reads, 10x reads, and Omni-C reads.

Can you provide descriptions of the expected output columns in the various bed files?

My final punchlist bed file has only "C"s present in the final column, which I assume means "conflict". I don't have any "S", which I'm assuming means "supported".

If I knew what the upstream file fields are, I can possibly troubleshoot there.

Thanks!
Kim

Output of ast_hic, Unexpected Truncation

Hi Dengfeng, @dfguan

I want to evaluate an assembly (not scaffolded yet) with a HiC data. It seems that using ast_hic I should be able to find the reliable blocks. I used the default parameters for ast_hic

Usage: ast_hic [options] <GAP_BED> <BAM_FILEs>
Options:
         -c    INT      minimum coverage [7]
         -C    INT      maximum coverage [inf]
         -q    INT      minimum alignment quality [0]
         -L    INT      maximum insertion length, gap excluded [15000]
         -h             help

So I expected a block to be present in the output BED file if at least 7 paired reads are spanning over it and those reads are having an insertion length less than 15Kb. I tried to investigate the output BED file for some regions manually. I found some regions that I couldn't explain why they are not present in the BED file.

Here is what I did:

detgaps asm.fa > gaps.bed
ast_hic gaps.bed hic.bam > hic.bed

The assembly and the bam file are attached HiC_asset.zip and you can reproduce the BED file. ( I reduced the whole assembly and bam file to just one contig for simplicity )

The first lines of the BED file is as below:

track name="HC" description="hic data"
h1tg000188l	302	205890
h1tg000188l	206041	206104
h1tg000188l	206177	206215
h1tg000188l	206234	215904

As you can see the locus 205900 is not included so there should be a problem with this locus. But It seems that the reads spanning 205880 (included) and 205900 (not included) are exactly the same.

To find the spanning reads I ran the line below once for pos=205900 and once for pos=205880 :
samtools view hic.bam | awk '{if (($4 < pos) && ($8 > pos)) print $0 }'

For either of them I got the same 7 paired reads (all of them are having insertion lengths less than 15Kb):

A00675:75:H5CNNDSXY:1:1166:22761:14544	81	h1tg000188l	192537	19	10S128M13S	=	206017	13354	AATATCAAAGTGTAATCCCAGCACTTTTGGAGGCCAAGGCAAGAGGATCACAAGGTGAGGAGATCAAGACCATCCTGGCCAATACAGTGAAACGCTGTCACTATTAAAAATCCAAAAAATTAGCTAGGCATGGTGGCAAGGTTCGTCCATC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF::,FFFFFFFFFFFFFFFFF:FFFFFFFFFFFF	NM:i:0	MD:Z:128	MC:Z:39S112M	AS:i:128	XS:i:117	XA:Z:h1tg000132l,-2800693,10S128M13S,1;h1tg000188l,-185735,10S128M13S,1;h1tg000132l,-2807711,10S128M13S,1;h1tg000144l,-2327679,10S128M13S,2;
A00675:75:H5CNNDSXY:2:1308:13982:24298	145	h1tg000188l	197070	0	50M4I1M4D92M3S	=	206108	8893	TACCCAAAATATGTATAATATACTGTACATAAATTATGGAAGTACCCAAAGATTCATAATCAACTGTACATAAAATAACAAAGTACTCAAACTATATATTCTATACAGTACATAAAATATCAAAGTACCCATACTGTATATTATATACTG	FFFFFFFFFF,FFFFFFFFF:FFFFFFFFFFFFF:F,FFFFFFFFFF::FF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:12	MD:Z:38A12^TTTT5A17T52A15	MC:Z:48S66M34S	AS:i:79	XS:i:78
A00675:75:H5CNNDSXY:2:1645:30481:33207	2177	h1tg000188l	199232	0	94H57M	=	206104	6873	CAAAATATGTATAATATACTGTACATAAATTATGAAAGTAACCAAACATTTATAATA	,FFF:,FF,,F,FFF:FFFF,FFFFFFFFF:F:FFFF,FF,FFFFF,:F,:FFFF:F	NM:i:1	MD:Z:40C16	MC:Z:5H88M58H	AS:i:46	XS:i:46	SA:Z:h1tg000006l,78306568,+,71M80S,49,1;
A00675:75:H5CNNDSXY:2:2671:9109:8876	2129	h1tg000188l	200829	0	106H32M13H	=	211484	10625	ACTGTACATAATATATTAAAGTCGCCTAAATA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:32	MC:Z:151M	AS:i:32	XS:i:26	SA:Z:h1tg000157l,793787,+,95S56M,0,0;h1tg000188l,235190,-,55S62M34S,0,1;
A00675:75:H5CNNDSXY:3:2152:14000:15969	113	h1tg000188l	202107	60	82M69S	=	211218	9167	TATGTATTATATATTGTACATAAAATATCAAACTATGCCAAATATATATTATATACTGTACATAAAATATCAAATTACCCAATATGAAGCTTTTATTAAAAATTCTAGGAAGTGAATACTAATCTATAGTGAAAAAGCAGACTAGGAAGAC	FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:82	MC:Z:14S137M	AS:i:82	XS:i:54	SA:Z:h1tg000144l,5760537,+,70M81S,0,0;
A00675:75:H5CNNDSXY:4:1214:15637:20964	2161	h1tg000188l	203773	0	30M118H	=	207353	3674	TACATAAAATATCAAAGTACCCAAACTTCA	FFFFFF,FFFFFFFFFFFFFFFFFFFF:FF	NM:i:0MD:Z:30	MC:Z:28H100M1D1M1I21M	AS:i:30	XS:i:28	SA:Z:h1tg000188l,207450,+,89M59S,0,0;
A00675:75:H5CNNDSXY:4:1618:3486:17018	161	h1tg000188l	204379	43	151M	=	208464	4213	TCAAATATATATATTATTCTGTACATAAAATATCACATTACACCATATATATATTATATACTGTACATAAAATATCAAAGTACCCAAAATATGTATAATATACTGTATATAAATTATGAAAGTACCCAAACCTTTATAATAAAGTGTACAT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFF:FFF:FF:FFFF:F	NM:i:0	MD:Z:151	MC:Z:23S128M	AS:i:151	XS:i:131	XA:Z:h1tg000188l,+199979,131M20S,0;h1tg000132l,+2822475,4S147M,2;

I also found these lines in HC.cov.base indicating a reduction in the coverage (from 7 to 6) at 205890. But I couldn't figure out why? Could you please explain what's happening at 205890 that makes ast_hic truncate the BED file there?

205863	205890	7
205891	205962	6

Thanks in advance,
Mobin

ONT and HiC only

Hi Dengfeng,

My output with pchlst_final.bed file looks strange. Could you take a look at my workflow to ensure I have run it correctly? I've also attached the outputs

MY WORKFLOW WITH ONT AND HiC data only:

/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/detgaps gapclosed.fasta > gaps.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/split_fa gapclosed.fasta > split.fa
samtools faidx split.fa
bwa index split.fa
bwa mem -SP -B10 -t 18 split.fa /nesi/project/landcare02730/HiC_Data/Rewa_trimgalore_output/Rewa_trimgalore_adaptortrim/Rewarewa_HiC_R1_trimmed_val_1.fq /nesi/project/landcare02730/HiC_Data/Rewa_trimgalore_output/Rewa_trimgalore_adaptortrim/Rewarewa_HiC_R2_trimmed_val_2.fq | samtools view -b - > asset_aln.bam
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/col_conts asset_aln.bam > links.mat
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/ast_hic2 split.fa.fai links.mat > hic2.bed 2>ast_hic.log

minimap2 -x map-ont -t 18 *.fasta /nesi/nobackup/uoa02642/Rewa_Assemblies/rewarewa_rebasecalled.fasta > ont.paf
/nesi/nobackup/uoa02642/Rewa_Assemblies/purge_dups/src/pbcstat ont.paf
/nesi/nobackup/uoa02642/Rewa_Assemblies/purge_dups/src/calcuts PB.stat > ont.cutoffs 2>calcults.log
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/ast_pb -m 5 -M 252 ont.paf > ont.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/acc gaps.bed hic2.bed ont.bed > hic2_ont.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/pchlst gaps.bed hic2_ont.bed > pchlst_scaf.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/pchlst -c gaps.bed hic2_ont.bed > pchlst_ctg.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/union_brks gaps.bed pchlst_scaf.bed pchlst_ctg.bed > pchlst_final.bed

Thanks for all the help and patience with this. Very much appreciated.

A
Re__Additional_query.zip

ast_10x fails

Hi,

I got stuck with the 10x steps of the Asset pipeline. All other parts of the pipeline (i.e. identifying gaps and ends, PacBio support, HiC support) work well.

However, when I get to the
$asset/bin/ast_10x -x gaps.bed $prefix.bam > 10x.bed
step, I get the following error message:

Program starts
[M::aa_10x] processing bam file
[M::proc_bam] finish processing 2121039345 bams
[M::proc_bam] read pair counter: 1000860477
[W::aa_10x] none useful information, quit

As I have a (admittingly unnecessary) large number of 10x reads, my $prefix.bam file is 260GB large. Also, what I saw is that in the gaps.bed file, all scaffolds have a gap at the first (0 0) and last position. Is that normal?

Best,
Oliver

pchlst_final.bed Result

in final line ,what`s means of the "S" and "C" ?